Who chooses? (was Re: content, format, ctype, or xtype ?)

Wed May 13 04:31:10 PDT 2009

(completed version of the message I sent in error a few minutes ago)

On Tue, 12 May 2009, Rob Seaman wrote:

> I'm snow-blind with the blizzard of messages today.  Returning to Mark's use
> case from yesterday...
> 
> On May 11, 2009, at 5:48 AM, Mark Taylor wrote:
> 
> > The use case which I have in mind (and I think Doug is thinking along
> > similar lines) is this: a user acquires a VOTable from some source -
> > perhaps TAP, perhaps not.  It contains a column X whose contents
> > is a string in iso-8601 format - this is perhaps identified by
> > utype with part of the STC data model, or with some other data model,
> > or perhaps is not.  The user loads the table into TOPCAT
> > (or some other generic table handling software) and wants to make a
> > plot with column X as one of the axes.
> > 
> > As far as TOPCAT can tell, the column contains a string, and so it is
> > unable to make a plot with it, or otherwise do anything much apart
> > from display the string contents.  If it understood that the column
> > contained a string with the semantics of an iso-8601 date/time,
> > it could make this plot.  Yes it may be possible to glean this
> > information by inspecting the utype, but in order to do that it needs
> > to have an understanding of the data model in question - a lot of work
> > for the developer, and needs to be updated every time a new data model
> > appears or is modified.  Moreover, the additional, probably rather
> > detailed, information supplied by the utype is not relevant for this
> > kind of processing.
> > 
> > You can think of similar stories for 'ctype' (or whatever) values of
> > stc-s, stc-x, sexagesimal, and other possibilities of your own device,
> > including domain-specific ones.  It should not be necessary to invent
> > a data model in order to flag this kind of thing, partly for practical
> > reasons (you need to reach agreement about a data model and update
> > software each time), and partly because use of a data model is orthogonal
> > to this issue.
> 
> ...and subtracting out all the high-falutin' computer science issues from
> today, we see that this is simply a question of whether to flag some value.
> Whether the value is flagged or not, if TOPCAT is to do what Mark's users
> want, then TOPCAT has to be able to parse ISO-8601 datetime strings or
> sexagesimal strings or stc-s strings.  These parsing methods must all be in
> place; the question is how to trigger them and who decides when to do so.
> 
> Requiring an explicit metadata flag (whether expressed as a UCD, utype, ctype,
> xtype, unit or whatever) implies that the data provider (or her minion
> programmers) should be the one selecting how an application like TOPCAT
> chooses to interpret different values.  This, I think, is the real underlying
> issue.  Rather, might it not be asserted that TOPCAT is a power tool belonging
> to the user?
> 
> With a method to parse sexagesimal values - a method that is required in any
> event - isn't it trivial for TOPCAT to activate user controlled plotting
> capabilities for such string valued columns?

Rob,

You're right, you could do it like this.  Yes, it would be trivial
for TOPCAT (if you disregard the documentation, hints, and
defaulting mechanisms that it ought to provide to give the user a
clue how to make such decisions, which in fact are hard work to do
well), but it's more effort for the user.  It's really a matter of
convenience.

For a concrete example of how this affects TOPCAT's users: the plot
windows contain a dropdown list of plottable columns.  This currently
contains all the numeric-valued columns from the table in question,
but omits string-valued ones because TOPCAT doesn't know how to plot
them.  Instead of selecting an item from the list, the user can type
something like "isoToMjd(epoch)", but only if (a) they know which
column contains that data (there could be a lot of columns to look
through in the table) and (b) they've read the manual.
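
To give a rough idea of what an expression like "isoToMjd(epoch)"
has to do behind the scenes, here is a minimal Python sketch.  It is
not TOPCAT's implementation; the name iso_to_mjd is hypothetical,
timestamps without an offset are assumed to be UTC, and leap seconds
are ignored.

```python
from datetime import datetime, timezone

# MJD 0 corresponds to 1858-11-17T00:00:00 UTC.
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)


def iso_to_mjd(text):
    """Convert an ISO-8601 date/time string to a Modified Julian Date.

    Hypothetical helper for illustration: assumes UTC when the string
    carries no offset, and ignores leap seconds.
    """
    stamp = datetime.fromisoformat(text.strip())
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)
    # Elapsed days since the MJD epoch, as a floating point number.
    return (stamp - MJD_EPOCH).total_seconds() / 86400.0


print(iso_to_mjd("2000-01-01T12:00:00"))  # 51544.5
```

The conversion itself is short; the hard part for the user is knowing
that it exists and which column to apply it to.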

More generally: At one end of the scale you can have a data format
like CSV (no data type declared) and it's up to either the user
or the application to make sense of each value.  If the user has
to mark values explicitly as numeric, or double precision, or
iso-8601 or whatever, it's fiddly for them, they have to read
documentation, they may have to understand what iso-8601 means, ...
If the application does it there may be performance implications.
In either case, the wrong decision might get made.
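
As a sketch of the "application does it" option for an untyped format
like CSV, the Python fragment below guesses a column type by trying
each parser on every value.  The function name and the type labels
are made up for this example.  It shows both drawbacks mentioned
above: every value has to be scanned, and the guess can be wrong.

```python
from datetime import datetime


def guess_column_type(values):
    """Guess a type label for a list of strings from an untyped table.

    Hypothetical helper: returns "numeric", "iso-8601" or "string".
    """

    def all_parse(parser):
        # Every value must parse; this is a full scan of the column.
        try:
            for value in values:
                parser(value.strip())
        except ValueError:
            return False
        return True

    if not values:
        return "string"
    if all_parse(float):
        return "numeric"
    if all_parse(datetime.fromisoformat):
        return "iso-8601"
    return "string"


print(guess_column_type(["1.5", "2", "3e4"]))         # numeric
print(guess_column_type(["2009-05-13T04:31:10"]))     # iso-8601
print(guess_column_type(["20090513"]))                # numeric
```

The last call is the kind of wrong decision meant here: a compact
date such as 20090513 is indistinguishable from an integer unless
something in the metadata says otherwise.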

At the other end of the scale you have a data format featuring a 
semantic system (utypes, UCDs, units, plus maybe a load of other
magic) so sophisticated that the application can make decisions
on behalf of the astronomer about, say, how to perform a crossmatch
between two tables.

Of course where we want to be is somewhere between the two, and
the question is exactly where.  In my (biased, application author's)
opinion, where it's feasible to provide information about a column
that is likely to be useful to application software, and that such
software can use easily and not easily misuse, this is worth doing
(which is not to say that more complicated things should necessarily
be avoided).  Numeric/string datatypes fall into this category,
and so does a marker for, e.g., ISO-8601 or sexagesimal content.
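
To illustrate how little an application needs in order to act on
such a marker, here is a minimal Python sketch.  The column is
modelled as a plain dictionary whose keys ("datatype", "content",
"values") and marker strings are assumptions for this example, not
VOTable attribute names or TOPCAT internals; the two parsers are
likewise hypothetical and simplified (UTC assumed, leap seconds
ignored, sexagesimal values taken to be colon-separated degrees).

```python
from datetime import datetime, timezone

MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)


def iso_to_mjd(text):
    """ISO-8601 string to Modified Julian Date (UTC assumed)."""
    stamp = datetime.fromisoformat(text.strip())
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)
    return (stamp - MJD_EPOCH).total_seconds() / 86400.0


def sexagesimal_to_degrees(text):
    """'dd:mm:ss.s' string to decimal degrees."""
    text = text.strip()
    sign = -1.0 if text.startswith("-") else 1.0
    fields = text.lstrip("+-").split(":")
    if len(fields) != 3:
        raise ValueError(f"not a sexagesimal value: {text!r}")
    degrees, minutes, seconds = (float(f) for f in fields)
    return sign * (degrees + minutes / 60.0 + seconds / 3600.0)


# Marker value -> parser.  No data model knowledge is involved.
PARSERS = {
    "iso-8601": iso_to_mjd,
    "sexagesimal": sexagesimal_to_degrees,
}


def plottable_values(column):
    """Return numeric values for a column, or None if it can't be plotted."""
    if column["datatype"] != "string":
        return [float(v) for v in column["values"]]
    parser = PARSERS.get(column.get("content"))
    if parser is None:
        return None  # unmarked string column: stays off the dropdown
    return [parser(v) for v in column["values"]]


dec = {"datatype": "string", "content": "sexagesimal",
       "values": ["-12:30:00", "+00:30:00"]}
print(plottable_values(dec))  # [-12.5, 0.5]
```

With something like this in place, a marked string column could
appear in the list of plottable columns alongside the numeric ones,
and supporting a new kind of content means adding one entry to the
lookup table rather than teaching the application a data model.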

Mark

-- 
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/