Who chooses? (was Re: content, format, ctype, or xtype ?)

Wed May 13 04:05:54 PDT 2009

On Tue, 12 May 2009, Rob Seaman wrote:

> I'm snow-blind with the blizzard of messages today.  Returning to Mark's use
> case from yesterday...
> 
> On May 11, 2009, at 5:48 AM, Mark Taylor wrote:
> 
> > The use case which I have in mind (and I think Doug is thinking along
> > similar lines) is this: a user acquires a VOTable from some source -
> > perhaps TAP, perhaps not.  It contains a column X whose contents
> > is a string in iso-8601 format - this is perhaps identified by
> > utype with part of the STC data model, or with some other data model,
> > or perhaps is not.  The user loads the table into TOPCAT
> > (or some other generic table handling software) and wants to make a
> > plot with column X as one of the axes.
> > 
> > As far as TOPCAT can tell, the column contains a string, and so it is
> > unable to make a plot with it, or otherwise do anything much apart
> > from display the string contents.  If it understood that the column
> > contained a string with the semantics of an iso-8601 date/time,
> > it could make this plot.  Yes it may be possible to glean this
> > information by inspecting the utype, but in order to do that it needs
> > to have an understanding of the data model in question - a lot of work
> > for the developer, and needs to be updated every time a new data model
> > appears or is modified.  Moreover, the additional, probably rather
> > detailed, information supplied by the utype is not relevant for this
> > kind of processing.
> > 
> > You can think of similar stories for 'ctype' (or whatever) values of
> > stc-s, stc-x, sexagesimal, and other possibilities of your own device,
> > including domain-specific ones.  It should not be necessary to invent
> > a data model in order to flag this kind of thing, partly for practical
> > reasons (you need to reach agreement about a data model and update
> > software each time), and partly because use of a data model is orthogonal
> > to this issue.
> 
> ...and subtracting out all the high-falutin' computer science issues from
> today, we see that this is simply a question of whether to flag some value.
> Whether the value is flagged or not, if TOPCAT is to do what Mark's users
> want, then TOPCAT has to be able to parse ISO-8601 datetime strings or
> sexagesimal strings or stc-s strings.  These parsing methods must all be in
> place; the question is how to trigger them and who decides when to do so.
> 
> Requiring an explicit metadata flag (whether expressed as a UCD, utype, ctype,
> xtype, unit or whatever) implies that the data provider (or her minion
> programmers) should be the one selecting how an application like TOPCAT
> chooses to interpret different values.  This, I think, is the real underlying
> issue.  Rather, might it not be asserted that TOPCAT is a power tool belonging
> to the user?
> 
> With a method to parse sexagesimal values - a method that is required in any
> event - isn't it trivial for TOPCAT to activate user-controlled plotting
> capabilities for such string valued columns?

Rob,

you're right, you could do it like this.  It's really a matter of
convenience.  

At one end of the scale you can have a data format
like CSV (no data type declared) and it's up to either the user,
or the application to make sense of each value.  If the user has
to mark values explicitly as numeric, or double precision, or
iso-8601 or whatever, it's fiddly for them, they have to read
documentation, they may have to have a clue what iso-8601 means, ...
If the application does it there may be performance implications.
In either case, the wrong decision might get made.
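
To make the application-side option concrete, here is a minimal Python
sketch of a tool guessing how to interpret a column of strings, with an
optional user declaration that overrides the guess.  All the names here
(parse_iso8601, parse_sexagesimal, guess_column_type, column_to_numbers)
are hypothetical and are not taken from TOPCAT or any other package.  The
sketch assumes ISO-8601 strings without time zone designators and
sexagesimal values with three fields separated by colons or spaces.

```python
import re
from datetime import datetime

# Assumed sexagesimal layout: optional sign, then three fields separated
# by ':' or ' ', e.g. "12:34:56.7" or "-05 23 10".
_SEXAGESIMAL = re.compile(r"^([+-]?)(\d+)[: ](\d+)[: ](\d+(?:\.\d*)?)$")


def parse_sexagesimal(text):
    """Convert a sexagesimal string to a decimal value in its leading unit."""
    m = _SEXAGESIMAL.match(text.strip())
    if m is None:
        raise ValueError("not sexagesimal: %r" % text)
    sign = -1.0 if m.group(1) == "-" else 1.0
    whole, minutes, seconds = int(m.group(2)), int(m.group(3)), float(m.group(4))
    return sign * (whole + minutes / 60.0 + seconds / 3600.0)


def parse_iso8601(text):
    """Convert a zone-less ISO-8601 string to seconds since 1970-01-01T00:00:00."""
    dt = datetime.fromisoformat(text.strip())
    return (dt - datetime(1970, 1, 1)).total_seconds()


# Candidate interpretations, tried in this order when guessing.
PARSERS = {"sexagesimal": parse_sexagesimal, "iso8601": parse_iso8601}


def guess_column_type(values):
    """Return the first interpretation that parses every value, else "string".

    This scans the whole column once per candidate, which is where the
    performance cost of application-side guessing shows up.
    """
    if not values:
        return "string"
    for name, parser in PARSERS.items():
        try:
            for v in values:
                parser(v)
        except (ValueError, TypeError):
            continue
        return name
    return "string"


def column_to_numbers(values, declared=None):
    """Turn a string column into numbers; a user declaration beats the guess."""
    kind = declared or guess_column_type(values)
    if kind == "string":
        raise ValueError("column cannot be interpreted as numbers")
    return [PARSERS[kind](v) for v in values]
```

Calling column_to_numbers(column) lets the application decide, while
column_to_numbers(column, declared="iso8601") lets the user or the data
provider decide.  Either way the same parsers have to exist; only the
trigger differs, and a guess can still be wrong for ambiguous columns.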

At the other end of the scale you have a data format featuring a 
semantic system (utypes, UCDs, units, plus maybe a load of other
magic) so sophisticated that the application can make decisions
on behalf of the astronomer about, say, how to perform a crossmatch
between two tables.

Of course where we want to be is somewhere between the two, and
the question is exactly where.  In my opinion where it's feasible

a user tool could use CSV tables (no data type declared), and the user
could be required to declare before use, for each column, whether it's
a number or a string, and/or whether it represents an angle or a time,
> 
> Rob
> 

-- 
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/