[TAP] data type for column metadata

Wed Mar 25 08:13:26 PDT 2009

Dear DAL folks,

On Wed, Mar 25, 2009 at 09:45:00AM -0400, Arnold Rots wrote:
> Patrick Dowler wrote:
> > This is more "how to deal with these in an RDBMS" rather than "the right way 
> > to deal with them", but I think that is the issue here :-) 
> My concern is that the present discussion has jummped to "what is the
> right way" to deal with numbers without first asking what it is that
> we are dealing with.
Well, my take on this is that these are two separate issues that
should be kept separate as far as possible.

The discussion started as "How do we say, in TAP metadata, how a
given quantity is stored in a database".  Since the database is being
queried through ADQL, and column types restrict what queries are
syntactically legal, the question probably should be phrased: "How
does a certain column look to ADQL?".

A second point was "How is a value for that column transferred?"
(i.e., do we want a timestamp type or geometry types in VOTable?).
This is a different, and at least equally involved question.

Both questions ideally are completely unrelated to any interpretation
-- whether a float represents a magnitude, a polarization, or an error
for any of these quantities is, of course, decided independently of
the fact that it is stored as a float.  We have UCDs for that.  Of
course UCDs are insufficient to hold all the intricacies involved
with physical quantities, and yet we talk about declaring and
transferring floats, and rightly so.

Every nontrivial ADQL query will have lots of pitfalls, and the
resulting tables will require lots of interpretation by both the
client and the human using it.  I'm afraid that's what science is
about, and trying to push much of it into protocols will turn their
descriptions into physics textbooks.  This may be a good thing, but
will take ages.

> > Yes. In response to Gerard's list of data types in SQL, I mentioned that I 
> > have never found a good use for either DATE or TIME types - only TIMESTAMP. 
> > Maybe other people have a different experience?
> 
> I'm not talking about a data type, here, but about a concept: it is a
> generic time stamp, not a spatial coordinate, not a duration, etc.
Well, but for both TAP and VOTable data types only the fact that
*some* time is being stored here is relevant.  What time scale it is
in, whether it's the time some GRB could be detected on Antares or
the start of an exposure has nothing to do with data or legal queries
but is entirely an issue of metadata and interpretation.

I'm not saying it's unimportant, I'm just saying that it doesn't
matter for purposes of storing or transferring data.  I would also
maintain that it doesn't matter for writing queries, since, as
pointed out above, I have strong doubts that we want to build so
much physics into your protocol that queries "violating physics"
become some kind of error.

I have a feeling much of the tendency to mix metadata into data is
due from the expectation that clients or servers will do "magic"
transforms to "fix" user booboos (e.g., comparing a timestamp in TT
and UTC will magically to "the right thing").  For the record, I
don't think this is going to fly very wide, and certainly does not
warrant complicating protocols and metadata by a significant factor
at this point (see below).

> > Now, we chose MJD for "astronomical times" because it is much easier to 
> > compute things (histograms, statictics, etc) when the number is directly 
> > accessible to SQL.
> Sorry, here you fall off the cliff: MJD DOES NOT IMPLY A TIME SCALE!
> For "astronomical" times we need to know Time Scale (see list above),
> Reference Position (see below), and possibly the zero point (see below).
The "we" here is whoever interprets the data.  A service should have
some way to communicate this kind of data, and using STC encoded in
VOTables it has.  One could argue this information should be
available to clients before retrieving a VOTable, but again this is
not about data types but about richer metadata.

[My take: *If* we want to expose metadata this detailed up front, why not 
use the proven techique of DAL v1 and define some way to retrieve
an "empty" VOTable for a given query?  There are pitfalls in this as
well, and I'd rather see an implementation of this before it ends up
in a standard, but I think that would be the least intrusive way]

> Please do not use ISO-8601 with timezone indicators - just the
> restricted form I gave above. Only use UTC - Timezone 0 - conforming
> with current astronomical usage.
Full support here.  *Don't* mess with time zones, DST, and all those
other horrors.  The laws of physics are complex enough, we don't want
to have to account for the laws of 200+ governments on this planet.

> > ISO-8601 format w/ timezone as above. I do not know where one would say a 
> > column is JD vs MJD... hopefully someplace more machine-usable than the 
> > documentation or comments :)
> Well, that is the issue: you need to know what you are talking about,
> in the queries and the results, in order to figure out what it means.
Maybe I can make my point again: That "you" in general will be a
scientist, and only in simple cases a machine client.  These simple
cases will have to be exactly spelled out, down to actual
transformations, error conditions, etc.  An example where I feel this
has not been done is ADQL failing to define the behaviour of an ADQL
server for a phrase like CONTAINS(POINT("FK4", raj2000, dej2000),
CIRCLE("ICRS", 91, -10, 2)); there are (at least) three issues in
there the resolution of which currently is left to the implementor's fancy.

So: let's try to get an ADQL-, TAP-, and VOTable-level description of
the data right *first*.  That's feasable on the time scale of project
funding.  *Then* talk about physics, when you're ready to define
things as rigorously as necessary to let services interoperate
without a scientist thinking about them.  Until that time, assume
that scientists will need to think about what query they write.  Yes,
they will make mistakes.  As they always have.  But at least they
have a chance to make them.

> This transcends data models and it's not clear that there is a
> mechanism to learn that type of information...
Well, there's referenceURL in the registry metadata.  Let's face it:
we cannot encode all aspects of a physical measurement or even a
simulation in some formal language yet.  We should do what's easy,
work toward what's feasable after the spec is released, and apply for
funding to try and get what's impossible now within reach maybe 10
years from now.  For the rest, use what's worked for 400 years:
natural language and math mashed together in some mess only
intelligable to humans.

========= Philosophy ends ==========

So -- I'll support basically any choice of SQL types being part of
TAP metadata; I think we definitely should have timestamps in there
(though of course, people are free to store times in floats, integers
or varchars if that's appropriate; it's just that there is one
standard way to store and transfer timestamps, and everything else
has to be manually supported).

Since ADQL allows geometries to be returned from queries and quite a
few RDBMSes allow geometry-like objects in tables, it is probably
wise we have ADQL-type geometries in that type system as well
(though, given that ADQL geometries are data/metadata soup anyway, I
think transferring them as alphabet soup in VOTables is just fine).

I agree that checking the relevance for query writing is a good test
for what granularity we need here.  A quick inspection of the ADQL
grammar seems to suggest that it's POINT and REGION indeed:

<coord_value> ::= <point> | <column_reference>

-- which singles out POINT from other geometry value expressions.

This reasoning would break if we anticipated the need to define
functions taking only certain types of geometries.  Do we?

Cheers,

             Markus