[TAP] data type for column metadata
Arnold Rots
arots at head.cfa.harvard.edu
Wed Mar 25 10:37:35 PDT 2009
I don't want to drag this discussion out, but just to make it clear
what my concern is:
When we deal with spatial coordinates people now seem reasonably aware
that there is more to this than just to get a coordinate pair.
The tendency that I saw in the TAP time discussion was that (at least
some) people were (correctly) concerned about whether to treat a time
column as an ISO-8601 string, a datetime data type, or a double (or a
quad, for that matter); but that they then assumed that all had been
solved and we would know exactly what time it was.
That is far from being the case.
I agree that the remaining questions are not necessarily part of this
discussion, but they do need to be kept in mind. And so far I have
seen little evidence that this is the case. The mistaken impression
that anyone who cries "MJD!" has provided all information necessary
presents a dangerous delusion. A system based on this - just to keep
things simple - is going to be useless for many applications.
Oh yes, and similar considerations, of course, apply to redshifts and
Doppler velovcities.
Cheers,
- Arnold
Markus Demleitner wrote:
> Dear DAL folks,
>
> On Wed, Mar 25, 2009 at 09:45:00AM -0400, Arnold Rots wrote:
> > Patrick Dowler wrote:
> > > This is more "how to deal with these in an RDBMS" rather than "the right way
> > > to deal with them", but I think that is the issue here :-)
> > My concern is that the present discussion has jummped to "what is the
> > right way" to deal with numbers without first asking what it is that
> > we are dealing with.
> Well, my take on this is that these are two separate issues that
> should be kept separate as far as possible.
>
> The discussion started as "How do we say, in TAP metadata, how a
> given quantity is stored in a database". Since the database is being
> queried through ADQL, and column types restrict what queries are
> syntactically legal, the question probably should be phrased: "How
> does a certain column look to ADQL?".
>
> A second point was "How is a value for that column transferred?"
> (i.e., do we want a timestamp type or geometry types in VOTable?).
> This is a different, and at least equally involved question.
>
> Both questions ideally are completely unrelated to any interpretation
> -- whether a float represents a magnitude, a polarization, or an error
> for any of these quantities is, of course, decided independently of
> the fact that it is stored as a float. We have UCDs for that. Of
> course UCDs are insufficient to hold all the intricacies involved
> with physical quantities, and yet we talk about declaring and
> transferring floats, and rightly so.
>
> Every nontrivial ADQL query will have lots of pitfalls, and the
> resulting tables will require lots of interpretation by both the
> client and the human using it. I'm afraid that's what science is
> about, and trying to push much of it into protocols will turn their
> descriptions into physics textbooks. This may be a good thing, but
> will take ages.
>
> > > Yes. In response to Gerard's list of data types in SQL, I mentioned that I
> > > have never found a good use for either DATE or TIME types - only TIMESTAMP.
> > > Maybe other people have a different experience?
> >
> > I'm not talking about a data type, here, but about a concept: it is a
> > generic time stamp, not a spatial coordinate, not a duration, etc.
> Well, but for both TAP and VOTable data types only the fact that
> *some* time is being stored here is relevant. What time scale it is
> in, whether it's the time some GRB could be detected on Antares or
> the start of an exposure has nothing to do with data or legal queries
> but is entirely an issue of metadata and interpretation.
>
> I'm not saying it's unimportant, I'm just saying that it doesn't
> matter for purposes of storing or transferring data. I would also
> maintain that it doesn't matter for writing queries, since, as
> pointed out above, I have strong doubts that we want to build so
> much physics into your protocol that queries "violating physics"
> become some kind of error.
>
> I have a feeling much of the tendency to mix metadata into data is
> due from the expectation that clients or servers will do "magic"
> transforms to "fix" user booboos (e.g., comparing a timestamp in TT
> and UTC will magically to "the right thing"). For the record, I
> don't think this is going to fly very wide, and certainly does not
> warrant complicating protocols and metadata by a significant factor
> at this point (see below).
>
> > > Now, we chose MJD for "astronomical times" because it is much easier to
> > > compute things (histograms, statictics, etc) when the number is directly
> > > accessible to SQL.
> > Sorry, here you fall off the cliff: MJD DOES NOT IMPLY A TIME SCALE!
> > For "astronomical" times we need to know Time Scale (see list above),
> > Reference Position (see below), and possibly the zero point (see below).
> The "we" here is whoever interprets the data. A service should have
> some way to communicate this kind of data, and using STC encoded in
> VOTables it has. One could argue this information should be
> available to clients before retrieving a VOTable, but again this is
> not about data types but about richer metadata.
>
> [My take: *If* we want to expose metadata this detailed up front, why not
> use the proven techique of DAL v1 and define some way to retrieve
> an "empty" VOTable for a given query? There are pitfalls in this as
> well, and I'd rather see an implementation of this before it ends up
> in a standard, but I think that would be the least intrusive way]
>
> > Please do not use ISO-8601 with timezone indicators - just the
> > restricted form I gave above. Only use UTC - Timezone 0 - conforming
> > with current astronomical usage.
> Full support here. *Don't* mess with time zones, DST, and all those
> other horrors. The laws of physics are complex enough, we don't want
> to have to account for the laws of 200+ governments on this planet.
>
> > > ISO-8601 format w/ timezone as above. I do not know where one would say a
> > > column is JD vs MJD... hopefully someplace more machine-usable than the
> > > documentation or comments :)
> > Well, that is the issue: you need to know what you are talking about,
> > in the queries and the results, in order to figure out what it means.
> Maybe I can make my point again: That "you" in general will be a
> scientist, and only in simple cases a machine client. These simple
> cases will have to be exactly spelled out, down to actual
> transformations, error conditions, etc. An example where I feel this
> has not been done is ADQL failing to define the behaviour of an ADQL
> server for a phrase like CONTAINS(POINT("FK4", raj2000, dej2000),
> CIRCLE("ICRS", 91, -10, 2)); there are (at least) three issues in
> there the resolution of which currently is left to the implementor's fancy.
>
> So: let's try to get an ADQL-, TAP-, and VOTable-level description of
> the data right *first*. That's feasable on the time scale of project
> funding. *Then* talk about physics, when you're ready to define
> things as rigorously as necessary to let services interoperate
> without a scientist thinking about them. Until that time, assume
> that scientists will need to think about what query they write. Yes,
> they will make mistakes. As they always have. But at least they
> have a chance to make them.
>
> > This transcends data models and it's not clear that there is a
> > mechanism to learn that type of information...
> Well, there's referenceURL in the registry metadata. Let's face it:
> we cannot encode all aspects of a physical measurement or even a
> simulation in some formal language yet. We should do what's easy,
> work toward what's feasable after the spec is released, and apply for
> funding to try and get what's impossible now within reach maybe 10
> years from now. For the rest, use what's worked for 400 years:
> natural language and math mashed together in some mess only
> intelligable to humans.
>
> ========= Philosophy ends ==========
>
> So -- I'll support basically any choice of SQL types being part of
> TAP metadata; I think we definitely should have timestamps in there
> (though of course, people are free to store times in floats, integers
> or varchars if that's appropriate; it's just that there is one
> standard way to store and transfer timestamps, and everything else
> has to be manually supported).
>
> Since ADQL allows geometries to be returned from queries and quite a
> few RDBMSes allow geometry-like objects in tables, it is probably
> wise we have ADQL-type geometries in that type system as well
> (though, given that ADQL geometries are data/metadata soup anyway, I
> think transferring them as alphabet soup in VOTables is just fine).
>
> I agree that checking the relevance for query writing is a good test
> for what granularity we need here. A quick inspection of the ADQL
> grammar seems to suggest that it's POINT and REGION indeed:
>
> <coord_value> ::= <point> | <column_reference>
>
> -- which singles out POINT from other geometry value expressions.
>
> This reasoning would break if we anticipated the need to define
> functions taking only certain types of geometries. Do we?
>
> Cheers,
>
> Markus
>
--------------------------------------------------------------------------
Arnold H. Rots Chandra X-ray Science Center
Smithsonian Astrophysical Observatory tel: +1 617 496 7701
60 Garden Street, MS 67 fax: +1 617 495 7356
Cambridge, MA 02138 arots at head.cfa.harvard.edu
USA http://hea-www.harvard.edu/~arots/
--------------------------------------------------------------------------
More information about the dal
mailing list