[TAP] Summary: data type for column metadata

Fri Apr 17 08:15:14 PDT 2009

Hi Pat 
A somewhat long comment again, just before the weekend.

>> I like this list.  I think we also need to include a recommendation 
>> (at
>> least) for how these should be mapped to VOTable types.  This is not 
>> only for consistency in TAP responses, but also for describing a table 
>> in the registry outside the context of TAP (e.g. describing the table 
>> returned from an SIA query).  Does this seem reasonable?
>>     ADQL type      VOTable
>>     BOOLEAN        boolean
>>     SMALLINT       short
>>     REAL           float
>>     DOUBLE         double
>>     TIMESTAMP      char arraysize="*", (format?)
>>                     (or is it numeric?)

I personally think a mapping fro ADQL->VOTable should be mandated. But it
has to go both ways doesn't it?
If I upload a VOTable into TAP_UPLOAD and write a query against it, I must
be allowed some expectation on the 
ADQL datatypes of the columns in the table that was uploaded.
In particular I wonder whether we can insist that the following two
equivalent actions MUST have the same result:

1. single query:
select *
  from table1 t1
  ,    table2 t2
 where <some constraints on t1>
   and <some constraints on t2>
   and <some join condition>

2. two queries
- retrieve result of
select * 
  from table1
 where <some constraints on t1>

- upload result as mytable1 and query

select *
  from TAP_UPLOAD.mytable1 t1
  ,    table2 t2
 where <some constraints on t2>
   and <some join condition>

If we do NOT mandate a particular mapping from ADQL->VOTable, for example if
TAP services were allowed to serialise INTEGER columns into char FIELDs, 
this would likely break the equivalence.
The problem is that the above list is not invertbile. And it seems that
currently VOTable does not give an option to make this so.
I guess that unless VOTable's datatype systems is expanded, something like
Mark's proposal SHOULD be available.

> In my experience with RDBMSs (sybase, DB2, postgresql) the 
> following is correctly interpreted by all:
> 
> yyyy-MM-dd HH:mm:ss.SSS
> 
> Sybase and postgresql are OK with the T separator between the 
> date and the time, but DB2 does not like it.
> 
SQLServer (tested 2005) supports both.
Is also happy with  yyyyMMdd[ HH:mm[:ss[.SSS]]] 

> >
> > > 2. add single additional optional metadata coordinate system spec
> >
> > I like the idea of relying on the existing solution available in 
> > VOTable using GROUPs and ucds/utypes.  Of course, it would 
> be good to 
> > capture this in the registry description via VODataService.  May I 
> > propose adapting the GROUPs model into VODataService.  Flatter 
> > structures makes searching easier so it probably would look 
> exactly like it.
> 
> It is true that the VOTable needs as much metadata as 
> possible, and in that sense I think takes priority over also 
> including that metadata in the TAP_SCHEMA and/or 
> VODataService. If we do not add this extra metadata to 
> TAP_SCHEMA (and it could be considerable work to make the 
> list of values and/or rules for how to make up your own) then 
> a client application could still resort to a query with 
> MAXREC=0 to see what the output metadata would look like. 
> This would be less work now and not preclude extending 
> TAP_SCHEMA later on. 
> 
> This does mean that TAP should specify how the column names 
> in TAP_SCHEMA (and the query) relate to the FIELDs in the 
> VOTable? Specifically, which attribute of FIELD holds the 
> (fully-qualified) column name? Otherwise, people could not 
> put the two together and understand the VOTable fully.
> 
I think it is a natural expectation that when a name can be identified for a
column in a query result that that name is used as the name attribute of a
FIELD.
when using "select * from sometable" these have to be inferred, when using
"select a as b, c as d" I expect b and d as names, not a and c.
This seems to me pretty straightforward. I can see problems to arise because
it is legal in ADQL (isn't it?) to pose (example from Markus) :

select a+b
  from sometable

I.e. no column name can be inferred. 

It is also potentially difficult for clients to deal with legal ADQL queries
that return multiple columns with the same name.
At least TOPCAT's plugin accessing GAVO's Millennium site returns only a
single set of columns from the Snapshots table when running the 
following SQL:

select * from snapshots s1, snapshots s2
where s1.snapnum=s2.snapnum

> It also means that (as mentioned above) that the VOTable 
> should extract the common coordinate system content from 
> STC/S formatted values, put them in a PARAM (or GROUP), and 
> refer to them from the FIELD via the ref attribute, as 
> described in the 
> http://www.ivoa.net/Documents/latest/VOTableSTC.html note. 
> This is pretty straight-forward and seems like the "right" 
> way to do it anyway.
> 
> So, we agree that the extra coordinate system metadata is out 
> (at least for TAP 1.0) and the caller will have to do a 
> MAXREC=0 query to get this information, if it is applicable 
> and available?
> 

There seem to be different angles in the approach to use VOTable for
expressing TAP related metadata.
The first is obvious. VOTable is the format for returning results of ADQL
queries, and
the VOTable syntax MUST therefore be used to assign metadata to the result
set.

The second seems to say that VOTable could also be used to express the
structure of the database itself.
This should then be done using empty VOTable-s for example. 
The attractive feature of this seems to be that VOTable metadata is richer
than a simple listing of columns with some attributes.
PARAM elements can assign extra features to TABLEs and possibly nested GROUP
elements can link columns together, potentiall adding 
coordinate systems etc. This may be useful indeed, though I would rather
have this implemented in TAP as a separate servcice request,
something like a 
	REQUEST=getEmptyVOTables
and NOT as a natural outcome of a "select top 0 * from sometable" ADQL
query.

I read (maybe mistakenly) from some comments the idea that such GROUP
structure could be carried over to the result of general ADQL queries. 
I would argue that for the general case (in a mathematical sense) this is
NOT the case. 
It is hard enough I think to deal with UCDs, UTYPEs, descriptions etc that
apply to single columns.
GROUPs are much harder I would argue.
It is very easy to think of queries that will make it extremely difficult to
translate GROUPs defined on tables to query results.
For example the concept of a single coordinate system for (GROUPs of)
position columns is easily broken by a union. 
Groups themselves are easily broken up by SELECT-s other than *, etc etc
etc.

Unless the query language becomes aware of these extra constructs and one
can explicitly query for GROUPs for example, 
it seems a rather hopeless task to create a parser that will do a good job
in the general case. 
In particular I would argue that we CAN NOT insist that the ADQL engines
behind a TAP service MUST or even SHOULD maintain such groupings 
and add them to VOTable metadata. Maybe the TAP spec could say something
like 
"The result VOTable-s MAY contain GROUP constructs linking different columns
in the query result, 
if these can be directly linked to similar constructs in the underlying
table metadata."

I agree that it may be useful to have more structure in metadata on the
tables so that clients can be helped composing simple queries	.
Clearly it is useful to know which RA column belongs to which Dec column
(and which error columns as well).
And the VOTable GROUP structure may be a useful template how to put this
formally in TAP metadata. 
We may decide to take this over using something like
"REQUEST=getEcmptyVOTables", or it could be modelled into the TAP_SCHEMA.
But IF so, that's where it should probably stop.

>> I sense some consensus on these two questions so far.  I would like to 
>> immediately turn this around into a VODataService proposal.  May I 
>> enlist the respondants to Pat's summary for consultation on this 
>> proposal?  (So far, this has been Pat, Gerard, Markus, and Mark.)

It seems to become clear that there are going to be 3 different ways to
express metadata related to a TAP service.
Registry's VODataService, which according to Ray need not be as
comprehensive as is required to write queries.
I agree with that, Registry should be sued to find potentially interesting
TAP services. 
Those services themselves should be able to give all required details, at
least by allowing querying of a TAP_SCHEMA.
But maybe also through an "empty VOTable".

All of these may have certain features of metadata in common, but likely
they will not be equivalent in information content.
They will be different "views" of the metadata,
They will also represent the metadata in different ways. TAP_SCHEMA as a set
of relational tables, to be queried using ADQL. 
VODataService as a special subset of the Registry XML schema. VOTable using
TABLE, FIELD, PARAM and GROUP etc.

I think (ofourse) that it will be beneficial to have a datamodel of the
required metadata that is independent of these three representations.
On the comments page I wrote on the wiki I had already made such a proposal.
I have updated this a little with GROUP and PARAm 
elements that may be of relevance. I also added a place holder for the
queryresult as it may.
I proposed it as a way to focus the discussion on the concepts and as a
means to make subselections relevant for the three different views.

Cheers

Gerard