VODataService becoming WD, then PR

Wed Nov 19 02:10:34 PST 2008

Dear Doug and others

Apologies for the post to multiple lists, but I think this issue may require
input from TAP and ADQL as well as from Registry.
For those who have not followed the discussion, this mail is a reply in the
thread starting with  http://www.ivoa.net/forum/registry/0811/2020.htm .

I comment on two issues

1. SQL types
> 
> Regarding native SQL datatypes, this has been suggested 
> before for inclusion in the TAP schema (and VODataService), 
> and would be easy enough to do, however since table transport 
> is based upon VOTable providing the column type mappings 
> using VOTable types is the primary requirement, and is DBMS 
> neutral.  SQL datatypes tend to be DBMS-specific so if we did 
> support this we would have to decide whether to merely pass 
> through the native SQL datatype namess, or possibly map them 
> to something more general such as JDBC types.  For example 
> see http://java.sun.com/j2se/1.3/docs/guide/jdbc/getstart/mapping.html
> Although this is somewhat Java-centric, the problem addressed 
> is general and the approach allows a client to be more 
> DBMS-independent.
> Or maybe we should just pass through native types.
> 

It might make sense that we would look to ADQL to decide whiche SQL data
types to use. 
That said, there does not seem to be any explicit usage of data types by
ADQL (yet). 
One usage could be in a cast statement, but though CAST is a keyword, it
seems not to be used explicitly. Another usage is ofcourse in DDL (CREATE
TABLE ... ), which the current version of ADQL also does not include. It
might be good to know what the ideas are for the future version of ADQL.

In any case, ADQL is based on SQL92, which *does* have data types (and
casts). One might consider using these iso JDBC, as maybe future versions of
ADQL might introduce them (??).
See http://savage.net.au/SQL/sql-92.bnf.html#data%20type and further for the
SQL92 BNF <data type> statement. 

Irrespective of this, according to 3.1.1.2 in
http://www.ivoa.net/internal/IVOA/TableAccess/TAP-v0.3.pdf, TAP also has a
mode allowing ("MAY") pass-through of native SQL, though it seems it has not
been worked out in detail yet.
Taking this option seriously might have some impact on the metadata
specification.
For example one might then want to include a "nativeSQLType" for the column
metadata.
ALso I suppose the TAP service as a whole could(should) specify the database
vendor.

2. Index metadata

> On Tue, 18 Nov 2008, Gerard wrote:
> > Maybe again for a possible TAP extension:
> > Should/could indexes be included in the metadata for a table?
> > When composing queries it is often very useful to know what 
> indexes exist.
> 
> This is already included I believe (it was discussed when we 
> negotiated the most recent TAP schema and VODataService).  
> What is currently proposed is not a full specification of 
> indexes, but an optional per-column flag indicating if that 
> particular column is included in an index.  IRSA for example 
> currently does this, hence we have an existing use case where 
> this has been found to be sufficient.

I think this is a mistake.
It is very important to know precisely which index a column is part of.
For one thing, columns can be part of many indexes at the same time.
But most importantly one needs to know where in the index a column occurs
and which other columns are part of the index and in which order. 
Knowledge of which columns predced a given column in a given index may help
me write better performing queries by using more WHERE clauses.
knowledge about the total set of columns may help me be more careful in
writing the SELECT clause. Using only columns from such an index iso "SELECT
* ..." may avoid expensive lookups in the underlyinhg table.

One may furthermore wish to know whether the table is ordered (clustered) on
the index.

I think the following XML schema snippet takes this richness into account
and is imho much simpler and more useful than anything one can define at the
column level.

 <xs:complexType name="TableIndex">
    <xs:annotation>
      <xs:documentation>
        In this model an index is defined in the context of a Table.
        Hence the table itself need not be part of this definition.
      </xs:documentation>
    </xs:annotation>
    <xs:sequence> <!-- the ordering of the columns in an index is very
important -->
      <xs:element name="name" type="xs:string">
        <xs:annotation>
          <xs:documentation>
            Name may have to be unique in the table context.
          </xs:documentation>
        </xs:annotation>
      </xs:element>
      <xs:element name="isClustered" type="xs:boolean" default="true"
minOccurs="0">
        <xs:annotation>
          <xs:documentation>
            Indicates whether the table is clustered on (sorted on) this
table.
            There can be only one such index per table.
          </xs:documentation>
        </xs:annotation>
      </xs:element>
      <xs:element name="column" type="xs:string" maxOccurs="unbounded">
        <xs:annotation>
          <xs:documentation>
            Must refer to the name of a column in the Table that 
            is supposed to be the context of this index.
          </xs:documentation>
        </xs:annotation>
      </xs:element> 
    </xs:sequence>
  </xs:complexType>

The separate documentation of indexes as performed by SkyServer
(http://cas.sdss.org/dr7/en/help/browser/browser.asp) is a use case that
would directly benefit from an explicit index definition.

Cheers

Gerard