table metadata and the registry

Mon May 7 23:02:00 PDT 2007

Hi WGers,

I've been conversing with many of you regarding the general issue referred 
to as fine-grained/rich vs. coarse-grained registries.  One current 
manifestation of it is the issue of whether descriptions of table columns 
should appear in the registry.  We see use cases emerging that want to use 
this information for discovery and planning, but handling this information 
in the registry raises costly curation issues.  I would like to propose a 
solution to this issue that I believe can serve as a model for handling 
other so-called fine-grained information.  This solution will ultimately 
call for a standard format for describing a set of tables.

First, I recognize that before we can agree on whether to put fine-grained 
information into the registry, we need a common understanding of what 
qualifies as "fine-grained".  I have some ideas on this that I will be 
presenting next week in the RWG session.

The use cases that are driving table metadata into the registry are:
   a)  Finding tables based on the columns.  A specific use case is to
         build an SED from existing catalog data by searching for tables
         that have columns described by certain UCDs.
   b)  Automating the construction of specific queries to catalog
         services for use within a workflow.

One major reason that placing the column metadata in the registry
is attractive is that the registry is an existing system for
collecting the information and provides a common way to access it.
One current problem with our existing catalog services (Cone Search,
SIA, OpenSkyNode, and SSA) is that each has a slightly different way of
presenting this information.  Thus, in practice, it is difficult to
mine this information--you need 4 different methods.  For data
collections that are described independently of any service that
accesses them, there is no standard way of getting this information
other than having it in the registry.

I would like to propose we define a standard format for describing a
set of tables and all their columns that can be served by a single,
static URL.  With this, we can:
   1)  Include this URL in the resource description of any table
         service or data collection that includes tables.
   2)  Define a simple GET method that can be part of a standard
         service protocol to return this document.

Implementation considerations:
   o  More than one URL could be associated with a resource.  Thus, if
      a service or collection serves many tables, their descriptions
      could be distributed over several documents of manageable size.

   o  While the information is packaged into individual documents,
      a service can generate this information on the fly as necessary.
      (For example, if TAP were to define separate "getTables" and
      "getColumns" methods, the information could be aggregated via
      internal calls to these methods.)

   o  For existing "standards"--Cone Search, SIA, OpenSkyNode (and, if
      necessary, SSA)--we could devise trivial HTTP GET services that
      convert on-the-fly calls to their respective metadata methods
      into the standard format.  These services could be provided by
      registries.
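
The conversion mentioned in the last item could look roughly like the
following Python sketch.  It assumes the service's metadata response is
a VOTable whose FIELD elements describe the columns, and it writes them
into a hypothetical <tableset>/<table>/<column> layout (these output
names are invented for illustration, as are the sample field values).

```python
import xml.etree.ElementTree as ET

# A minimal VOTable of the kind a metadata query might return (no rows).
# The field names and UCD values are illustrative.
VOTABLE = """\
<VOTABLE version="1.1">
  <RESOURCE>
    <TABLE name="results">
      <FIELD name="ra" ucd="pos.eq.ra;meta.main" datatype="double" unit="deg"/>
      <FIELD name="dec" ucd="pos.eq.dec;meta.main" datatype="double" unit="deg"/>
      <FIELD name="id" ucd="meta.id;meta.main" datatype="char" arraysize="*"/>
    </TABLE>
  </RESOURCE>
</VOTABLE>
"""


def local(tag):
    """Strip an XML namespace, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def votable_to_tableset(votable_text):
    """Build a hypothetical <tableset> element from VOTable FIELD metadata."""
    src = ET.fromstring(votable_text)
    tableset = ET.Element("tableset")
    for elem in src.iter():
        if local(elem.tag) != "TABLE":
            continue
        table = ET.SubElement(tableset, "table", name=elem.get("name", ""))
        for field in elem:
            if local(field.tag) != "FIELD":
                continue
            # Copy only the attributes that are actually present.
            attrs = {key: field.get(key)
                     for key in ("name", "ucd", "datatype", "unit")
                     if field.get(key) is not None}
            ET.SubElement(table, "column", attrs)
    return tableset


doc = votable_to_tableset(VOTABLE)
print(ET.tostring(doc, encoding="unicode"))
```

A registry-hosted wrapper would fetch the VOTable from the service's own
metadata method and return the converted document from its GET endpoint,
so the publisher's service remains the source of the information.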

The advantages are:
   *  The information originates at the service and is maintained by
        the publisher.
   *  There is a common way to get at the column information.  It is
        not restricted to standard services but can be associated with
        any data collection or custom service that handles catalogs.
   *  The information can be obtained through the registry without it
        being stored in the registry.
   *  A registry (or other data discovery service) may harvest and
        warehouse this information for the purposes of fine-grained
        discovery; at the same time...
        +  it does not require other registries to do the same, and
        +  it does not require/encourage publishers to put this
             information into the registry explicitly.

The pressure for supporting the above use cases is large, so we need
something quickly.  I would strongly recommend a v1.0 that is simple
and based on existing formats.  I think either of two such options
would work fine:
   o  a profile on VOTable
   o  the Catalog description model currently in the VODataService
      extension schema used in the registry
      (http://www.ivoa.net/xml/VODataService/v1.0).

I also want to point out the Source Catalog Data Model, which some of you 
may be familiar with.  Because of its emphasis on the astronomical 
semantics more than table & catalog structure, it's probably not a good 
candidate for the format itself.  However, it would be a good model for 
annotating a table description via utypes.

The point is just to support what people are already doing with the
registry.  If we want to add more to the format (or even totally
replace it), I recommend we save it for a subsequent version.

So the general pattern for "fine-grained" information would be to have 
VOResource records point to this information, which is primarily managed 
at the provider's site.  Another area where I would like to explore this 
idea is in using detailed coverage information to aid in discovery.  We 
currently have a place in the VODataService schema (an extension of 
VOResource) to point to a detailed footprint service.  We would need to 
add a place to point to a table description.  Thus, there is a critical 
time issue for putting this into place.

I invite your comments, and I will raise this in Beijing during RegWG2.

cheers,
Ray