table metadata and the registry
Ray Plante
rplante at poplar.ncsa.uiuc.edu
Mon May 7 23:02:00 PDT 2007
Hi WGers,
I've been conversing with many of you regarding the general issue referred
to as fine-grained/rich vs. coarse-grained registries. One current
manifestation of it is the issue of whether descriptions of table columns
should appear in the registry. We see use cases emerging that want to use
this information for discovery and planning, but handling this information
in the registry raises costly curation issues. I would like to propose a
solution to this issue that I believe can serve as a model for handling
other reputed fine-grained information. This solution will ultimately
call for a standard format for describing a set of tables.
First, I recognize that before we can agree on whether to put fine-grained
information into the registry, we need a common understanding of what
qualifies as "fine-grained". I have some ideas on this that I will be
presenting next week in the RWG session.
The use cases that are driving table metadata into the registry are:
a) Finding tables based on the columns. A specific use case is to
build an SED from existing catalog data by searching for tables
that have columns described by certain UCDs.
b) Automating the construction of specific queries to catalog
services for use within a workflow.
One major reason that placing the column metadata in the registry
is attractive is that the registry is an existing system for
collecting the information and provides a common way to access it.
One current problem with our existing catalog services (Cone Search,
SIA, OpenSkyNode, and SSA) is that each has a slightly different way of
presenting this information. Thus, in practice, it is difficult to
mine this information--you need 4 different methods. For data
collections that are described independently of any service that
accesses them, there is no standard way of getting this information
other than having it in the registry.
I would like to propose we define a standard format for describing a
set of tables and all their columns that can be served by a single,
static URL. With this, we can:
1) Include this URL in the resource description of any table
service or data collection that includes tables.
2) Define a simple GET method that can be part of a standard
service protocol to return this document.
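
To make use case (a) concrete, here is a rough sketch of how a client
might scan such a document for tables whose columns carry certain UCDs.
The document layout (tableset/table/column/name/ucd), the sample content,
and the function name tables_with_ucds are all assumptions made for
illustration; they are not a proposed format. In practice the text would
be fetched from the URL in the resource description rather than embedded.

```python
import xml.etree.ElementTree as ET

# Hypothetical table-description document. The element names and the
# sample tables are illustrative only, not an agreed standard.
SAMPLE_DOC = """\
<tableset>
  <table>
    <name>photometry</name>
    <column><name>ra</name><ucd>pos.eq.ra</ucd></column>
    <column><name>dec</name><ucd>pos.eq.dec</ucd></column>
    <column><name>flux_v</name><ucd>phot.flux;em.opt.V</ucd></column>
  </table>
  <table>
    <name>observers</name>
    <column><name>obs_id</name><ucd>meta.id</ucd></column>
  </table>
</tableset>
"""


def tables_with_ucds(doc_text, wanted_prefixes):
    """Return names of tables that have, for every wanted UCD prefix,
    at least one column whose UCD starts with that prefix."""
    root = ET.fromstring(doc_text)
    matches = []
    for table in root.iter("table"):
        # Collect the UCD of every column in this table ("" if missing).
        ucds = [col.findtext("ucd", default="") for col in table.iter("column")]
        if all(any(u.startswith(p) for u in ucds) for p in wanted_prefixes):
            matches.append(table.findtext("name"))
    return matches


# Tables usable for an SED: need a position and a flux column.
print(tables_with_ucds(SAMPLE_DOC, ["pos.eq.ra", "pos.eq.dec", "phot.flux"]))
```

Matching on UCD prefixes is just one simple choice here; a real client
would probably want something smarter about UCD syntax.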
Implementation considerations:
o More than one URL could be associated with a resource. Thus, if
a service or collection serves many tables, their descriptions
could be distributed over several documents of manageable size.
o While the information is packaged into individual documents,
a service can generate this information on the fly as necessary.
(For example, if TAP were to define separate "getTables" and
"getColumns" methods, the information could be aggregated via
internal calls to these methods.)
o For existing "standards"--Cone Search, SIA, OpenSkyNode, (and if
necessary, SSA)--we could devise trivial HTTP GET services that
convert on-the-fly calls to their respective metadata methods
into the standard format. These services could be provided by
registries.
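
The on-the-fly aggregation mentioned above could look roughly like the
sketch below. It assumes the service exposes two callables standing in
for the hypothetical "getTables" and "getColumns" methods; neither their
signatures nor the output layout (tableset/table/column) is defined
anywhere yet, so treat every name here as a placeholder.

```python
import xml.etree.ElementTree as ET


def build_table_description(get_tables, get_columns):
    """Assemble one description document from per-table metadata calls.

    get_tables()      -> iterable of table names (hypothetical signature)
    get_columns(name) -> iterable of dicts with "name" and "ucd" keys
                         (hypothetical signature)
    """
    root = ET.Element("tableset")
    for table_name in get_tables():
        table_el = ET.SubElement(root, "table")
        ET.SubElement(table_el, "name").text = table_name
        for col in get_columns(table_name):
            col_el = ET.SubElement(table_el, "column")
            ET.SubElement(col_el, "name").text = col["name"]
            ET.SubElement(col_el, "ucd").text = col["ucd"]
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


# Stand-in metadata in place of real service calls.
_FAKE_METADATA = {"photometry": [{"name": "ra", "ucd": "pos.eq.ra"}]}

doc = build_table_description(
    lambda: list(_FAKE_METADATA),
    lambda table_name: _FAKE_METADATA[table_name],
)
print(doc)
```

The same wrapper shape would apply to a registry-hosted converter for
Cone Search, SIA, or OpenSkyNode: only the two callables would change.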
The advantages are:
* The information originates at the service and is maintained by
the publisher.
* There is a common way to get at the column information. It is
not restricted to standard services but can be associated with
any data collection or custom service that handles catalogs.
* The information can be obtained through the registry without it
being stored in the registry.
* A registry (or other data discovery service) may harvest and
warehouse this information for the purposes of fine-grained
discovery; at the same time...
+ it does not require other registries to do the same, and
+ it does not require/encourage publishers to put this
information into the registry explicitly.
The pressure for supporting the above use cases is large, so we need
something quickly. I would strongly recommend a v1.0 that is simple
and based on existing formats. I think either of two such options
would work fine:
o a profile on VOTable
o the Catalog description model currently in the VODataService
extension schema used in the registry
(http://www.ivoa.net/xml/VODataService/v1.0).
I also want to point out the Source Catalog Data Model, which some of you
may be familiar with. Because of its emphasis on the astronomical
semantics more than table & catalog structure, it's probably not a good
candidate for the format itself. However, it would be a good model for
annotating a table description via utypes.
The point is to just support what people are already doing with the
registry. If we want to add more to the format (or even totally
replace it), I recommend we save it for a subsequent
version.
So the general pattern for "fine-grained" information would be to have
VOResource records point to this information, which is primarily managed at
the provider's site. Another area where I would like to explore this idea is
in using detailed coverage information to aid in discovery. We currently have
a place in the VODataService schema (an extension of VOResource)
to point to a detailed footprint service. We would need to add a place to
point to a table description. Thus, there is a critical time issue for
putting this into place.
I invite your comments, and I will raise this in Beijing during RegWG2.
cheers,
Ray