How do we get metadata into the registry network?

Fri Jun 25 08:59:35 PDT 2004

One notable omission from the Registry interfaces document dated June 16th
is any way of loading metadata from the primary sources, which are the
files or databases in each data centre.  At present all we have is ways of
harvesting one registry from another: this reminds me of the village where
the inhabitants are said to make a living just by taking in each other's
washing.

All I have been able to find on the subject from scanning the emails and
documents is that maybe a web form or two will be provided for data centre
managers to use.  This may be feasible for the basic identity and
curational metadata for each site, but will be completely inadequate for
the bulk of the information.  It may help to assess the data volumes
involved.

Where to store metadata?

It's hard to proceed further without referring to the difference of opinion
over where to store detailed metadata (i.e. descriptions of each file or
table in the data collections or databases).   The official AstroGrid view
is that it should all be stored in the publishing registry which may or
may not be co-located with the data centre providing the data, while the
US-NVO feels that it should be acquired from the data centre holding the
substantive files/tables using metadata queries similar to regular VOQL
queries.  The full registries may (UK) or may not (elsewhere) then choose
to harvest the detailed metadata from these publishing registries; every
registry would harvest the basic metadata, presumably.

For much of the following discussion the location does not matter:
metadata have to be stored *somewhere* or else many types of VO queries
will not be feasible.  In addition one could view the registry system as a
sort of cache of metadata, potentially reducing the number of queries that
primary data centres have to handle.  Of course if the publishing registry
is co-located with the data centre, there's little difference.

Metadata Size Estimates

For tabular data I will start with Vizier, as the largest gathering of
published tabular datasets from around the world.  It is of direct
concern to AstroGrid as there is a Vizier mirror at IoA Cambridge. Today
Vizier contains 4195 tables, with approx 100,000 columns.  The Registry
(or somewhere) needs to store at least 5 items (maybe 100 bytes) per
column, making at least 10 MB of column data.  The additional metadata
needed per table is something we need to debate, but must come to a few
hundred bytes at least; the CDS system has most of the column and table
metadata (except for UCDs) in README files, one per table, which seem to
be mostly 10 - 50 kbytes in size.    So the ballpark figure for Vizier
metadata must be around 100 MB.  This is not a trivial amount of data to
be loaded, and obviously has to be gathered using special tools, and
needs a serious DBMS to store and search it efficiently.
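
The arithmetic behind these Vizier figures can be checked with a short
sketch.  The numbers are the ones quoted above; the function and constant
names are mine, purely for illustration, and the README range is only the
"mostly 10 - 50 kbytes" impression, not a measurement.

```python
# Back-of-envelope check of the Vizier metadata estimate.
# All inputs are the rough figures quoted in the text, not measurements.
N_TABLES = 4195                 # Vizier tables today
N_COLUMNS = 100_000             # approximate total number of columns
BYTES_PER_COLUMN = 100          # ~5 items per column, maybe 100 bytes
README_KB_RANGE = (10, 50)      # typical README size per table, in kbytes


def column_metadata_mb(n_columns=N_COLUMNS, bytes_per_column=BYTES_PER_COLUMN):
    """Column-level metadata volume in MB (1 MB taken as 10**6 bytes)."""
    return n_columns * bytes_per_column / 1e6


def readme_metadata_mb(n_tables=N_TABLES, kb_range=README_KB_RANGE):
    """Low and high estimates in MB if every table had a README of that size."""
    low_kb, high_kb = kb_range
    return (n_tables * low_kb / 1e3, n_tables * high_kb / 1e3)


if __name__ == "__main__":
    print("column metadata (MB):", column_metadata_mb())
    print("README-based range (MB):", readme_metadata_mb())
```

The README-based range comes out at roughly 40 to 210 MB, which is why
100 MB seems a fair ballpark.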

For static data centres this is a one-off (or occasional) operation; but many
sites (including Vizier) are frequently updated, and the metadata for the
new datasets need to be collected regularly, at least once a day I'd guess.

Vizier is, of course, an exceptional case (and its metadata are already
organized exceptionally well in what conceptually at least is a single
large DBMS).  Let us take the case of a much more modest data centre,
www.ledas.ac.uk, more typical in size.  It uses four different DBMS:
Sybase, MySQL, WCStools, and BROWSE.  Excluding the Chandra mirror (which
uses Sybase) there are some 400 tables, with ~7000 columns.  Most
metadata are stored systematically in separate tables (except for UCDs,
not yet fully implemented here).

Just within the UK's data centres, I think the DBMS in use include: MySQL,
Sybase-ASE, WCStools, DB2, Oracle, SQL Server, and O2; this suggests that
quite a number of different data gathering tools are going to be needed.

Non-tabular Metadata

So far I have only estimated sizes of tabular metadata.  Collections of
images, spectra, time-series, and raw datasets also need to have their
metadata stored systematically.  In some ways this is simpler, as (usually)
every file in a given image collection will have most metadata the same,
with only celestial coordinates and perhaps a few other values such as
epoch different.   It should be possible, with a suitable database
structure, to store these metadata more efficiently.
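
As a minimal sketch of what such a structure might look like, the shared
values can be stored once per collection and only the differing values
once per file.  The table names, column names and sample values below are
all hypothetical, chosen for illustration; SQLite is used only because it
needs no server, not because it is suitable for a real registry.

```python
import sqlite3


def build_demo_registry():
    """Create an in-memory database with one collection and two files.

    Everything here (schema and sample rows) is invented for illustration.
    """
    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Metadata common to every file in a collection: stored once.
        CREATE TABLE image_collection (
            id         INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            instrument TEXT,
            bandpass   TEXT
        );
        -- Metadata that differs from file to file: one row per file.
        CREATE TABLE data_file (
            id            INTEGER PRIMARY KEY,
            collection_id INTEGER NOT NULL REFERENCES image_collection(id),
            filename      TEXT NOT NULL,
            ra_deg        REAL,
            dec_deg       REAL,
            epoch         REAL
        );
    """)
    con.execute(
        "INSERT INTO image_collection VALUES (1, 'demo-survey', 'demo-camera', 'R')"
    )
    con.executemany(
        "INSERT INTO data_file (collection_id, filename, ra_deg, dec_deg, epoch) "
        "VALUES (1, ?, ?, ?, ?)",
        [("img0001.fits", 10.5, -3.25, 2000.0),
         ("img0002.fits", 11.0, -3.25, 2000.0)],
    )
    return con


def describe_file(con, filename):
    """Return per-file values joined with the shared collection values."""
    return con.execute(
        "SELECT f.filename, f.ra_deg, f.dec_deg, f.epoch, c.instrument, c.bandpass "
        "FROM data_file AS f "
        "JOIN image_collection AS c ON c.id = f.collection_id "
        "WHERE f.filename = ?",
        (filename,),
    ).fetchone()
```

A query on one file then recovers the full description through a join,
while the common values occupy one row however many files there are.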

Without a lot of work it is hard to estimate the sizes involved.  I just
asked my colleagues in the XMM-Newton project how many FITS files are in
the archive here: it is around 2 million (this includes data not yet in
ESA's public archive).  Each of these has ~20 FITS header keywords that should be
searchable in a registry, each needing ~50 bytes to store, giving a total
data volume of 2 GB.  (These are my guesstimates, but probably right to a
factor of 2 or so.)  One could perhaps decide that only certain types of
file are worthy of registration in a publishing registry, but it might be
hard to reach agreement on which; the easiest solution surely will be to
register all of them.
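
For anyone wanting to plug in their own numbers, the file-level estimate
above amounts to the following sketch.  The inputs are my guesstimates
from the text; the names are illustrative only.

```python
# File-level metadata estimate for an archive of FITS files.
# Inputs are the rough guesses quoted in the text.
N_FILES = 2_000_000        # FITS files in the archive
ITEMS_PER_FILE = 20        # searchable header items per file
BYTES_PER_ITEM = 50        # storage needed per item


def file_metadata_gb(n_files=N_FILES, items_per_file=ITEMS_PER_FILE,
                     bytes_per_item=BYTES_PER_ITEM):
    """Total metadata volume in GB (1 GB taken as 10**9 bytes)."""
    return n_files * items_per_file * bytes_per_item / 1e9


def with_error_factor(value, factor=2.0):
    """Range implied by an estimate that is good to the given factor."""
    return (value / factor, value * factor)


if __name__ == "__main__":
    total = file_metadata_gb()
    print("estimate (GB):", total)
    print("range at a factor of 2 (GB):", with_error_factor(total))
```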

I suspect that there are 20 or 30 data centres around the world with
similar or larger numbers of data files, and many other data centres
with somewhat smaller data collections.

I am making the assumption that these files should all have their metadata
individually registered.  That may not be essential at first, but I expect
that individually addressable tables and files will become the norm as the
VO develops.   The total metadata volume for the VO is therefore of the
order of 50 GB, with rather large error bars on that.  I'd be happy for
other people to come up with their own estimates, in the hope that we can
converge on a more accurate figure.
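
One way to arrive at a figure of that order, offered only as a sketch, is
to assume that each of the 20 or 30 large data centres holds about as
much file metadata as the 2 GB estimated above.  This ignores the smaller
centres and the tabular metadata, so it is a simplification rather than a
proper total; the names below are illustrative.

```python
# Order-of-magnitude total for the VO, assuming every large data centre
# carries roughly the same file metadata volume as the 2 GB estimate.
GB_PER_LARGE_CENTRE = 2.0          # assumed per-centre volume
LARGE_CENTRES_RANGE = (20, 30)     # "20 or 30 data centres"


def vo_total_gb_range(gb_per_centre=GB_PER_LARGE_CENTRE,
                      centres_range=LARGE_CENTRES_RANGE):
    """Low and high totals in GB from the large data centres alone."""
    low, high = centres_range
    return (low * gb_per_centre, high * gb_per_centre)


if __name__ == "__main__":
    print("VO total range (GB):", vo_total_gb_range())
```

Changing the per-centre volume or the number of centres is then a matter
of passing different arguments.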

Of course the volume within the Registry is much smaller if we get the
detailed metadata from the data centres themselves, and only store the
upper levels of metadata. In either case, however, tools are needed to
extract the metadata for storage in a VO-compliant form somewhere.  If
this is left to those hard-pressed individuals running the data centres,
this will take a long time, because I don't think anyone has yet defined
the job, nor defined the interfaces in sufficient detail.  I would have
hoped that the various VO projects would become involved in this activity.

So - who is going to develop these tools?

-- 
Clive Page
Dept of Physics & Astronomy,
University of Leicester,
Leicester, LE1 7RH,  U.K.