Registry data structures

Fri Jun 25 09:04:20 PDT 2004

The figure for the volume of metadata that I derived in my last posting is
not huge: 50 GB is similar to some large source catalogues, but it means
that searching it without indices or other supporting data structures will
be infeasibly slow.

The problem comes with the intrinsically complex structure of our datasets
and therefore metadata.  We need to work out the main search patterns,
which may have implications for the data structure.

It would be nice if we had a good collection of use-cases which we could
analyse to work out likely search and usage patterns.  It may be that they
exist somewhere that I've missed.   The science and use-cases collected by
AstroGrid and NVO have mostly been written on the assumption that the
astronomers know what resources they need and where to find them, so the
most basic registry functionality can be ignored.  Some of the queries need
access to column-level metadata, but here it would be equally feasible to
get them from the data centre concerned, as from the registry.

Spatial Coverage

Another unsolved problem in the registry system is how to store and search
spatial coverage data, e.g. to answer questions like: "where can I find an
<image|spectrum|source-list> in <band> around <RA,DEC>". The point is that
the existing methods work well only for the large systematic surveys of
large parts of the sky, but increasingly data are coming on line
from large collections of individual pointings, from space observatories
such as HST, Chandra, XMM-Newton, etc., and also from ground-based
telescopes in the optical, infra-red, and radio bands.  Finding which
instrument has ever observed the patch of sky of interest presents a real
problem at present.  The Registry ought to be able to assist.

An element called Coverage.Spatial is defined in our current service
content metadata: this can specify patches of sky as circles, polygons,
etc.  As far as I know, however, we have not yet worked out a
good way of storing and searching these, given that a single mission will
often have covered thousands of small patches of sky.  In principle a
bitmask system would serve and Patricio Ortiz has been working on ways of
doing this; but covering the sky with a bitmask at just one-degree
resolution still requires about 8 kB per mask.  One would like higher
resolution to avoid too many false positives, yet this would seem to use
too much storage and would make searches rather too slow.
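
As a rough check on the storage figure, here is a minimal sketch of such a
bitmask.  It assumes a plain equal-angle grid in RA and Dec (360 x 180
cells at one-degree resolution), which is consistent with the 8 kB quoted
above; the scheme actually being developed may well tile the sky
differently.  The names SkyMask and mask_size_bytes are illustrative only.

```python
def mask_size_bytes(cells_per_deg: int) -> int:
    """Bytes for a one-bit-per-cell mask on an equal-angle RA/Dec grid."""
    n_ra = 360 * cells_per_deg
    n_dec = 180 * cells_per_deg
    return (n_ra * n_dec + 7) // 8


class SkyMask:
    """Coverage bitmask on an assumed equal-angle RA/Dec grid."""

    def __init__(self, cells_per_deg: int = 1):
        self.cpd = cells_per_deg
        self.n_ra = 360 * cells_per_deg
        self.n_dec = 180 * cells_per_deg
        self.bits = bytearray(mask_size_bytes(cells_per_deg))

    def _index(self, ra: float, dec: float) -> int:
        # Clamp so that RA = 360 and Dec = +90 fall in the last cell.
        i = min(int((ra % 360.0) * self.cpd), self.n_ra - 1)
        j = min(int((dec + 90.0) * self.cpd), self.n_dec - 1)
        return j * self.n_ra + i

    def add(self, ra: float, dec: float) -> None:
        """Mark the cell containing (ra, dec) as observed."""
        k = self._index(ra, dec)
        self.bits[k >> 3] |= 1 << (k & 7)

    def covers(self, ra: float, dec: float) -> bool:
        """True if the cell containing (ra, dec) has been marked."""
        k = self._index(ra, dec)
        return bool(self.bits[k >> 3] & (1 << (k & 7)))


print(mask_size_bytes(1))   # 8100 bytes at one-degree resolution
print(mask_size_bytes(10))  # 810000 bytes at 0.1-degree resolution
```

On this grid a hundredfold increase in the number of cells costs a
hundredfold increase in storage per mask.  A one-degree cell also reports
coverage for any position sharing a cell with a real pointing, which is the
source of the false positives.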

Possible Data Structures

The two obvious cases are using a purely relational structure, and a purely
hierarchical one.  It may be that some intermediate case is better, but I
shall explore the extremes first.

If we go for a relational structure,  in principle the Full Registry could
contain a small number of tables of metadata, perhaps one for all tabular
datasets, one for all images, etc. The table of tabular metadata would
have one row per column of each table, and then columns for each item we
want to be searchable, e.g.

 - PublisherId
 - Identifier (of table)
 - columnName
 - columnUCD

To make this table conform to the Date/Codd rules, the primary key to
guarantee uniqueness would be the combination of the first 3 fields listed
above, I should think.  We might well want indices on Identifier and
columnUCD.
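
A minimal sketch of this table, using SQLite through Python purely for
illustration (the choice of RDBMS is not decided here).  The table name,
the snake_case column names, the example.org publisher and the sample rows
are all made up; the UCD strings are only examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE column_metadata (
    publisher_id TEXT NOT NULL,
    identifier   TEXT NOT NULL,   -- identifier of the table
    column_name  TEXT NOT NULL,
    column_ucd   TEXT,
    -- the first three fields together guarantee uniqueness
    PRIMARY KEY (publisher_id, identifier, column_name)
);
CREATE INDEX idx_col_identifier ON column_metadata (identifier);
CREATE INDEX idx_col_ucd        ON column_metadata (column_ucd);
""")

# Hypothetical sample rows: one row per column of each registered table.
conn.executemany(
    "INSERT INTO column_metadata VALUES (?, ?, ?, ?)",
    [
        ("example.org", "cat/a", "ra",   "POS_EQ_RA_MAIN"),
        ("example.org", "cat/a", "dec",  "POS_EQ_DEC_MAIN"),
        ("example.org", "cat/a", "bmag", "PHOT_MAG_B"),
        ("example.org", "cat/b", "ra",   "POS_EQ_RA_MAIN"),
        ("example.org", "cat/b", "dec",  "POS_EQ_DEC_MAIN"),
    ],
)


def tables_with_ucd(ucd):
    """Return (publisher_id, identifier) for tables having a column with this UCD."""
    cur = conn.execute(
        "SELECT DISTINCT publisher_id, identifier FROM column_metadata "
        "WHERE column_ucd = ? ORDER BY publisher_id, identifier",
        (ucd,),
    )
    return cur.fetchall()


print(tables_with_ucd("PHOT_MAG_B"))
```

The index on the UCD column is what lets a query of the form "which tables
contain a B magnitude" avoid a full scan of the table.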

For image metadata, there would be one row per image, and columns for:
 - PublisherId
 - Identifier (of the image)
 - Instrument
 - Coverage.Spatial
 - Coverage.Spectral or maybe Coverage.Spectral.Bandpass
 - Coverage.Temporal.StartTime and .StopTime
 - probably some elements covering resolution, sensitivity, etc.

The number of individual images accessible to VO users is perhaps only a
few millions to tens of millions, so this will only be a table of modest
size, but B-tree indices will be needed to allow efficient searching on
complex criteria.  Since each image only covers a small more-or-less
contiguous patch of sky, the Coverage.Spatial element could be indexed by
an R-tree (or similar) as built in to most modern RDBMS.
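
A sketch of the image table and a positional lookup, again using SQLite
through Python for illustration only.  It assumes Coverage.Spatial is
reduced to an RA/Dec bounding box, ignores wrap-around at RA = 0, and
searches on ordinary B-tree indices; an R-tree would answer the same
box-containment query more efficiently.  All names and rows are
hypothetical.

```python
import sqlite3

img_db = sqlite3.connect(":memory:")
img_db.executescript("""
CREATE TABLE image_metadata (
    publisher_id TEXT NOT NULL,
    identifier   TEXT NOT NULL,   -- identifier of the image
    instrument   TEXT,
    ra_min  REAL, ra_max  REAL,   -- bounding box standing in for
    dec_min REAL, dec_max REAL,   -- Coverage.Spatial (degrees)
    PRIMARY KEY (publisher_id, identifier)
);
CREATE INDEX idx_img_instrument ON image_metadata (instrument);
CREATE INDEX idx_img_dec        ON image_metadata (dec_min, dec_max);
""")

# Hypothetical sample rows: one row per image.
img_db.executemany(
    "INSERT INTO image_metadata VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("example.org", "img/001", "cam-A", 10.0, 10.2, -5.1, -4.9),
        ("example.org", "img/002", "cam-A", 10.1, 10.3, -5.1, -4.9),
        ("example.org", "img/003", "cam-B", 200.0, 200.5, 30.0, 30.5),
    ],
)


def images_at(ra, dec):
    """Identifiers of images whose bounding box contains (ra, dec)."""
    cur = img_db.execute(
        "SELECT identifier FROM image_metadata "
        "WHERE ra_min <= ? AND ra_max >= ? "
        "AND dec_min <= ? AND dec_max >= ? "
        "ORDER BY identifier",
        (ra, ra, dec, dec),
    )
    return [row[0] for row in cur.fetchall()]


print(images_at(10.15, -5.0))
```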

This sort of structure, with one table covering all resources of a given
type in the whole VO, is only feasible if the Full Registry holds it.
If we go for the less detailed storage in the Full Registry, then we need
an index entry only for each table rather than each column, and each
collection of images, not for each image.  The latter will require a much
more complex data type for Coverage.Spatial, perhaps the 8 kB bitmask.
Unfortunately these would not be handled by the built-in indexing of any
RDBMS that I know about.

The opposite extreme is to go for a hierarchy more-or-less mapping the
data.  The levels would be something like this:

level 0: data centre
  level 1: data collection (e.g. Vizier) or database
    level 2: individual table
      level 3: column

level 0: data centre
  level 1: observatory data collection
    level 2: observation
      level 3: exposure
        level 4: dataset (e.g. image, spectrum, etc.)
          level 5: FITS keyword (those worth indexing)

The appropriate language for searching would then be much nearer to XPath
than SQL.  Using such a hierarchical scheme, it would be perfectly
feasible to store the detailed levels at the data centre, with only the
upper levels of the hierarchy stored in the Registry.
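
To illustrate the hierarchical alternative, here is a small sketch that
holds the first (tabular) hierarchy as XML and searches it with the limited
XPath subset in Python's standard xml.etree.ElementTree.  The element and
attribute names (dataCentre, collection, table, column, ucd) and the sample
content are assumptions made for this example, not a proposed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry fragment following levels 0 to 3 above.
REGISTRY_XML = """
<registry>
  <dataCentre name="centre-A">
    <collection name="catalogues">
      <table name="stars">
        <column name="ra"   ucd="POS_EQ_RA_MAIN"/>
        <column name="bmag" ucd="PHOT_MAG_B"/>
      </table>
      <table name="galaxies">
        <column name="ra"   ucd="POS_EQ_RA_MAIN"/>
      </table>
    </collection>
  </dataCentre>
</registry>
"""

root = ET.fromstring(REGISTRY_XML)


def tables_with_ucd(ucd):
    """Return 'centre/collection/table' paths for tables with a column of this UCD."""
    hits = []
    for centre in root.findall("dataCentre"):
        for coll in centre.findall("collection"):
            for table in coll.findall("table"):
                # Path predicate: a child <column> whose ucd attribute matches.
                if table.find(f"column[@ucd='{ucd}']") is not None:
                    hits.append("/".join(
                        (centre.get("name"), coll.get("name"), table.get("name"))
                    ))
    return hits


print(tables_with_ucd("PHOT_MAG_B"))
```

If only the dataCentre and collection levels were held in the Registry,
the inner loop over tables and columns would instead become a request to
the data centre concerned.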

Clearly we need to start off with a simple registry, which just handles
the simple functions, but the VOQL will need column metadata to work
properly on anything beyond extremely simple queries, so decisions on how to
store the necessary information need to be taken very soon, in my opinion.

-- 
Clive Page
Dept of Physics & Astronomy,
University of Leicester,
Leicester, LE1 7RH,  U.K.