Registry data structures

Fri Jun 25 12:20:09 PDT 2004

Hi Clive,

We have spent a good bit of time working on the problem you mention
below of interoperating between sparse observations and large surveys
(HST & SDSS for example).  Not sure if you are familiar with the idea of
'footprint' services (Szalay). Based on some aspects of the spatial
coverage concepts being developed, these services will allow rapid
coverage maps between resources, enabling finer probing into datasets.
The general idea is to be able to find common joins between resources
and the same ideology can be extended into band/time.  I'm not
trivializing the problem, yet this is an alternative solution which has
been proposed in other venues.  I also know CDS has spent a bit of time
on this in the past as well but I'm not current with their ideas as
much.

I guess along this line of thought my question to you is why couldn't
parameter mining draw on functional elements from the registry rather
than a persistent mapping?  There is a level of uniqueness with
resources that services can address efficiently.

I completely understand why having everything at your disposal makes
mining through heterogeneous sets appear friendly, but I also think looking
into the integration of extended (web) registry services might eliminate
a lot of static data definition and provide a very adaptable set of
interfaces.  You still would be storing potentially everything,  it is
more a registry implementation issue.  The sophistication in merging
would reside dually in the client and provider and allow a great range
of creative exploration on both ends.

-Gretchen

-----Original Message-----
From: owner-registry at eso.org [mailto:owner-registry at eso.org] On Behalf
Of Clive Page
Sent: Friday, June 25, 2004 12:04 PM
To: registry at ivoa.net
Cc: Mike Watson
Subject: Registry data structures

The figure for the volume of metadata that I derived in my last posting
is not huge: 50 GB is similar to some large source catalogues, but it
means searching it without indices or data structures to assist will be
infeasibly slow.

The problem comes with the intrinsically complex structure of our
datasets and therefore metadata.  We need to work out the main search
patterns, which may have implications for the data structure.

It would be nice if we had a good collection of use-cases which we could
analyse to work out likely search and usage patterns.  It may be that they
exist somewhere that I've missed.  The science and use-cases collected by
AstroGrid and NVO have mostly been written on the assumption that the
astronomers know what resources they need and where to find them, so the
most basic registry functionality can be ignored.  Some of the queries
need access to column-level metadata, but here it would be equally
feasible to get them from the data centre concerned, as from the
registry.

Spatial Coverage

Another unsolved problem in the registry system is how to store and
search spatial coverage data, e.g. to answer questions like: "where can
I find an <image|spectrum|source-list> in <band> around <RA,DEC>". The
point is that the existing methods work well only for the large
systematic surveys of large parts of the sky, but increasingly data
are coming on line from large collections of individual pointings, from
space observatories such as HST, Chandra, XMM-Newton, etc., and also
from ground-based telescopes in the optical, infra-red, and radio bands.
Finding which instrument has ever observed the patch of sky of interest
presents a real problem at present.  The Registry ought to be able to
assist.

Coverage.Spatial is a defined element in our current
service content metadata: this can specify patches of sky as circles,
polygons, etc.  As far as I know, however, we have not yet worked out a
good way of storing and searching these, given that a single mission
will often have covered thousands of small patches of sky.  In principle
a bitmask system would serve and Patricio Ortiz has been working on ways
of doing this; but covering the sky with a bit-mask with just a
one-degree resolution still requires about 8 kB per mask.  One would
like higher resolution to avoid too many false positives, yet this would
seem to use too much storage and would make searches rather too slow.
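
As a rough check on the storage figure above, here is a minimal Python
sketch.  It assumes a plain equirectangular grid (360 x 180 cells at
one-degree resolution) with one bit per cell; that layout and all the
function names are assumptions made for illustration, not the scheme
actually being developed.

```python
import math

def mask_bytes(cells_per_deg=1):
    """Bytes for a one-bit-per-cell all-sky mask on an equirectangular grid."""
    n_ra = 360 * cells_per_deg    # cells along right ascension
    n_dec = 180 * cells_per_deg   # cells along declination
    return math.ceil(n_ra * n_dec / 8)

def cell_index(ra_deg, dec_deg, cells_per_deg=1):
    """Bit position of the grid cell containing (ra_deg, dec_deg)."""
    n_ra = 360 * cells_per_deg
    n_dec = 180 * cells_per_deg
    col = min(int((ra_deg % 360.0) * cells_per_deg), n_ra - 1)
    row = min(int((dec_deg + 90.0) * cells_per_deg), n_dec - 1)
    return row * n_ra + col

def set_cell(mask, ra_deg, dec_deg, cells_per_deg=1):
    """Mark the cell containing the given position as observed."""
    i = cell_index(ra_deg, dec_deg, cells_per_deg)
    mask[i // 8] |= 1 << (i % 8)

def covers(mask, ra_deg, dec_deg, cells_per_deg=1):
    """True if the cell containing the given position is marked."""
    i = cell_index(ra_deg, dec_deg, cells_per_deg)
    return bool(mask[i // 8] & (1 << (i % 8)))

# One mask per resource; mark a single pointing and test two positions.
mask = bytearray(mask_bytes(1))
set_cell(mask, 83.6, 22.0)
print(mask_bytes(1), mask_bytes(10), mask_bytes(60))
print(covers(mask, 83.9, 22.5), covers(mask, 84.1, 22.5))
```

On this grid the one-degree mask takes 8100 bytes, and the size grows with
the square of the linear resolution.  The second test position illustrates
the false-positive issue in reverse: any position inside a marked cell
matches, whether or not the pointing actually reached it.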

Possible Data Structures

The two obvious cases are using a purely relational structure, and a
purely hierarchical one.  It may be that some intermediate case is
better, but I shall explore the extremes first.

If we go for a relational structure, in principle the Full Registry
could contain a small number of tables of metadata, perhaps one for all
tabular datasets, one for all images, etc. The table of tabular metadata
would have one row per column of each dataset, and then columns for each
item we want to be searchable, e.g.

 - PublisherId
 - Identifier (of table)
 - columnName
 - columnUCD

To make this table conform to the Date/Codd rules, the primary key to
guarantee uniqueness would be the combination of the first 3 fields
listed above, I should think.  We might well want indices on Identifier
and columnUCD.
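
A minimal sketch of this table, using Python's built-in sqlite3 module
purely for illustration.  The table name, index names, helper function and
sample rows are all invented placeholders; only the four column names and
the choice of key and indices come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tabular_metadata (
    PublisherId TEXT NOT NULL,
    Identifier  TEXT NOT NULL,   -- identifier of the table
    columnName  TEXT NOT NULL,
    columnUCD   TEXT,
    PRIMARY KEY (PublisherId, Identifier, columnName)
);
CREATE INDEX idx_tab_identifier ON tabular_metadata (Identifier);
CREATE INDEX idx_tab_ucd        ON tabular_metadata (columnUCD);
""")

# Placeholder rows: one row per column of each registered table.
rows = [
    ("pub.example/a", "cat1", "ra",   "POS_EQ_RA_MAIN"),
    ("pub.example/a", "cat1", "dec",  "POS_EQ_DEC_MAIN"),
    ("pub.example/b", "cat2", "RAJ",  "POS_EQ_RA_MAIN"),
    ("pub.example/b", "cat3", "name", "ID_MAIN"),
]
conn.executemany("INSERT INTO tabular_metadata VALUES (?, ?, ?, ?)", rows)

def tables_with_ucd(conn, ucd):
    """Return (PublisherId, Identifier) for tables having a column with this UCD."""
    cur = conn.execute(
        "SELECT DISTINCT PublisherId, Identifier FROM tabular_metadata "
        "WHERE columnUCD = ? ORDER BY PublisherId, Identifier",
        (ucd,),
    )
    return cur.fetchall()

print(tables_with_ucd(conn, "POS_EQ_RA_MAIN"))
```

The composite primary key rejects a second row for the same column of the
same table, and the index on columnUCD serves the "which tables hold a
quantity of this kind" type of search.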

For image metadata, there would be one row per image, and columns for:
 - PublisherId
 - Identifier (of the image)
 - Instrument
 - Coverage.Spatial
 - Coverage.Spectral or maybe Coverage.Spectral.Bandpass
 - Coverage.Temporal.StartTime and .StopTime
 - probably some elements covering resolution, sensitivity, etc.

The number of individual images accessible to VO users is perhaps only a
few millions to tens of millions, so this will only be a table of modest
size, but B-tree indices will be needed to allow efficient searching on
complex criteria.  Since each image only covers a small more-or-less
contiguous patch of sky, the Coverage.Spatial element could be indexed
by an R-tree (or similar) as built in to most modern RDBMS.
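
To make the one-row-per-image idea concrete, here is a small sketch, again
with Python's sqlite3 module.  It is not an R-tree: Coverage.Spatial is
reduced to an RA/Dec bounding box held in four ordinary columns, which is a
simplifying assumption, and RA wrap-around at 0/360 degrees is ignored.
Table, column and function names and the sample rows are invented for
illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE image_metadata (
    PublisherId TEXT NOT NULL,
    Identifier  TEXT NOT NULL,   -- identifier of the image
    Instrument  TEXT,
    ra_min  REAL, ra_max  REAL,  -- bounding box standing in for
    dec_min REAL, dec_max REAL,  -- Coverage.Spatial (degrees)
    PRIMARY KEY (PublisherId, Identifier)
);
CREATE INDEX idx_img_dec ON image_metadata (dec_min, dec_max);
""")

# Placeholder rows: one row per image.
conn.executemany(
    "INSERT INTO image_metadata VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("pub.example/a", "img001", "camera-1", 83.5, 83.8, 21.9, 22.2),
        ("pub.example/a", "img002", "camera-1", 10.0, 10.3, -5.2, -4.9),
        ("pub.example/b", "img003", "camera-2", 83.0, 84.0, 21.5, 22.5),
    ],
)

def images_containing(conn, ra_deg, dec_deg):
    """Identifiers of images whose bounding box contains the position."""
    cur = conn.execute(
        "SELECT Identifier FROM image_metadata "
        "WHERE ra_min <= ? AND ra_max >= ? AND dec_min <= ? AND dec_max >= ? "
        "ORDER BY Identifier",
        (ra_deg, ra_deg, dec_deg, dec_deg),
    )
    return [row[0] for row in cur.fetchall()]

print(images_containing(conn, 83.6, 22.0))
```

A true spatial index would replace the four range comparisons, but the shape
of the query (position in, list of matching images out) stays the same.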

This sort of structure, with one table covering all resources of a given
type in the whole VO, is only feasible if the Full Registry holds it. If
we go for the less detailed storage in the Full Registry, then we need
an index entry only for each table rather than each column, and each
collection of images, not for each image.  The latter will require a
much more complex data type for Coverage.Spatial, perhaps the 8kB
bitmask. Unfortunately these would not be handled by the built-in
indexing of any RDBMS that I know about.

The opposite extreme is to go for a hierarchy more-or-less mapping the
data.  The levels would be something like this:

level 0: data centre
  level 1: data collection (e.g. Vizier) or database
    level 2: individual table
      level 3: column

level 0: data centre
  level 1: observatory data collection
    level 2: observation
      level 3: exposure
        level 4: dataset (e.g. image, spectrum, etc.)
          level 5: FITS keyword (those worth indexing)

The appropriate language for searching would then be much nearer to XPath
than SQL.   Using such a hierarchical scheme, it would be perfectly
feasible to store the detailed levels at the data centre, with only the
upper levels of the hierarchy stored in the Registry.
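
A minimal sketch of what a path-style search over the tabular hierarchy
might look like, using the limited XPath subset in Python's
xml.etree.ElementTree.  The element and attribute names and all sample
values are assumptions invented for this example; no registry schema is
implied.

```python
import xml.etree.ElementTree as ET

# Placeholder document following levels 0-3 of the tabular hierarchy.
REGISTRY_XML = """
<registry>
  <dataCentre name="centre-a">
    <collection name="collection-1">
      <table name="table-x">
        <column name="ra"  ucd="POS_EQ_RA_MAIN"/>
        <column name="dec" ucd="POS_EQ_DEC_MAIN"/>
      </table>
      <table name="table-y">
        <column name="name" ucd="ID_MAIN"/>
      </table>
    </collection>
  </dataCentre>
</registry>
"""

root = ET.fromstring(REGISTRY_XML)

def tables_with_ucd(root, ucd):
    """Names of tables that have at least one column with the given UCD.

    The UCD is pasted into the path expression, so it must not contain quotes.
    """
    found = []
    for table in root.iterfind(".//table"):
        if table.find(f"column[@ucd='{ucd}']") is not None:
            found.append(table.get("name"))
    return found

print(tables_with_ucd(root, "POS_EQ_RA_MAIN"))
```

If only the upper levels were held in the Registry, the column elements
would be absent there and the inner test would have to be sent on to the
data centre concerned.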

Clearly we need to start off with a simple registry, which just handles
the simple functions, but the VOQL will need column metadata to work
properly on anything beyond very simple queries, so decisions on how
to store the necessary information need to be taken very soon, in my
opinion.

-- 
Clive Page
Dept of Physics & Astronomy,
University of Leicester,
Leicester, LE1 7RH,  U.K.