Data set metadata schemas

Anita Richards amsr at jb.man.ac.uk
Tue Jun 17 12:05:43 PDT 2003



I have learnt from the debate on Registry Schema but I can't add much
except to say that I think the differences will only be resolved in
use.  We should start applying the schema we have got to real data
sets and science queries (even if these have to be carefully selected
at first) - and that seems to mean starting from
http://www.ivoa.net/internal/IVOA/IvoaResReg/ResourceServiceMetadataV7.pdf
(RSMV7). So to that extent I agree with Bob, and maybe that is not
controversial as I think I am only talking about what Tony says it is
OK for, i.e. "Sky coverage services" if that is taken to include
spatial, spectral and temporal coverage and other metadata tied to data
sets. That is, what information does the Registry need to hold to
select potentially useful data sets, and to use their metadata to select
the appropriate services to then access the data for processing? (I
am not commenting on the parts of the registry which describe
appropriate services.)

I would like to understand better how feasible it is to link sections,
for example an entry in COMMUNITY may be the same as one for
Contributor in CURATION.

We also need to think a bit more about how to acquire the dataset
metadata. At least at first, we want to make sure this is done in an
exemplary fashion because we will be judged by the results, so it is
no good using difficulty in getting information as an excuse.  In my
experience with the 4 data sets so far, the relevant information is
not held in one place, it requires human searching of web-sites and
human discrimination, for example to decide what is the region of
regard for a catalogue - PSF (but what about systematic errors)? Pixel
size (but this is arbitrary in radio images)? Largest error given (may
be spurious/huge)? Eventually we will have algorithms to help decide
but these will be evolved through experience, not trying to imagine
all possible circumstances.  For data sets which are actively curated,
we can ask someone to fill in a questionnaire; however, again, we will
only discover what is open to misinterpretation after a few rounds
with archivists.  More seriously, we do not yet have the kudos to get
people to fill them in unless they are already VO enthusiasts and even
then they often just point you at web sites with (usually) far too
much detail.

However, I suggest that we start by designing a plain text form which
gives examples/selections where appropriate.  This could be
interpreted and written to xml using a perl script, which would also
catch the commonest ambiguities (metre/meter) and unit conversions.
We can progress to a web form as long as it is really
platform-independent and avoids problems with over-long selection
lists, instability if completion is interrupted etc.  We are going to
have to solve this problem anyway of course for user input! The
protocols for submitting data sets to CDS are one precedent, I would
welcome comments from people involved with that.
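
As an illustration of the plain text form idea above, here is a
minimal sketch of a form interpreter. The message suggests a perl
script; Python is used here purely for illustration. The field names
(Title, MinWavelength, MaxWavelength), the Resource root element and
the unit table are all assumptions made for the example and are not
taken from RSMV7 or the AstroGrid schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical unit table: accepted spellings mapped to a factor in metres.
TO_METRES = {"metre": 1.0, "meter": 1.0, "m": 1.0, "cm": 1e-2, "mm": 1e-3}

# Hypothetical form fields that hold a length and need unit conversion.
LENGTH_FIELDS = {"MinWavelength", "MaxWavelength"}


def parse_form(text):
    """Parse 'Key: value' lines, skipping blank lines and '#' comments."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"cannot interpret line: {line!r}")
        fields[key.strip()] = value.strip()
    return fields


def length_in_metres(value):
    """Convert e.g. '21 cm' or '2 meters' to a float in metres."""
    number, unit = value.split()
    unit = unit.lower().rstrip("s")  # treat 'meters'/'metres' as singular
    if unit not in TO_METRES:
        raise ValueError(f"unknown length unit: {unit!r}")
    return float(number) * TO_METRES[unit]


def form_to_xml(text):
    """Write the parsed form as a flat XML document (illustrative only)."""
    root = ET.Element("Resource")
    for key, value in parse_form(text).items():
        child = ET.SubElement(root, key)
        if key in LENGTH_FIELDS:
            child.set("unit", "metre")
            child.text = f"{length_in_metres(value):g}"
        else:
            child.text = value
    return ET.tostring(root, encoding="unicode")
```

Calling form_to_xml("Title: Test survey\nMaxWavelength: 21 cm\n")
would give a Resource element whose MaxWavelength child is expressed
in metres. Unrecognised units raise an error rather than being
guessed, so ambiguities come back to the person filling in the form.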

The AstroGrid Registry work-group have created a set of Resource
Registry schemas for AstroGrid.  These are based on RSMV7 with a few
additions suggested by trying to use them to describe four real
datasets.  I apologise for the baby xml, I am trying to learn - all
mistakes are my responsibility alone. I also apologise for possibly
reinventing (but less adequately) the schemas linked to
http://www.ivoa.net/internal/IVOA/IVOARegWp03/MDinXML-Summary.html -
however I think I am covering a small part of this in more detail.  I
also note that these are based on RSMV6 which explains some of the
differences in organisation.

 You can find my schemas for AstroGrid at
http://wiki.astrogrid.org/bin/view/Astrogrid/RegistryIt02Schema - see
a little way down the page:

------------------------------

"Iteration 2 resource registry schema

...
resourceRegistry.xsd and the component schemas for describing an
astronomical/solar/STP resource: identity.xsd, curation.xsd,
content.xsd, service.xsd."
...

"Examples of the identity, curation and content xml files (in a single
file) have been prepared for the 1XMM (X-ray satellite), SURF (Solar),
USNO-B (reference stars), WFCSUR (Isaac Newton Telescope survey)
archives."

and
http://wiki.astrogrid.org/bin/view/Astrogrid/RegistryUnits
which explains where/why I have added to RSMV7.  In summary, the
differences are:

CURATION

1) I have added some elements to describe the size of data sets - in
   Mb, and for tabular data nRows/nCols, or nPixels for 'image' data
   (extensible to any number of dimensions).  This is to aid
   optimising the order of query execution and in case servers have
   limits on the size of data which can be returned/need to invoke a
   cutout server for images etc.

CONTENT

2) Added element for UCDs - this will be for dumb matching at first,
   but can become more sophisticated or be moved to a different level
   as UCDs become more sophisticated.

3) Added spatial region Healpix - this is the CMB way of indexing the
   sky, added at the request of the Planck people.  At the coarsest
   there are 12 regions.

4) In a future iteration we should extend region of regard to the
   spectral and temporal regimes.  NB I don't think this is the same
   as resolution in most cases; for source lists the error may be
   greater than the resolution (e.g. systematic errors due to
   reference source position uncertainty) or less (point source at
   good signal-to-noise); for images the spatial size of a single
   image is the same as the resolution for e.g. 1D radio spectra, but
   not for a radio synthesis or a CCD image.

5) Added UNKNOWN to cframe types and spectral waveband coverage; we
   might want this elsewhere as well.  At present this is mainly because I
   do not know how to deal with solar data but it might be a useful
   general distinction between 'exists but unknown' v. 'NULL'.

6) In a future iteration, add after object count coverage etc., the
   spatial fraction of the BOX etc. covered by images, and similarly
   for spectral and temporal coverage.

7) Added Resolution (spatial spectral temporal)

8) Added Data Quality (spatial spectral temporal)
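
On item 3), the figure of 12 regions at the coarsest level follows
from the standard HEALPix relation npix = 12 * nside^2, with nside = 1
at the base level. A minimal sketch of that arithmetic is below; the
function names are illustrative only and this is not code from any
HEALPix library.

```python
import math


def healpix_npix(nside):
    """Number of equal-area HEALPix pixels for resolution parameter nside."""
    # nside is restricted here to positive powers of two (1, 2, 4, ...).
    if nside < 1 or nside & (nside - 1):
        raise ValueError("nside must be a positive power of two")
    return 12 * nside * nside


def healpix_pixel_area_sq_deg(nside):
    """Area of one pixel in square degrees (all pixels have equal area)."""
    full_sky_sq_deg = 4 * math.pi * (180 / math.pi) ** 2  # about 41253
    return full_sky_sq_deg / healpix_npix(nside)
```

So healpix_npix(1) gives the 12 base regions and each doubling of
nside multiplies the number of regions by four.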

Other future additions

   * Allow coverage to include multiple non-contiguous regions in
     spatial spectral and temporal domains, e.g. to allow for discrete
     radio wavebands covering 1.3-1.7 GHz, 4.5-6.7 GHz, 21-24 GHz etc.
     These are not bandpasses because any individual observation
     probably only covers a smaller region e.g. 16 MHz within this.
     Similarly for optical observations of variable stars which can
     only be observed when they are in the night sky, etc. This is a
     more sophisticated variant on 6) above.

   * Allow element values to be inter-dependent, e.g. the radio
     resolution depends on wavelength and in the above example varies
     by a factor of almost 20.

   * We discussed adding (probably to CURATION) a set of elements to
     cover linked data sets, for example if the properties of observed
     sources and the coverage of the facility used are in separate
     tables. However this may be covered by the separate Data
     Collection section? Or is this more like an entire data centre
     e.g. MAST, LEDAS?
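
To make the first two points concrete, here is a minimal sketch of
non-contiguous spectral coverage held as a list of intervals, using
the radio wavebands quoted above. The representation and function
names are assumptions for illustration, not part of any schema, and
the middle band is assumed to be in GHz like the others.

```python
# Coverage as (low, high) pairs in GHz, taken from the example above.
# Assumption: the middle band, quoted without a unit, is also in GHz.
RADIO_BANDS_GHZ = [(1.3, 1.7), (4.5, 6.7), (21.0, 24.0)]


def overlaps(coverage, low, high):
    """True if the query range [low, high] intersects any covered band."""
    return any(low <= band_high and high >= band_low
               for band_low, band_high in coverage)


def resolution_ratio(coverage):
    """Ratio of coarsest to finest resolution across the coverage.

    Assumes resolution scales with wavelength (i.e. inversely with
    frequency) for a fixed array, as in the radio example.
    """
    lowest = min(band_low for band_low, _ in coverage)
    highest = max(band_high for _, band_high in coverage)
    return highest / lowest
```

A 16 MHz observation at 1.600-1.616 GHz falls inside the first band,
while a query for 2-3 GHz matches nothing. resolution_ratio on these
bands gives 24 / 1.3, roughly 18.5, which is the "factor of almost 20"
mentioned above.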

SERVICE

9) Added maximum image size allowed by service to the restrictions.
   There could be more, e.g. maximum time interval to search etc.
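
As a sketch of how the size elements in 1) and the restriction in 9)
might be used together, the check below decides whether a requested
image exceeds a service limit and so needs a cutout. The limit value
and function names are hypothetical, chosen only for the example.

```python
import math


def total_pixels(shape):
    """Total pixel count for an image of any number of dimensions."""
    return math.prod(shape)


def needs_cutout(shape, max_pixels):
    """True if the image exceeds the service's maximum image size."""
    return total_pixels(shape) > max_pixels


# Hypothetical service restriction: at most one million pixels returned.
MAX_PIXELS = 1_000_000
```

With this limit, a 4096 x 4096 image would have to go through a cutout
server, while a 512 x 512 image could be returned whole.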

best wishes

Anita

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Dr. Anita M. S. Richards, AVO Astronomer
MERLIN/VLBI National Facility, University of Manchester,
Jodrell Bank Observatory, Macclesfield, Cheshire SK11 9DL, U.K.
tel +44 (0)1477 572683 (direct); 571321 (switchboard); 571618 (fax).


