Harvesting

Clive Page cgp at star.le.ac.uk
Tue Sep 9 09:04:47 PDT 2003


> Firstly, does everyone agree that harvesting is the next big issue?

I also agree that it's the next big issue.  And I think the VO has to be
designed to make automatic harvesting feasible.  We may be able to make an
initial registry by hand, but in the long term the hard-pressed
administrators of data centres are going to struggle to keep their
registry entries up-to-date and accurate.  We should follow the example
of internet search engines and make harvesting automatic.

What is needed, I think, is a standard Web Service for each site which
allows extraction of metadata.  This needs to allow exploration of a
hierarchical structure, much as search engines explore directory
structures.  Some of our structures have several levels, e.g. the CDS
has Aladin, Simbad, and Vizier as components; Vizier has thousands of
tables as components; each table has many columns; each column has
components such as name, UCD, units, and maybe comments.  If we can
design a suitable Web Service allowing a search engine to explore this
metadata structure, we make automatic harvesting feasible, and perhaps
not too difficult.  From the top-level metadata WSDL, everything should
be extractable.
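
To make the idea concrete, here is a rough sketch (in Python, purely
illustrative) of how a harvester might walk such a hierarchy.  The two
operations, listing the children of a node and fetching a node's metadata
record, are my assumption of the minimal interface needed, not an agreed
design; a real harvester would make these SOAP calls described by the
site's WSDL rather than the in-memory stand-ins used here.

  # Illustrative sketch only: list_children/get_metadata is a hypothetical
  # minimal harvesting interface, not an agreed VO standard.

  class MetadataNode:
      """In-memory stand-in for one node of a site's metadata hierarchy."""
      def __init__(self, name, metadata=None, children=None):
          self.name = name
          self.metadata = metadata or {}
          self.children = children or []

  def list_children(node):
      """Stand-in for a Web Service call returning a node's child nodes."""
      return node.children

  def get_metadata(node):
      """Stand-in for a Web Service call returning a node's metadata record."""
      return node.metadata

  def harvest(node, path=()):
      """Walk the hierarchy depth-first, yielding (path, metadata) pairs."""
      path = path + (node.name,)
      yield path, get_metadata(node)
      for child in list_children(node):
          yield from harvest(child, path)

  if __name__ == "__main__":
      # Toy hierarchy echoing the CDS example: site -> service -> table -> columns.
      ra = MetadataNode("RAJ2000", {"ucd": "POS_EQ_RA_MAIN", "units": "deg"})
      dec = MetadataNode("DEJ2000", {"ucd": "POS_EQ_DEC_MAIN", "units": "deg"})
      table = MetadataNode("I/239/hip_main",
                           {"description": "Hipparcos main catalogue"},
                           [ra, dec])
      vizier = MetadataNode("VizieR", {"type": "catalogue service"}, [table])
      cds = MetadataNode("CDS", {"publisher": "CDS, Strasbourg"}, [vizier])

      for path, meta in harvest(cds):
          print("/".join(path), meta)

The point of the sketch is only that a registry crawler needs nothing more
than "give me this node's metadata" and "give me its children" to extract
everything below the top-level entry point automatically.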

Accuracy is an important issue: anything that humans do has errors.
If humans have to update the registry, its quality will be lower than if
it is done by machine.  Of course there will be errors in the metadata,
which will get propagated to the registry, but then there is only one
error to fix, and once it is fixed, the registry will automatically be
corrected at the next update.


-- 
Clive Page
Dept of Physics & Astronomy,
University of Leicester,    Tel +44 116 252 3551
Leicester, LE1 7RH,  U.K.   Fax +44 116 252 3311


