OAI for Virtual Observatory

Roy Williams roy at cacr.caltech.edu
Sun Jan 26 08:19:34 PST 2003


Open Archives Protocol and the Virtual Observatory

This note is a summary of my experience of the OAI protocol. It looks
to be simple, flexible, and extensible, and all expressed in simple
XML. I would like to team up with some others in the project who would
like to use OAI tools in a practical project, perhaps as a registry
for the Cone Search and SIAP implementations, perhaps as a new face for
Vizier .... or another prototype project?

Roy
-------------------------------------------------------

The OAI protocol is described in
http://www.openarchives.org/OAI/openarchivesprotocol.html.
The protocol is used to describe collections of digital resources. A
large number of servers that speak it can be reached through the
Repository Explorer from Virginia Tech, which you can try at
http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
This is the tool that I have used to try to understand what OAI-PMH
actually does.

The protocol handles metadata bundles that can refer to arbitrary
entities. Each metadata bundle includes a header, with an identifier
that can be used to refer to it, as well as a date stamp. Identifiers
can be built in many ways, but most practitioners have chosen a
hierarchical scheme of some sort. Here is an example from a math
server at Cornell called Project Euclid:
oai:CULeuclid:euclid.em/999188417
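
To make the hierarchical scheme concrete, here is a small Python sketch
that splits the identifier above into its parts. It assumes only the
three-part, colon-separated layout visible in this one example; the
function name and the field names are my own labels, not something
defined by the protocol.

```python
from typing import NamedTuple


class OaiIdentifier(NamedTuple):
    scheme: str      # "oai" in the example above
    repository: str  # "CULeuclid" in the example above
    local_id: str    # everything after the second colon


def split_identifier(identifier: str) -> OaiIdentifier:
    """Split an identifier shaped like 'oai:CULeuclid:euclid.em/999188417'.

    Assumption: exactly the three-part layout seen in the example; other
    repositories may build their identifiers differently.
    """
    parts = identifier.split(":", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a three-part identifier: {identifier!r}")
    return OaiIdentifier(*parts)


example = split_identifier("oai:CULeuclid:euclid.em/999188417")
print(example.repository, example.local_id)
```

The local part ("euclid.em/999188417") is kept whole, since how a
repository structures it is its own business.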

Examples of metadata bundles from the Repository Explorer include
journal articles, museum artifacts, classical Greek texts, and lost
languages. The metadata bundle is an abstract concept that can be
"viewed" in different ways, each expressed by a "Metadata Format" (MF),
which we can think of as a "point of view" on the metadata, and a
corresponding query mechanism. Each MF is expressed by an XML Schema,
so that a metadata record is always an instance of one of the schemas
that the repository supports. The most ubiquitous MF is the Dublin
Core, with 15 keywords -- Title, Creator, Subject, Description,
Publisher, etc.

In the Virtual Observatory, we could imagine some MFs to represent
information about astronomical datasets:

(1) One might express the provenance of the dataset, the organization
that funded it, the responsible parties in the data reduction,
authentication requirements, and so on.

(2) A different point of view of a dataset would express an MF about
sky coverage, wavelength, and pixel scale. Querying this would enable
the "Gamma-Ray Burst" demo, which entails getting information about a
point in the sky.

(3) A third MF could be for those who store and replicate datasets.
The metadata record would concern the size and granularity of the
dataset, the nature of the services that provide it, and the replicas
of that data.

The purpose of the OAI protocol is to expose metadata through queries,
with the result in any of the MFs that have been implemented. There
are six "verbs" in the protocol, listed below. I illustrate the
verbs with the Project Euclid repository of math e-journals. The given
links produce the XML output directly; you can see all the same
information in human-readable form with the Repository Explorer.
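
Since every request is just a base URL plus a "verb" and a few
arguments, the links in this note can also be assembled
programmatically. The following Python sketch shows one way to do it.
The helper name oai_request_url is made up for illustration, and the
base URL is simply the Project Euclid address used in this note.

```python
from urllib.parse import urlencode

# Base URL taken from the Project Euclid examples in this note.
BASE_URL = "http://ProjectEuclid.org/Dienst"


def oai_request_url(verb: str, base_url: str = BASE_URL, **arguments: str) -> str:
    """Return the GET URL for one request (hypothetical helper).

    The verb always comes first, followed by the other arguments in the
    order given. urlencode percent-encodes reserved characters such as
    ':' and '/' inside argument values.
    """
    query = [("verb", verb)] + list(arguments.items())
    return base_url + "?" + urlencode(query)


print(oai_request_url("ListSets"))
print(oai_request_url("ListRecords", metadataPrefix="oai_dc", set="em"))
```

Note that an identifier passed as an argument comes out with its colons
and slash percent-encoded, so the URL will look different from one
typed by hand even though it names the same record.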

** Identify
Each repository has an "identity" record, with basic information such
as name, URL, administrator's email, and other information. Try for
example
http://ProjectEuclid.org/Dienst?verb=Identify

** List Metadata Formats
This verb returns an XML-encoded list of the MFs that can be delivered
from this repository. In Project Euclid, there are two MFs, Dublin
Core and another for more precise bibliographic information -- it has
28 keywords, including copyright, funding, number of pages, etc.
Each MF has an identifier, for example Dublin Core is "oai_dc".
http://ProjectEuclid.org/Dienst?verb=ListMetadataFormats

** List Sets
Each repository is assumed to hold the metadata for several "sets",
each grouping the metadata for a coherent collection of entities. For
example, the Project Euclid OAI server hosts mathematical online
journals. The sets that it hosts include "Annals of
Mathematics", "Experimental Mathematics", and the "Journal of Applied
Probability". Thus each OAI "set" is a journal, and in the "set" are
the metadata for the corresponding journal articles. Each set has an
identifier, for example "Experimental Mathematics" is identified as
"em".
http://ProjectEuclid.org/Dienst?verb=ListSets
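
As an illustration of how little code is needed to consume such a
response, here is a Python sketch that pulls the set identifiers and
names out of a ListSets reply. The sample XML is hand-written by me in
the OAI-PMH 2.0 layout, not copied from the Project Euclid server, and
it contains only the one set whose identifier is given above; a server
speaking an older version of the protocol may use a different
namespace.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample, trimmed to the part that matters here.
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set>
      <setSpec>em</setSpec>
      <setName>Experimental Mathematics</setName>
    </set>
  </ListSets>
</OAI-PMH>
"""


def parse_sets(xml_text: str) -> Dict[str, str]:
    """Map each set identifier (setSpec) to its display name (setName)."""
    root = ET.fromstring(xml_text)
    sets = {}
    for node in root.findall("./oai:ListSets/oai:set", OAI_NS):
        spec = node.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = node.findtext("oai:setName", default="", namespaces=OAI_NS)
        sets[spec] = name
    return sets


print(parse_sets(SAMPLE_LIST_SETS))
```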

** List Identifiers
This verb takes a metadata format and, optionally, one of the sets,
and it returns the identifiers of the matching records. The following
request returns the identifiers of all records from the Experimental
Math Journal (em) that are available in the Dublin Core format
(oai_dc).
http://ProjectEuclid.org/Dienst?verb=ListIdentifiers&metadataPrefix=oai_dc&set=em

** List Records
Same as List Identifiers, but returns the full metadata records
rather than just the identifiers.
http://ProjectEuclid.org/Dienst?verb=ListRecords&metadataPrefix=oai_dc&set=em

** Get Record
Takes an identifier of a metadata bundle, and a requested Metadata
Format, and returns the metadata rendered in the appropriate format.
http://projecteuclid.org/Dienst?verb=GetRecord&metadataPrefix=oai_dc&identifier=OAI:CULeuclid:euclid.em/999188417
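
To show what a client might do with the result, here is a Python sketch
that reads the header and the Dublin Core fields out of a GetRecord
reply. The sample XML is again hand-written in the OAI-PMH 2.0 layout;
the datestamp, title, and creator in it are placeholders, not the real
metadata of this article, and the function name is my own.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample; datestamp, title and creator are placeholders.
SAMPLE_GET_RECORD = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:CULeuclid:euclid.em/999188417</identifier>
        <datestamp>2003-01-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Placeholder title</dc:title>
          <dc:creator>Placeholder author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def summarize_record(xml_text: str) -> dict:
    """Return the header fields and Dublin Core fields of one record."""
    root = ET.fromstring(xml_text.strip())
    record = root.find("./oai:GetRecord/oai:record", NS)
    if record is None:
        raise ValueError("no record in response")
    fields = {}
    dc = record.find("./oai:metadata/oai_dc:dc", NS)
    if dc is not None:
        for child in dc:
            # Strip the "{namespace}" prefix that ElementTree puts on tags.
            name = child.tag.split("}", 1)[-1]
            # A Dublin Core keyword may repeat, so collect values in a list.
            fields.setdefault(name, []).append(child.text or "")
    return {
        "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
        "datestamp": record.findtext("oai:header/oai:datestamp", namespaces=NS),
        "fields": fields,
    }


print(summarize_record(SAMPLE_GET_RECORD))
```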

---------
Caltech Center for Advanced Computing Research
mail    roy at cacr.caltech.edu
voice   +1 626 395 3670


