Registries, IVO ids, and Data Set Identifiers

Tom McGlynn Thomas.A.McGlynn at nasa.gov
Mon Sep 22 12:06:00 PDT 2003


Since I was cut off from civilization (or electricity at least) for a bit,
I've been thinking about the issues that have been raised regarding IVO
identifiers, registries and data set ids.  Below is
a review of  the issues and a suggested synthesis.  I'm sending this to both
the ADEC ITWG mailing list and the IVO registry group...

	Tom

A. IVO Identifiers.

The suggested format for the IVO identifiers (or at least one representation
of the identifiers) is a string of the form:

   ivo://authority.id/uri.string#fragmentSpecification

It sounds like  there is basic agreement here, although there
are some questions as to whether all of these need to be defined.
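
As a rough illustration of the proposed layout, here is a minimal sketch that
splits an identifier into authority ID, URI string and fragment using Python's
standard urllib.parse.  The names IvoId and parse_ivo_id are hypothetical, and
tolerating a missing ivo:// prefix is an assumption of the sketch, not
something that has been agreed.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class IvoId(NamedTuple):
    """Hypothetical container for the parts of an IVO identifier."""
    authority: str           # e.g. "authority.id"
    path: str                # the "uri.string" part, without the leading "/"
    fragment: Optional[str]  # None when no "#fragment" is present


def parse_ivo_id(text: str) -> IvoId:
    """Split ivo://authority.id/uri.string#fragment into its parts.

    Assumption: a missing "ivo://" prefix is tolerated and filled in.
    """
    if "://" not in text:
        text = "ivo://" + text
    parts = urlsplit(text)
    if parts.scheme != "ivo":
        raise ValueError(f"not an IVO identifier: {text!r}")
    if not parts.netloc:
        raise ValueError(f"missing authority ID: {text!r}")
    path = parts.path.lstrip("/")
    # Keep a query string, if any, attached to the URI string part.
    if parts.query:
        path += "?" + parts.query
    return IvoId(parts.netloc, path, parts.fragment or None)


print(parse_ivo_id("ivo://authority.id/uri.string#fragmentSpecification"))
```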

The ivo:// simply identifies that this is an IVO URI.  In many contexts one
would expect it to be omitted, just as one can omit the http:// at the
beginning of a Web address when it is clear that a Web address is meant.

The Authority.ID is a string with the same restrictions on the character set
as the hostname in an HTTP URL (or perhaps we are a little more
lenient).  Each authority ID is associated with one institution which is
responsible for creating URIs within the name space defined by the
authority ID. The authority ID might be based upon the institution
controlling it, e.g.,

    mast.stsci.edu or ncsa.uiuc.edu

Such authority IDs have the advantage that they
naturally avoid name collisions, but at a cost of associating the
entities being pointed to in the namespace with a particular
institution.  While a given authority ID is  associated
with only one institution, a given institution may be responsible for
multiple authority ids.  Some TBD mechanism will be created
to manage the authority IDs.  Presumably this will involve a registry
of valid name spaces and their associated institutions.  Authority
IDs are in fact a distinct type of  VO resource that needs to be managed.
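
Since the management mechanism is still TBD, the following is only a sketch
of the one constraint stated above: an authority ID belongs to exactly one
institution, while an institution may hold several IDs.  The class
AuthorityRegistry, its method names, the hostname-style character check and
the case-insensitive comparison are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Assumption: authority IDs use hostname-like labels (letters, digits, "-")
# separated by dots; the actual character set is still under discussion.
_AUTHORITY_RE = re.compile(
    r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$"
)


class AuthorityRegistry:
    """Hypothetical in-memory registry of authority IDs."""

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}

    def register(self, authority_id: str, institution: str) -> None:
        """Record that `institution` is responsible for `authority_id`."""
        key = authority_id.lower()  # assumption: IDs compare case-insensitively
        if not _AUTHORITY_RE.match(key):
            raise ValueError(f"invalid authority ID: {authority_id!r}")
        current = self._owner.get(key)
        if current is not None and current != institution:
            raise ValueError(f"{authority_id!r} already belongs to {current!r}")
        self._owner[key] = institution

    def owner(self, authority_id: str) -> str:
        """Return the single institution responsible for an authority ID."""
        return self._owner[authority_id.lower()]

    def ids_for(self, institution: str) -> List[str]:
        """Return every authority ID held by one institution."""
        return sorted(a for a, i in self._owner.items() if i == institution)


registry = AuthorityRegistry()
registry.register("mast.stsci.edu", "STScI")
registry.register("ncsa.uiuc.edu", "NCSA")
```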

The URI string has the same restrictions on it as the comparable part of
an HTTP URL. The intent is that this uniquely defines a resource
in the namespace of the given authority ID.
The big issue here is whether an otherwise valid string of the form
   ivo://authority.id/uri.string
is an IVO id if this string is not included in a registry somewhere.  At
some level this is a semantic distinction...  Clearly there will be
a period of time between when a resource is 'activated' in some fashion
and when it first gets formally registered.  Perhaps we should
use the phrase 'registered resource' in contexts where 'unregistered
resources' may also exist, and recognize that IVO systems will generally
find only registered resources, so that when one uses the word 'resource'
one typically means just those that are registered.

There is another important issue hiding here.  Can a resource be
registered without being published?  By publishing I mean
that another registry can copy out the entire contents of the registry
in which the resource is included.

The registries we have built for services assume that the number of
entries is relatively small and that replication of all entries is both
desirable and  straightforward.  However, one can imagine
registries at a finer grained level: e.g., NED and SIMBAD are
essentially registries of astronomical
objects with tens of millions of entries and correspondingly frequent
updates.  While I believe that it makes sense that registries of services
should generally  'publish' these services -- in the sense
that another registry can harvest the results -- I do not think this is
necessarily the case for other kinds of registries.  Such registries
would still have a query interface that allows the user to query
for information based upon characteristics in the system.   Such fine
grained registries already exist: NED and SIMBAD are not just big catalogs.
How these are to be integrated within the VO is an issue that we will
need to address.

Back to the IVO id...
The fragment identifier is a bit controversial.  However, in analogy with
its role in HTTP URLs, a suggested role is to specify a piece of
a resource.  The analogy with the HTTP specification isn't perfect.
The data retrieved for a given HTTP URL is not affected by the
fragment, only the starting location of the display of the document.  In
some of our discussions we are suggesting using the fragment ID to
point to different atoms of data rather than to different locations
within the same document.
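
The HTTP behaviour referred to above can be seen with urldefrag from Python's
standard urllib.parse, which separates the fragment from the part of the URL
that determines what is retrieved.  The example.org URLs are placeholders.

```python
from urllib.parse import urldefrag

# Two URLs that differ only in their fragment (example.org is a placeholder).
first = urldefrag("http://example.org/paper.html#section2")
second = urldefrag("http://example.org/paper.html#section5")

# The part that identifies the retrieved document is identical ...
same_document = first.url == second.url
# ... and only the fragment, which the client handles locally, differs.
fragments = (first.fragment, second.fragment)

print(same_document, fragments)
```

Using the fragment to select different atoms of data would mean that two
identifiers differing only after the '#' no longer name the same retrieved
content, which is where the analogy stops.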


B. Data set identifiers.

The current driving goal of the data set identifiers is to enable the linkage of
data between the literature and the archives.  The requirements for
these ids are relatively simple.

   - They need to be permanent on a time scale longer than the archives
     that currently serve them
   - They need to be relatively easy to use.

The journals suggested that the names be divided into two fields: the
first specifies the observatory location and telescope from which the
data was taken; the second identifies the specific observation.

To support these identifiers several ADEC institutions have built data
set verification services which check whether a given identifier is
known at the  archive.  The ADS has built a master identification
service which sends requests to the individual  services and collates
the results.  The verification services simply indicate whether the
dataset is known to the archive and, if it is, return a URL that may
assist users in finding the data.  The URL can be a link to the data itself,
or to some other web page, e.g., a notice that the dataset is currently
proprietary.
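
The fan-out-and-collate arrangement described above can be sketched as
follows.  This is not the interface of the actual ADS or archive services: the
Verifier signature, the function names, the in-memory stand-ins and the
example.org URL are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional

# A verifier takes a dataset identifier and returns a URL if the archive
# knows the dataset, or None otherwise.  (Assumed interface, not the real one.)
Verifier = Callable[[str], Optional[str]]


def verify_everywhere(dataset_id: str,
                      verifiers: Dict[str, Verifier]) -> Dict[str, str]:
    """Ask every archive about dataset_id and collate the positive answers."""
    found: Dict[str, str] = {}
    for archive, verify in verifiers.items():
        url = verify(dataset_id)
        if url is not None:
            found[archive] = url
    return found


def table_verifier(known: Dict[str, str]) -> Verifier:
    """Stand-in verifier backed by a lookup table instead of a remote call."""
    return lambda dataset_id: known.get(dataset_id)


verifiers = {
    "HEASARC": table_verifier(
        {"Sa/ROSAT:X/RH300001N00": "http://example.org/rosat/rh300001n00"}
    ),
    "another-archive": table_verifier({}),
}
print(verify_everywhere("Sa/ROSAT:X/RH300001N00", verifiers))
```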

The format currently used is something like:

     Sa/ROSAT:X/RH300001N00

where the string before the ':' is the observatory location/telescope,
and the right hand side is a ROSAT specific string.
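
A minimal sketch of the two-field split, assuming the ':' separator shown in
the example.  The names DatasetId and split_dataset_id are hypothetical, and
the right hand side is deliberately left uninterpreted because its content is
up to the archive.

```python
from typing import NamedTuple


class DatasetId(NamedTuple):
    """Hypothetical two-field view of a journal dataset identifier."""
    facility: str     # observatory location/telescope, e.g. "Sa/ROSAT"
    observation: str  # archive-specific string, e.g. "X/RH300001N00"


def split_dataset_id(text: str) -> DatasetId:
    """Split at the first ':'; the right-hand side is not interpreted."""
    facility, sep, observation = text.partition(":")
    if not sep or not facility or not observation:
        raise ValueError(f"expected 'facility:observation', got {text!r}")
    return DatasetId(facility, observation)


print(split_dataset_id("Sa/ROSAT:X/RH300001N00"))
```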

Currently this example string would be managed by the HEASARC, but in 20 or 30
years, responsibility for the XTE data might be inherited by some other
institution. However, there is a strong NASA institutional
commitment to maintaining the XTE data essentially forever.

A long list of suggested observatory location/telescope strings has been
prepared by the journals, and associating the data with the telescope
that took it seems appropriate for these IDs.  However,
there is no requirement for the specific formatting of these IDs.  Now
that the IVO IDs have sufficiently matured, adapting the existing services
to use the IVO format would be straightforward (I believe).

C. Suggestions for IVO data set ids.

The discussion of putting dataset ids into the IVO has produced several
distinct formats...  Let's consider a ROSAT X-ray observation with an
internal observation id of rh300001n00.  Note that ROSAT also carried an EUV
telescope, so we prefix all of the IDs with 'x' to indicate data from the
X-ray telescope.

Formats that have been discussed include

    1. ivo://sa.rosat/x/rh300001n00
    2. ivo://sa.rosat.x/rh300001n00
    3. ivo://sa.rosat/x#rh300001n00
    4. ivo://sa.rosat.x/#rh300001n00
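
To make the differences between the four forms concrete, the sketch below
builds each of them from the same pieces.  The function name and its
parameters are hypothetical, and lower-casing every component is an assumption
taken from the examples rather than an agreed rule.

```python
from typing import List


def candidate_ivo_ids(location: str, mission: str,
                      instrument: str, obs_id: str) -> List[str]:
    """Return the four discussed forms, in the order they are listed."""
    base = f"{location}.{mission}".lower()
    inst = instrument.lower()
    obs = obs_id.lower()
    return [
        f"ivo://{base}/{inst}/{obs}",   # 1: instrument and dataset in the path
        f"ivo://{base}.{inst}/{obs}",   # 2: instrument in the authority ID
        f"ivo://{base}/{inst}#{obs}",   # 3: dataset as a fragment
        f"ivo://{base}.{inst}/#{obs}",  # 4: both of the above
    ]


for form in candidate_ivo_ids("Sa", "ROSAT", "X", "RH300001N00"):
    print(form)
```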

There is a natural correspondence between the authority ID and the
observatory_location/telescope field in the dataset ids.
In most cases we would anticipate that when the responsibility for
data changed hands it would tend to be for data at this level of granularity.
However, we should probably recognize that no matter how carefully we plan,
if we really want to support dataset ids unchanged into the
indefinite future, then there is going to be some level at which we
cannot anticipate exactly how the data will be reorganized by our successors.

Since Sa/ROSAT was one of the initial descriptors provided by the
journals, that might be thought to favor forms 1 or 3 above.
However, if we think about how the data is to be handled in the future,
it seems not at all unlikely that the ROSAT EUV data and the ROSAT X-ray
data may end up with different institutions.  Options 2 and 4 give more
flexibility by separating these two datasets into different namespaces.  My
suggestion would be that the decision on the level of granularity to include
in the authority id should be left to the original
archiving institution, which presumably has the best sense of how the data
ought to be organized. The guideline might be to use the least granularity
consistent with the ease of long term management of the ids.

Among the ADEC institutions, there was quite general
consensus that we could not mandate a specific format for what
the right hand part of the ID strings should look like.  Since there are only
very limited constraints on what comes after the authority ID in an IVO
identifier this matches nicely.  However, there does seem to be
an issue as to where we specify the details for a  specific dataset:
in a fragment or in the URI string.

I think this goes back to the question of whether an individual
observation dataset is a 'resource'.  If the resource is
a collection of observations then the  individual dataset is a fragment of
the larger set.  However, I think we are eventually going to need some
way of identifying a particular observation
dataset, so it seems more productive to allow an individual dataset to
have an IVO id, unless we want to have a nomenclature for dataset ids that
is entirely separate from resource ids.

So I think that there are at least five distinct types of
resources that we need to manage that come up in this discussion.

  Authority ids:  Clearly we need some mechanism for managing these.
There seems to be reasonable agreement about what these IDs
might look like.
      There are some basic questions to be addressed here.
      Do they form a hierarchical structure like host names,
      so that the institution owning sa.rosat
      automatically controls sa.rosat.x?  If so, and we
      also want domain name based ids, perhaps we
      should follow the lead of Java packages
      and make the names go from most general to most
      specific.  E.g., ST could reserve
      a whole set of authority ids of the form:
                       edu.stsci

      Or the IDs defined using location/telescope should be reversed to
      go from specific to more general (i.e., rosat.sa, not sa.rosat).

     We might also wish to consider a general domain
     that is unprotected, so that users
     outside any formal VO umbrella are free to publish,
     but this unregulated status is also
     easily seen, similar to the alt hierarchy in USENET.
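
If authority IDs were made hierarchical and written most general first (an
open question, as noted above), the ownership test reduces to a label-wise
prefix check.  The sketch below assumes '.' separates levels; the function
names controls and reverse_labels are hypothetical.

```python
def controls(parent: str, child: str) -> bool:
    """True if `child` lies in the namespace rooted at `parent`.

    Assumes IDs are written most general first (e.g. "sa.rosat.x"),
    with "." separating levels.
    """
    parent_labels = parent.lower().split(".")
    child_labels = child.lower().split(".")
    return child_labels[:len(parent_labels)] == parent_labels


def reverse_labels(authority_id: str) -> str:
    """Flip the label order, e.g. "stsci.edu" becomes "edu.stsci"."""
    return ".".join(reversed(authority_id.split(".")))


print(controls("sa.rosat", "sa.rosat.x"))
print(reverse_labels("mast.stsci.edu"))
```

Comparing whole labels rather than raw string prefixes keeps sa.rosat from
appearing to control an unrelated ID such as sa.rosatx.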

  Institutions:  If authority IDs are linked to institutions, then we
need to have some mechanism of registering these.  While both authority IDs
and Institutions  need to be registered somewhere, it is
not clear that these are the same registries that
we are currently using for services.  I'd expect them to
be used in rather different ways.  We also need to decide what
the standardized name for an institution is, but this is probably
not an urgent issue.

  Data set collections: While we have defined cone search and SIA
services that use archives, I believe there is a real distinction
between the archive and the service.  E.g., users getting HEASARC data can
download the same file using HTTP or FTP protocols
or through several different on-line services.
The data collection is a resource independent of the
services used to access it.  We presumably want the dataset collection
ID to be part of the id for the individual dataset.  For the identifiers
above I would see several possibilities:  If we use fragments to identify
the individual dataset, then the collection ID is whatever comes before
the '#'.  Something a little more in keeping with regular URLs might be
to use a syntax like
     ivo://sa.rosat/x?set=rh300001n00
indicating that the id is a qualification of the collection.  Here a standard
keyword like 'set' or 'id' would be chosen, but there would be a natural
path for expansion.  Here everything before the '?' would be the dataset collection.
One final choice might be to have the collection be everything before the final '/'.
E.g., in ivo://sa.rosat/x/rh300001n00 the collection id would
be ivo://sa.rosat/x.  Regardless of how the collections are identified there should
probably be a new resource type defined for them and they should be registered
in a fashion similar to (and probably in the same registries as) existing services.
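
The conventions for locating the collection ID can be compared side by side
with a small sketch built on Python's urllib.parse.  The function name
collection_id is hypothetical, and trying the fragment and query forms before
the final-'/' form is an assumption of the sketch, not a proposal.

```python
from urllib.parse import urlsplit


def collection_id(ivo_id: str) -> str:
    """Return the collection part of a dataset identifier.

    Tries the fragment and query forms first, then the final-'/' form.
    """
    parts = urlsplit(ivo_id)
    if parts.fragment or parts.query:
        # Forms like .../x#id, .../#id or .../x?set=id: the path names the
        # collection; drop a trailing "/" left over from the ".../#id" form.
        return f"{parts.scheme}://{parts.netloc}{parts.path}".rstrip("/")
    # Form .../x/id: the collection is everything before the final "/".
    head, sep, tail = parts.path.rpartition("/")
    if not sep or not tail:
        raise ValueError(f"no dataset component in {ivo_id!r}")
    return f"{parts.scheme}://{parts.netloc}{head}"


for example in ("ivo://sa.rosat/x#rh300001n00",
                "ivo://sa.rosat/x?set=rh300001n00",
                "ivo://sa.rosat/x/rh300001n00"):
    print(example, "->", collection_id(example))
```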

  Individual datasets: This is a point I expect to be debated, but I think
we need to have an identifier that can indicate a particular
data set to be analyzed/extracted/converted/moved/subsetted ... in
some standard way.  This goes well beyond the
requirements for journal article indexes, but
these should be exemplars of the more general set.
I'd imagine that only certain authority ids
would be associated with the permanent data set ids
that would be appropriate for use in journals.
Others would be more transient.  These individual
datasets would generally not be registered
in harvestable registries -- though some might be.

  Data set services: The data set verification services that have
been built by ADEC institutions are the first
of what promise to be a large number of dataset
services.  These services should be registered just
like SIA or Cone search services once the protocol or
protocols used have been tied down a bit.  Doubtless this
will require some evolution of the existing services, but I don't
see this being a big problem so long as they remain compatible
with the very simple request that is currently being made of them.





