Registries, IVO ids, and Data Set Identifiers

Robert Hanisch rjhanisch at worldnet.att.net
Mon Sep 22 17:17:50 PDT 2003


I was about to write something along the lines that Tom has already done, so
eloquently, so I will keep my remarks very short.

I think the crux of the problem in bringing together the data set
identifiers with registries is the issue of granularity.  I think registries
thus far have been conceived (properly) to deal with _collections_ (the HST
archive, the Chandra archive, the NOAO archive) and what we want for journal
articles is persistent identifiers for _datasets_, individual observations
(which may include various amounts of ancillary data) in these collections.
However, I think the problem is easy to solve in a number of ways.

For journal links to datasets we want identifiers that maximize persistence
(have a link that withstands data migration and mirrors); the bibcode is an
excellent example, and benefits from the fact that "once published in ApJ,
always published in/by ApJ."  For VO resources we want to maximize
curation/responsibility -- who is behind this resource?   For journals we
already know -- ApJ or AJ or A&A is behind it, and we know what that means.
For VO resources it could be clear, such as Vizier services at CDS, or
ambiguous, such as "Bob's Best Astro Info" at bbai.net.  And finally, we
want to keep VO registries at a "manageable" level, which at least to start
argues for 10^2-10^4 entries rather than 10^6-10^10 entries.

So, I think we should focus on identifiers that meet these goals, provide
flexibility, and are easy enough to parse into their ultimate URLs.  In my
mind we need three components to the identifier to do this:

1) An authority ID
2) A resource key
3) A dataset (subset) ID

Two examples:

ivo://hst/mast.acs#q1234567.fits

  or

ivo://hst.stecf/acs#q1234567.fits

In the first 'hst' is the authority, and those of us who distribute HST data
would have to agree among ourselves on the resource keys so we do not step
on each other's toes.  Not hard, but slightly inconvenient.

In the second, the curator of the collection moves up and no negotiation
about resource IDs is necessary.  A bit weaker on persistence.
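
To show how easily such an identifier splits into the three components listed above, here is a minimal sketch in Python. The names DatasetIdentifier and parse_identifier are made up for illustration, and the syntax assumed (authority up to the first '/', resource key up to the '#', dataset ID after it) is only my reading of the two examples, not an agreed specification.

```python
from typing import NamedTuple


class DatasetIdentifier(NamedTuple):
    """The three components proposed for a dataset identifier."""
    authority: str     # e.g. 'hst' or 'hst.stecf'
    resource_key: str  # e.g. 'mast.acs' or 'acs'
    dataset_id: str    # e.g. 'q1234567.fits'


def parse_identifier(identifier: str) -> DatasetIdentifier:
    """Split 'ivo://<authority>/<resource key>#<dataset ID>' into its parts.

    Assumed syntax for illustration only; raises ValueError if any
    component is missing.
    """
    scheme, sep, rest = identifier.partition("://")
    if sep != "://" or scheme != "ivo":
        raise ValueError(f"not an ivo:// identifier: {identifier!r}")
    # Everything after '#' names the individual dataset.
    location, _, dataset_id = rest.partition("#")
    # The first path segment is the authority; the remainder is the resource key.
    authority, _, resource_key = location.partition("/")
    if not (authority and resource_key and dataset_id):
        raise ValueError(f"missing component in {identifier!r}")
    return DatasetIdentifier(authority, resource_key, dataset_id)


first = parse_identifier("ivo://hst/mast.acs#q1234567.fits")
second = parse_identifier("ivo://hst.stecf/acs#q1234567.fits")
print(first.authority, first.resource_key, first.dataset_id)
print(second.authority, second.resource_key, second.dataset_id)
```

Both forms parse with the same rule; they differ only in whether the curator sits in the authority or in the resource key.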

(I don't much like the 'sa' qualifier, as in sa.hst; I don't think this adds
much.  I'd rather just see a well specified telescope/facility in the
authority ID, and curatorship handled separately, as in kpno.4m/noao... or
kpno.4m.noao/...)

Alberto M. would argue against using instrument names, as in some sense they
are redundant for HST observations.  This may not be true for other
missions, projects, or telescopes.  In any case, in the second example STECF
is free to choose whatever it wants for the resource key, as long as there
is one.  They could use ivo://hst.stecf/hst#q1234567.fits, say.  This
asserts that they will always hold HST data as an integral collection,
whereas Tom has noted that certain instruments from a space mission might be
split up.

Having separate authority and resource components allows:
o  Different providers of the "same" data to offer enhanced data products,
and authors to be clear about which they are citing.
o  Quite simple maintenance of the identifiers, so that if MAST disappears
the identifier resolution service need only substitute a forward mapping
(mast.acs --> heasarc.acs).
o  Immediate identification of mirror, and near-mirror (enhanced mirror)
sites.
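
The forward mapping in the second point could look roughly like the sketch below. Everything in it is hypothetical: the table names, the resolve function, and the example.org base URL are placeholders, and no real resolution service is implied.

```python
# Hypothetical forward mappings for retired resource keys
# (the mast.acs --> heasarc.acs case mentioned above).
FORWARD = {"mast.acs": "heasarc.acs"}

# Hypothetical table of base URLs, keyed by (authority, resource key).
BASE_URLS = {
    ("hst", "heasarc.acs"): "http://archive.example.org/hst/acs/",
}


def resolve(authority: str, resource_key: str, dataset_id: str) -> str:
    """Follow forward mappings, then build the URL for the dataset."""
    seen = set()
    while resource_key in FORWARD:
        if resource_key in seen:
            raise ValueError(f"forwarding loop at {resource_key!r}")
        seen.add(resource_key)
        resource_key = FORWARD[resource_key]
    try:
        base = BASE_URLS[(authority, resource_key)]
    except KeyError:
        raise LookupError(
            f"no known location for {authority}/{resource_key}"
        ) from None
    return base + dataset_id


# An identifier published while MAST held the data still resolves.
print(resolve("hst", "mast.acs", "q1234567.fits"))
```

The published identifier never changes; only the resolver's tables do.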

I formerly argued for using domain names for the authority; I now think this
is not the best solution.  I think if we keep the authority ID as generic as
possible we maximize persistence, yet through the resource key can show
curatorship and specificity (if we wish... ivo://hst/mast/q1234567.fits
would do just as well in the above example).  We need to distinguish
datasets from collections to keep the granularity at a manageable level.

Sorry -- longer than planned!  Hope to talk to many of you tomorrow in our
11am EDT telecon!

Cheers,
Bob


