Identification model in VO Spectrum Model
Rob Seaman
seaman at noao.edu
Sat Oct 28 06:22:13 PDT 2006
Doug Tody wrote:
> I suggest reviewing the protocol document and commenting on what is
> required for a minimally or fully compliant service for the specific
> use-case of data discovery and selection.
Pavlos Protopapas wrote:
> Now about the issue of ID that I raised.
> A simple scenario. Let's say somehow
> I get two spectra. I do not know how and why. Maybe
> my program generates them, therefore SSA was not involved
> in this. Now I want to make sure that I do not have duplicates.
> I do need an ID, don't I ?
A separate ID field is certainly helpful and perhaps even
"required" (for instance, to guarantee uniqueness of sample selection
for some statistical study), but it may not be strictly necessary.
Combining information from a small selection of other metadata may
provide as unique an identifier as an archive-or-service-supplied
identifier. For raw data, telescope+timestamp is often sufficient.
NOAO has been adding such an OBSID keyword to our headers for many
years. (Nobody will dispute the value of having the disambiguation
string precomputed.) For telescopes in which multiple instruments
may be used in a single observing session, we add an instrument ID to
the mix. For instruments that take multiple exposures or that can
take rapid sequential exposures, we add an instrument-supplied
running number.
VO in general and spectra in particular are typically not focused on
raw data, of course, and multiple data products can result from a
single raw input. In that case one might consider disambiguating by
adding a processing code. Then you run into the versioning problem -
perhaps a pipeline was run twice with different calibrations. So add
a versioning code. There is always some way to disambiguate.
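
As a minimal sketch of the composite-identifier idea described above: the
class name, field names, separator and example values below are all
hypothetical, chosen only for illustration. This is not the format of the
NOAO OBSID keyword. Each optional field corresponds to one of the
disambiguation steps (instrument, running number, processing code, version).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetKey:
    """Hypothetical composite identifier built from existing metadata."""
    telescope: str
    timestamp: str                    # assumed compact UTC, e.g. "20061028T132213"
    instrument: Optional[str] = None  # needed if several instruments share a session
    sequence: Optional[int] = None    # instrument-supplied running number
    processing: Optional[str] = None  # which data product was derived
    version: Optional[str] = None     # which pipeline run or calibration

    def as_string(self) -> str:
        # Join only the components that are present, in a fixed order.
        seq = None if self.sequence is None else f"{self.sequence:04d}"
        parts = [self.telescope, self.timestamp, self.instrument,
                 seq, self.processing, self.version]
        return ".".join(p for p in parts if p is not None)


# Two products of the same raw exposure differ only in the version code.
a = DatasetKey("tel1", "20061028T132213", "specA", 7, "extract1d", "v1")
b = DatasetKey("tel1", "20061028T132213", "specA", 7, "extract1d", "v2")
unique = {a, b, a}   # frozen dataclasses are hashable, so a set removes repeats
```

Such a key is only as trustworthy as the metadata it is built from, and
skipping absent components can make two different combinations collide if
the fields are not kept distinct in practice.
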
The point I'm trying to make is that an ID is no guarantee of
uniqueness unless the entire chain of data handling and processing is
always controlled - and in that case other metadata may serve equally
well.
The only true ID is supplied by each dataset itself, for instance as
a checksum, hash function, message digest or digital signature of the
pixels (however represented for a spectrum). I've often used IRAF
imstat to report skew and kurtosis as well as the more typical low
order statistical moments when I need true confirmation that an image
I'm handling in one context is the same as another image presented to
me in a different context.
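
A rough sketch of both checks mentioned above, using only the Python
standard library. The function names are hypothetical, the flux values are
assumed to be a plain sequence of numbers, and the moment formulas are the
ordinary population definitions, which may not match what IRAF imstat
reports. The digest depends on the chosen representation (here big-endian
64-bit floats, in order), so both copies must be serialized the same way.

```python
import hashlib
import math
import struct


def pixel_digest(values):
    """SHA-256 over the values packed as big-endian float64, in order."""
    h = hashlib.sha256()
    for v in values:
        h.update(struct.pack(">d", float(v)))
    return h.hexdigest()


def moments(values):
    """Return (mean, stddev, skew, excess kurtosis), population formulas."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # Constant data: higher moments are undefined.
        return mean, 0.0, float("nan"), float("nan")
    return mean, math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


flux = [1.0, 2.0, 3.0, 4.0, 5.0]
same = pixel_digest(flux) == pixel_digest(list(flux))   # identical samples
```

Matching moments are strong evidence that two arrays are the same; a
matching digest is evidence of bit-for-bit identity under the assumed
serialization.
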
One could imagine protecting the metadata using similar techniques,
for instance, by "blinking" one FITS header against another (overlay
two xterm windows and toggle each in turn). But unlike the data
values themselves, metadata may not preserve ordering, header
keywords may be rearranged, etc. Comparing headers by meaning rather
than byte-for-byte implies choosing which keywords count, but then you
are just back to the original discussion above.
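
To illustrate why keyword selection becomes unavoidable, here is a small
sketch of an order-independent header digest. The function name is
hypothetical, the header is assumed to be already parsed into a dictionary,
and the choice of which keywords to include is exactly the judgment call
discussed above.

```python
import hashlib


def header_digest(cards, keywords):
    """Digest of selected header keywords, independent of card order.

    cards    -- dict mapping keyword to value (assumed already parsed)
    keywords -- the keywords considered significant (an assumption)
    """
    h = hashlib.sha256()
    for key in sorted(keywords):
        # repr() keeps 1 and 1.0, or "a" and "a ", distinct.
        h.update(f"{key}={cards.get(key)!r}\n".encode("utf-8"))
    return h.hexdigest()


first = {"TELESCOP": "tel1", "DATE-OBS": "2006-10-28T13:22:13",
         "INSTRUME": "specA", "COMMENT": "first copy"}
second = {"COMMENT": "second copy", "INSTRUME": "specA",
          "DATE-OBS": "2006-10-28T13:22:13", "TELESCOP": "tel1"}
selected = ["TELESCOP", "DATE-OBS", "INSTRUME"]
match = header_digest(first, selected) == header_digest(second, selected)
```

The two headers compare equal despite the reordering and the differing
COMMENT, but only because COMMENT was left out of the selection.
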
But of course the NOAO Science Archive, and the "Save the bits"
data-store before it, adds a serial number to each ingested data product.
In some real sense, however, each file's MD5 or each HDU's FITS
checksum is the only real identifier once a dataset escapes into the
wild. An archive's (or VO service's) internal identifiers are only
rigorously reliable for data kept close to home. Data security and
data identification are two aspects of the same issue.
Rob Seaman