Identification model in VO Spectrum Model

Sat Oct 28 08:22:25 PDT 2006

Hi Rob -

This is a fine general statement about instrumental datasets which gets
into many of the issues, but we are talking about a standardized data
model and dataset here.  Note that for spectra we do not generally want
to return instrumental datasets (since there is no standard format)
but instead data conforming to the spectral data model.

OBSID is a good example.  When NOAO data is published to the VO, the
OBSID should be converted to a CreatorDID - this is the same thing,
just in the form of an IVOA identifier, with NOAO as the authority.
If the CreatorDID is given, then this, plus possibly the creation date
and version (if there are multiple versions), is enough to uniquely
identify the dataset.  It would also be good to identify the collection,
which in this example would be the instrumental data collection.
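
As a rough illustration of the identification scheme described above,
the sketch below builds a comparison key from the CreatorDID plus the
optional creation date and version, and uses it to drop duplicate
records.  The dictionary keys, the DatasetKey type and the identifiers
are hypothetical names chosen for this example; they are not taken from
the SSA documents.

```python
from typing import NamedTuple, Optional


class DatasetKey(NamedTuple):
    """Identity of a published dataset, as described in the text above."""
    creator_did: str               # creator-assigned ID in IVOA identifier form
    creation_date: Optional[str]   # optional, part of the key when present
    version: Optional[str]         # only needed if multiple versions exist


def dataset_key(meta: dict) -> DatasetKey:
    # The dictionary keys used here are hypothetical names for this sketch.
    return DatasetKey(
        meta["creator_did"].strip(),
        meta.get("creation_date"),
        meta.get("version"),
    )


def drop_duplicates(records: list) -> list:
    """Keep the first record seen for each distinct DatasetKey."""
    seen = set()
    unique = []
    for rec in records:
        key = dataset_key(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Made-up identifiers, for illustration only.
records = [
    {"creator_did": "ivo://noao.example/kp4m#obs0042", "version": "1"},
    {"creator_did": "ivo://noao.example/kp4m#obs0042", "version": "1"},
    {"creator_did": "ivo://noao.example/kp4m#obs0042", "version": "2"},
]
print(len(drop_duplicates(records)))
```

The first two records share a key and collapse into one; the third
differs only in version and is kept as a separate dataset.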

Essentially everything you mention below is covered in the SSA
documents, except for something like a checksum for the dataset file,
since in VO we are generally dealing with virtual data, or data which
may be curated and published by someone other than the creating entity
(and often modified at the bitstream level even though the content of
the data is not significantly changed).  Hence we make a distinction
between fundamental dataset identity (the survey or observatory which
created it), and curation/publication.  Provenance, e.g., production
of virtual data from a more fundamental dataset by a service, is
also covered.  An example is two cutouts taken from different regions
of the same fundamental dataset.
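
To make the bitstream versus content distinction concrete, here is a
minimal sketch, assuming a toy file layout of one header line followed
by big-endian float64 flux values.  That layout and the function names
are invented for this example and are not a real VO or FITS format.
It shows two republished copies whose file-level digests differ while
a digest of the decoded values agrees.  SHA-256 is used for both
digests; MD5 would behave the same way.

```python
import hashlib
import struct


def file_digest(data: bytes) -> str:
    """Digest of the exact bytes of a file (the bitstream)."""
    return hashlib.sha256(data).hexdigest()


def read_flux(data: bytes) -> tuple:
    """Decode the toy layout: one header line, then big-endian float64s."""
    body = data.split(b"\n", 1)[1]
    return struct.unpack(f">{len(body) // 8}d", body)


def content_digest(flux) -> str:
    """Digest of the flux values in one fixed binary form."""
    packed = struct.pack(f">{len(flux)}d", *flux)
    return hashlib.sha256(packed).hexdigest()


flux = [1.0, 2.5, 3.25]
payload = struct.pack(">3d", *flux)

# Two hypothetical publications of the same spectrum: only the header differs.
file_a = b"CURATOR=archive-a\n" + payload
file_b = b"CURATOR=archive-b\n" + payload

same_bits = file_digest(file_a) == file_digest(file_b)
same_content = (content_digest(read_flux(file_a))
                == content_digest(read_flux(file_b)))
print(same_bits, same_content)
```

A content digest of this kind only agrees when the decoded values are
bit-identical.  It does not help once a service resamples or rescales
the data, which is one reason identity is carried in the metadata
rather than in a checksum.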

Dealing with all this generic dataset metadata (also physical
characterization) was a major goal of the SSA effort.  Most of this
is not specific to spectra at all, but applies to any type of data.
Be sure to review the protocol document as well as the data model,
in particular the "concepts" section.

 	- Doug

On Sat, 28 Oct 2006, Rob Seaman wrote:

> Doug Tody wrote:
>
>> I suggest reviewing the protocol document and commenting on what is
>> required for a minimally or fully compliant service for the specific
>> use-case of data discovery and selection.
>
> Pavlos Protopapas wrote:
>
>> Now about the issue of ID that I raised.
>> A simple scenario. Let's say somehow
>> I get two spectra. I do not know how or why. Maybe
>> my program generates them, so SSA was not involved
>> in this. Now I want to make sure that I do not have duplicates.
>> I do need an ID, don't I?
>
> A separate ID field is certainly helpful and perhaps even "required" (for 
> instance, to guarantee uniqueness of sample selection for some statistical 
> study), but it may not be strictly necessary.  Combining information from a 
> small selection of other metadata may provide as unique an identifier as an 
> archive-or-service-supplied identifier.  For raw data, telescope+timestamp is 
> often sufficient.  NOAO has been adding such an OBSID keyword to our headers 
> for many years.  (Nobody will dispute the value of having the disambiguation 
> string precomputed.)  For telescopes in which multiple instruments may be 
> used in a single observing session, we add an instrument ID to the mix.  For 
> instruments that take multiple exposures or that can take rapid sequential 
> exposures, we add an instrument supplied running number.
>
> VO in general and spectra in particular are typically not focused on raw 
> data, of course, and multiple data products can result from a single raw 
> input.  In that case one might consider disambiguating by adding a processing 
> code.  Then you run into the versioning problem - perhaps a pipeline was run 
> twice with different calibrations.  So add a versioning code.  There is 
> always some way to disambiguate.
>
> The point I'm trying to reach is that an ID is no guarantee of uniqueness 
> unless the entire chain of data handling and processing is always controlled 
> - and in that case other metadata may serve equally well.
>
> The only true ID is supplied by each dataset itself, for instance as a 
> checksum, hash function, message digest or digital signature of the pixels 
> (however represented for a spectrum).  I've often used IRAF imstat to report 
> skew and kurtosis as well as the more typical low order statistical moments 
> when I need true confirmation that an image I'm handling in one context is 
> the same as another image presented to me in a different context.
>
> One could imagine protecting the metadata using similar techniques, for 
> instance, by "blinking" one FITS header against another (overlay two xterm 
> windows and toggle each in turn).  But unlike the data values themselves, 
> metadata may not preserve ordering, header keywords may be rearranged, etc. 
> Semantics implies keyword selection, but then you are just back to the 
> original discussion above.
>
> But of course the NOAO Science Archive, and the "Save the bits" data-store 
> before it, adds a serial number to each ingested data product.  In some real 
> sense, however, each file's MD5 or each HDU's FITS checksum is the only real 
> identifier once a dataset escapes into the wild.  An archive's (or VO 
> service's) internal identifiers are only rigorously reliable for data kept 
> close to home.  Data security and data identification are two aspects of the 
> same issue.
>
> Rob Seaman