Identification model in VO Spectrum Model

Sat Oct 28 06:22:13 PDT 2006

Doug Tody wrote:

> I suggest reviewing the protocol document and commenting on what is
> required for a minimally or fully compliant service for the specific
> use-case of data discovery and selection.

Pavlos Protopapas wrote:

> Now about the issue of ID that I raised.
> A simple scenario. Lets say somehow
> I get two spectra. I do not know how and why. May be
> my program generates them therefore SSA was not involved
> in this. Now I want to make sure that I do not have duplicates.
> I do need an ID, don't I ?

A separate ID field is certainly helpful and perhaps even  
"required" (for instance, to guarantee uniqueness of sample selection  
for some statistical study), but it may not be strictly necessary.   
Combining information from a small selection of other metadata may  
provide as unique an identifier as an archive-or-service-supplied  
identifier.  For raw data, telescope+timestamp is often sufficient.   
NOAO has been adding such an OBSID keyword to our headers for many  
years.  (Nobody will dispute the value of having the disambiguation  
string precomputed.)  For telescopes in which multiple instruments  
may be used in a single observing session, we add an instrument ID to  
the mix.  For instruments that take multiple exposures or that can  
take rapid sequential exposures, we add an instrument supplied  
running number.

VO in general and spectra in particular are typically not focused on  
raw data, of course, and multiple data products can result from a  
single raw input.  In that case one might consider disambiguating by  
adding a processing code.  Then you run into the versioning problem -  
perhaps a pipeline was run twice with different calibrations.  So add  
a versioning code.  There is always some way to disambiguate.
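
As a minimal sketch of the scheme described above: the function name, the
field order, the separator and the "kp4m"/"mosaic"/"calib" codes are all
hypothetical, chosen only for illustration, and are not the format of the
NOAO OBSID keyword.

```python
from datetime import datetime, timezone
from typing import Optional


def make_id(telescope: str,
            start: datetime,
            instrument: Optional[str] = None,
            sequence: Optional[int] = None,
            processing: Optional[str] = None,
            version: Optional[int] = None) -> str:
    """Join disambiguating metadata into one identifier string.

    Hypothetical layout: telescope[.instrument].timestamp[.seq][.proc][.vN]
    """
    if start.tzinfo is None:
        # A naive timestamp would be interpreted as local time; refuse it.
        raise ValueError("start must be timezone-aware")
    stamp = start.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%S")
    parts = [telescope.lower()]
    if instrument is not None:
        parts.append(instrument.lower())      # several instruments per session
    parts.append(stamp)
    if sequence is not None:
        parts.append(f"{sequence:04d}")       # instrument running number
    if processing is not None:
        parts.append(processing.lower())      # which derived product
    if version is not None:
        parts.append(f"v{version}")           # which pipeline run
    return ".".join(parts)


start = datetime(2006, 10, 28, 13, 22, 13, tzinfo=timezone.utc)
raw_id = make_id("kp4m", start, instrument="mosaic", sequence=7)
reduced_id = make_id("kp4m", start, instrument="mosaic", sequence=7,
                     processing="calib", version=2)
```

Each optional field is added only when the simpler combination stops
being unique, which mirrors the progression from raw data to reprocessed
products.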

The point I'm trying to make is that an ID is no guarantee of
uniqueness unless the entire chain of data handling and processing is
always controlled - and in that case other metadata may serve equally
well.

The only true ID is supplied by each dataset itself, for instance as  
a checksum, hash function, message digest or digital signature of the  
pixels (however represented for a spectrum).  I've often used IRAF  
imstat to report skew and kurtosis as well as the more typical low  
order statistical moments when I need true confirmation that an image  
I'm handling in one context is the same as another image presented to  
me in a different context.
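
A rough sketch of both approaches, assuming the pixel (or flux) values
are already in hand as a plain sequence of numbers.  The function names
are hypothetical, SHA-256 is just one possible digest, and the moment
definitions below are the textbook population forms, which may not match
what IRAF imstat reports in every detail.

```python
import hashlib
import struct
from typing import Sequence, Tuple


def pixel_digest(values: Sequence[float]) -> str:
    """SHA-256 over the values packed as big-endian 64-bit floats.

    Fixing the byte order and width makes the digest independent of the
    host machine, though not of the chosen numeric representation.
    """
    packed = struct.pack(f">{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


def moments(values: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (mean, standard deviation, skew, excess kurtosis).

    Requires at least one value and non-zero variance.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("constant data: skew and kurtosis are undefined")
    return mean, m2 ** 0.5, m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


spectrum_a = [1.0, 2.0, 3.0, 4.0, 5.0]
spectrum_b = list(spectrum_a)          # same values seen in another context
same = pixel_digest(spectrum_a) == pixel_digest(spectrum_b)
```

The digest answers "bit-for-bit identical or not"; the moments are a
weaker but human-readable fingerprint that can be compared by eye.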

One could imagine protecting the metadata using similar techniques,  
for instance, by "blinking" one FITS header against another (overlay  
two xterm windows and toggle each in turn).  But unlike the data  
values themselves, metadata may not preserve ordering, header  
keywords may be rearranged, etc.  Comparing by meaning rather than by
bytes implies selecting which keywords matter, but then you are just
back to the original discussion above.
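
To illustrate the ordering and keyword-selection problem, here is a
sketch that digests only a chosen subset of header keywords in sorted
order.  It assumes the header has already been parsed into a mapping;
the function name and the example keyword list are hypothetical.

```python
import hashlib
from typing import Iterable, Mapping


def header_digest(header: Mapping[str, object],
                  keywords: Iterable[str]) -> str:
    """SHA-256 over selected keyword=value pairs, in sorted keyword order.

    Sorting removes any dependence on the order of cards in the header;
    restricting to `keywords` ignores cards that carry no identity.
    A selected keyword that is absent is recorded as None.
    """
    digest = hashlib.sha256()
    for key in sorted(set(keywords)):
        value = header.get(key)
        digest.update(f"{key}={value!r}\n".encode("utf-8"))
    return digest.hexdigest()


selected = ["TELESCOP", "INSTRUME", "DATE-OBS"]   # illustrative choice only
header_1 = {"TELESCOP": "kp4m", "INSTRUME": "mosaic",
            "DATE-OBS": "2006-10-28T13:22:13"}
header_2 = {"DATE-OBS": "2006-10-28T13:22:13", "OBSERVER": "someone",
            "INSTRUME": "mosaic", "TELESCOP": "kp4m"}
match = header_digest(header_1, selected) == header_digest(header_2, selected)
```

The result is only as good as the choice of `selected`, which is the
same question as which metadata to combine into an identifier.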

But of course the NOAO Science Archive, and the "Save the bits"
data-store before it, adds a serial number to each ingested data product.
In some real sense, however, each file's MD5 or each HDU's FITS  
checksum is the only real identifier once a dataset escapes into the  
wild.  An archive's (or VO service's) internal identifiers are only  
rigorously reliable for data kept close to home.  Data security and  
data identification are two aspects of the same issue.
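
For the duplicate-detection scenario raised above, a sketch that groups
files by their MD5 digest.  The function names are hypothetical, and
this detects byte-identical files only; the FITS checksum convention is
a different algorithm and is not implemented here.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file's bytes, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths: Iterable[str]) -> List[List[str]]:
    """Return groups of paths whose contents share the same MD5."""
    by_digest: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        by_digest[file_md5(path)].append(path)
    return [group for group in by_digest.values() if len(group) > 1]
```

Two files holding the same spectrum with different headers will not be
grouped together, which is where a digest of the data values alone would
be needed instead.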

Rob Seaman