On ID "sameness"

Arnold Rots arots at head-cfa.cfa.harvard.edu
Wed Feb 5 07:42:57 PST 2003


I am still uneasy with the requirements document.  In trying to be
very general, it appears to gloss over a number of differences that
may be fundamental.  The problem, in my mind, is that the properties
of the ID (in being able to convey certain attributes of a particular
instance of an object) take on different meaning, depending on what
kind of object one is talking about.

The ID is supposed to identify an object; and an object can be many
things: a collection of observations, a single observation, a
collection of files, or a single file.

One problem I see is that sameness does not apply equally to these
different types of objects.  There is no point in talking about two
observations being byte-for-byte identical; two IDs might refer to the
same observation (whether that's a sensible thing is another issue),
but if one views an observation as a collection of files derived from
or associated with that observation, the two collections may not
necessarily contain the same files.  Yet, they represent one identical
observation.

If we talk about single file objects, the answers are easy: take the
filename as identifier - that should include information on format and
version, assuming that all filenames in a depository are unique.

But when talking about file collections, it gets more complicated.
"The current default package of primary products" is something that
does not really have a version attached to it, or a format.  It
contains files that have a version and a format, but this particular
package may contain different mixes at different times.
Maybe the format could be "primary data products package", but that is
a vague concept that is very different from a universally understood
format like jpeg.  And maybe its assembly date could be taken as a
version, but that says nothing about two instances of its being
byte-for-byte identical.
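
To make that last point concrete, here is a minimal sketch in Python of
one way to test whether two instances of a package are byte-for-byte
identical, independent of any assembly date.  The function names, the
file names, and the representation of a package as a mapping from file
name to contents are all assumptions made for illustration; none of
this comes from the requirements document.

```python
import hashlib


def file_digest(content: bytes) -> str:
    """Return the SHA-256 hex digest of one file's bytes."""
    return hashlib.sha256(content).hexdigest()


def package_digest(files: dict[str, bytes]) -> str:
    """Digest a package from its (file name, file digest) pairs.

    Sorting by name makes the result independent of listing order.
    """
    h = hashlib.sha256()
    for name in sorted(files):
        h.update(name.encode("utf-8"))
        h.update(b"\0")
        h.update(file_digest(files[name]).encode("ascii"))
        h.update(b"\n")
    return h.hexdigest()


# Hypothetical example: the "same" package assembled at two different
# times.  The second instance holds a reprocessed event file.
package_jan = {"evt2.fits": b"events v1", "image.jpg": b"image v1"}
package_feb = {"evt2.fits": b"events v2", "image.jpg": b"image v1"}

same_bytes = package_digest(package_jan) == package_digest(package_feb)
print(same_bytes)  # False
```

Both instances would carry the same package label, yet the digests
differ, so a label plus an assembly date cannot stand in for
byte-level identity.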

I am not sure I am putting this very clearly, but I guess I am still
not convinced that we can specify a one-size-fits-all ID to label
everything in the universe, especially when that labeling is to
include certain properties that are not necessarily shared by all
those "things".

  - Arnold

Ray Plante wrote:
> Hi,
> 
> We still have this issue of "sameness": that is, when should two
> instances of an object be considered the same, and thus be referred to by
> the same identifier?  Reagan brought up the concept of "semantic
> copies", copies that are semantically the same but might have a
> different byte-representation.  Tom indicated what might be considered
> semantically equivalent might depend on the context; he suggested that
> we should leave it up to the user to decide when objects are the same
> rather than locking it in up front.  
> 
> Drawing on ideas presented earlier by others, I'd like to recommend
> the following principles on defining "sameness".  They draw on the
> requirements discussed in my previous message.
> 
> 1. Two identifiers refer to the same thing when the identifiers are
>    character-for-character identical.
> 
> 2. Two local IDs are identical only when the context of the ID is the
>    same.  Global IDs lock in a specific context, and thus can be
>    compared in an absolute sense.
> 
> 3. A description of a resource, service, or data collection might
>    reference identifiers associated with various aspects of the
>    subject.  Examples might include:
>      *  observation ID
>      *  "derived from" or "mirror of" ID.
>      *  parent collection ID
> 
> 4. When two instances of an object can be considered the same is up to
>    the curating resource and will depend on the object being
>    identified.  Curators should consider the following
>    recommendations:
>      * Two resources can be given the same ID if their
>        descriptions are identical apart from the access point.
>      * Two services can be given the same ID if:
>          o  their descriptions are identical (including the interface
>             inputs and outputs) apart from the access point.
>          o  the implementations are identical, or otherwise return
>             byte-for-byte identical output for any given set of inputs.
>      * Two data collections (i.e. anything that is
>        byte-instantiatable) can be given the same ID if they are
>        byte-for-byte identical.
>    It may be necessary to establish rules or conventions that control
>    who is allowed to declare a "mirror" of something.  
> 
> 5. Because an identifier can be assigned to a variety of things, be
>    they abstract/virtual (e.g. resource IDs, observation IDs) or real
>    byte-instantiatable (e.g. collection IDs), services that
>    specify an ID as part of input or output should be very clear as to
>    what the ID refers to (e.g. an image, a table row, an
>    observation from which a data item is derived, etc.).  
> 
>    In particular with respect to data in a VOTable, it should maximize
>    the reader's options for determining whether two rows are effectively
>    the same for a particular purpose.  This will likely require
>    definitions of various ID UCDs.
> 
>    One useful UCD might be one for a "semantic identifier", that
>    refers to another data item that, in the eyes of the writer, can
>    be considered equivalent to the item being described.  This could
>    be used by the SIA to group images that differ primarily in
>    format.  
> 
> hope this helps,
> Ray
> 
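
Principles 1 and 2 in the quoted message can be read as a comparison
rule.  The following is only a sketch of that reading in Python; the
class names, function names, and archive names are hypothetical and do
not appear in the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalID:
    """An identifier that is only meaningful within some context."""
    value: str


@dataclass(frozen=True)
class GlobalID:
    """An identifier that carries its context with it."""
    context: str
    value: str


def same_local(a: LocalID, ctx_a: str, b: LocalID, ctx_b: str) -> bool:
    # Principle 2: local IDs match only if their contexts match as well.
    return ctx_a == ctx_b and a.value == b.value


def same_global(a: GlobalID, b: GlobalID) -> bool:
    # Principle 1: character-for-character comparison; the context is
    # part of a global ID, so no outside information is needed.
    return (a.context, a.value) == (b.context, b.value)


# Hypothetical archives that happen to reuse the same local ID.
obs = LocalID("obs-0001")
print(same_local(obs, "archive-a", obs, "archive-b"))  # False
print(same_global(GlobalID("archive-a", "obs-0001"),
                  GlobalID("archive-a", "obs-0001")))  # True
```

Note that this comparison says nothing about what kind of object the
ID labels, which is the concern raised at the top of this message.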
--------------------------------------------------------------------------
Arnold H. Rots                                Chandra X-ray Science Center
Smithsonian Astrophysical Observatory                tel:  +1 617 496 7701
60 Garden Street, MS 67                              fax:  +1 617 495 7356
Cambridge, MA 02138                             arots at head-cfa.harvard.edu
USA                                     http://hea-www.harvard.edu/~arots/
--------------------------------------------------------------------------


