Cube model - Dataset IDs

Thu Mar 19 17:44:52 CET 2015

On Thu, 19 Mar 2015, CresitelloDittmar, Mark wrote:

> Thanks for the background Doug.
>
> So the use-case for the DataID.datasetID is:
>  + data center has a registered publisher ID
>  + assigns its own [publisher] dataset ids, which may change over time
>     (these are IVOA IDs, so globally unique and with proper syntax)
>  + the dataset has ALSO been assigned a persistent dataset id from
>     a 'global index service' such as ADS which the publisher wants to
>     retain in the dataset.

Yes, but all we can say is the dataset MAY have also been assigned a
persistent dataset id from an external indexing service.  For many
datasets this value may be null.  The publisherDID, however, can always
have a valid value.
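
As a minimal sketch of that rule, assuming a plain Python record (the
class name DatasetIdentity and its check are illustrative and not part
of any IVOA model), the publisherDID is mandatory while the datasetID
defaults to null:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetIdentity:
    """Hypothetical container for the identifiers discussed in this thread."""

    publisher_did: str                 # Curation.publisherDID: expected to always be set
    dataset_id: Optional[str] = None   # DataID.datasetID: from an external index, may be null
    creator_did: Optional[str] = None  # DataID.creatorDID

    def __post_init__(self) -> None:
        # Reject an empty publisherDID; the other identifiers stay optional.
        if not self.publisher_did:
            raise ValueError("publisherDID must have a valid value")


# A dataset that has also been registered with an external index service.
indexed = DatasetIdentity(
    publisher_did="ivo://mast.stsci.edu?obsid=1234",
    dataset_id="ads/sh.hut#ngc4151_141",
)

# A dataset with no external registration: datasetID stays None.
unindexed = DatasetIdentity(publisher_did="ivo://mast.stsci.edu?obsid=1234")
```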

> I'll use some MAST file info as an example (but it doesn't have a
> publisherDID).
> The resulting file would contain:
>    Curation.publisher = "MAST"
>    Curation.publisherID = "ivo://mast.stsci.edu"
>    Curation.publisherDID = "ivo://mast.stsci.edu?obsid=1234"
>       <some 'mast'-specific ID (using the above as a basis?)>
>    DataID.datasetID = "ads/sh.hut#ngc4151_141"
>    DataID.creatorDID = "ngc4151_141"

Yes, this is a typical example.  I would say that the creatorDID should
ideally also be an IVOA identifier, to ensure that it is globally
unique.  If it is a DID it should be an IVOA identifier.

> Questions..
> 1)  Is it possible for an archive/data center/data provider to NOT have
>     a registered publisherID?
>     In other words, NOT be able to assign identifiers.  Instead, it
>     relies on an external 'global index service' to provide it with
>     identifiers for its holdings.  In this case, there would be just
>     the one identifier, which could be either the publisherDID OR the
>     datasetID.
>     Maybe this is the 'more on this' case?
>
>     Curation.publisher = "MAST"
>     Curation.publisherID = <none>
>     Curation.publisherDID = "ads/sh.hut#ngc4151_141"
>     DataID.datasetID = "ads/sh.hut#ngc4151_141"
>     DataID.creatorDID = "ngc4151_141"
>
>    I'm not sure which location this 'global index id' should go in, so
>    I put it in both.

Any entity that has registered VO services and serves data is a
publisher and needs to have a registered authority ID.  It is the
authority ID that is used to form a dataset identifier.
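
A minimal sketch of forming a dataset identifier from an authority ID,
modelled on the "ivo://mast.stsci.edu?obsid=1234" example quoted above.
The function name make_publisher_did and the "obsid" key are assumptions
for illustration; what a data center puts after the "?" is its own
choice.

```python
from urllib.parse import quote


def make_publisher_did(authority: str, local_id: str, key: str = "obsid") -> str:
    """Build a publisher dataset ID from a registered authority ID.

    Hypothetical helper: `authority` is the bare authority name
    (e.g. "mast.stsci.edu") and `local_id` is the publisher's own key.
    """
    if not authority or "/" in authority:
        raise ValueError("expected a bare authority ID such as 'mast.stsci.edu'")
    # Percent-encode the local part so it cannot break the identifier syntax.
    return f"ivo://{authority}?{key}={quote(local_id, safe='')}"


print(make_publisher_did("mast.stsci.edu", "1234"))
# prints: ivo://mast.stsci.edu?obsid=1234
```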

Note, our newest VO services and also ObsCore essentially require a
valid publisherDID for each dataset; this is the only reliable way to
uniquely refer to a dataset.  The same dataset may be replicated in
multiple places, with each having a different publisherDID.

I am not absolutely sure what the publisherID is, but I think it would
be an IVO identifier for a registry record describing a publisher
resource.  This is nice to have, but it is the publisherDID that is
essential to have.

> 2) My ignorance surrounding identifiers may become apparent here, but...
>     I'm not sure if a single dataset can be tagged with >1 identifier
>     from any given 'global index service', but there are, presumably,
>     multiple 'global index services'.
>     So, there is a question about multiplicity for that attribute.
>
>     If the ADS IDs are publication based, then this would be a growing
>     list, as a dataset is used in various research.  Keeping this sort
>     of metadata accurate and current would require frequent updates to
>     the dataset itself.
>
>     While it seems useful for an archive/center to keep track of IDs
>     which reference a particular dataset, it doesn't seem right to
>     store that information IN the dataset.
>     This sounds something like a 'Getty Image' storing metadata about
>     every usage of that particular image IN the png file (which I
>     don't think they do).

These are good points.  In principle a single dataset could be
registered in multiple "global index services" and hence have multiple
datasetIDs.  However, in current practice what we have is 0 or 1 such
identifiers, so maybe we are inventing a problem that does not yet
exist.  From a data center point of view, it seems useful to know which
datasets are externally registered by some resource like the ADS, and to
be able to follow this link (that could be an actual DataLink) back to
the ADS to find related publications.
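
One way to picture the multiplicity question is to treat the datasetID
as a list that is usually empty or of length one, and to group replicas
of a dataset by it.  This is only a sketch: the mirror.example.org and
obsid=5678 entries are invented for illustration.

```python
from collections import defaultdict

# (publisherDID, [datasetIDs]) pairs; only the first entry comes from the
# thread, the other two are made-up examples.
holdings = [
    ("ivo://mast.stsci.edu?obsid=1234", ["ads/sh.hut#ngc4151_141"]),
    ("ivo://mirror.example.org?obsid=1234", ["ads/sh.hut#ngc4151_141"]),
    ("ivo://mast.stsci.edu?obsid=5678", []),  # not externally registered
]

# Map each external datasetID to every publisherDID that serves a copy.
replicas = defaultdict(list)
for publisher_did, dataset_ids in holdings:
    for dataset_id in dataset_ids:  # 0, 1 or (in principle) more entries
        replicas[dataset_id].append(publisher_did)

for dataset_id, publisher_dids in replicas.items():
    print(dataset_id, "->", publisher_dids)
```

Each replica keeps its own publisherDID, while the shared datasetID is
what links the copies back to the external index.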

If a single dataset is referenced in multiple publications it would be
best if there were a single registered datasetID for the dataset.
Probably that is at the limit of current practice, but if datasetIDs
become more routinely used, it would be nice to provide this.  Usually a
research program will either reference existing datasets that may
already have datasetIDs assigned, or create, upload, and register new
datasets as part of their research program.

 	- Doug

>
> http://www.gettyimages.com/detail/news-photo/apollo-8-view-of-earthrise-over-the-moon-news-photo/50580029
>
> Mark
>