Cube model - Dataset IDs

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Mon Mar 23 13:12:59 CET 2015


Hi Mark,

Thanks for untangling this.

May I add my 2 cents, wantonly mixing Registry and publisher
perspectives?

On Thu, Mar 19, 2015 at 11:25:58AM -0400, CresitelloDittmar, Mark wrote:
>   I'll use some MAST file info as an example (but it doesn't have
> publisherDID)
>   The resulting file would contain:
>     Curation.publisher = "MAST"
>     Curation.publisherID = "ivo://mast.stsci.edu"

I don't think that should be a separate metadata item, as it can be
inferred from the pubDID.

Also, if it *is* given, it can't be that particular IVORN, as it
references an Authority record (as all IVORNs without a resource key
must).  A publisher, on the other hand, would be an organisation, so
the IVORN would be more like ivo://mast.stsci.edu/org (a quick search
hasn't turned up an actual vr:organisation record that might be
pertinent to MAST, so I made that up).
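
To illustrate the distinction between an authority-only IVORN and one
with a resource key, here is a minimal Python sketch.  The function
names split_ivorn and is_authority_only are made up for this example,
and the parsing is deliberately naive (it is not a full implementation
of the IVOA Identifiers syntax):

```python
def split_ivorn(ivorn):
    """Return (authority, resource_key) for an ivo:// identifier.

    resource_key is '' when the IVORN names only an authority.
    Any ?query or #fragment part is ignored here.
    Hypothetical helper, simplified parsing.
    """
    prefix = "ivo://"
    if not ivorn.lower().startswith(prefix):
        raise ValueError("not an IVORN: %r" % ivorn)
    rest = ivorn[len(prefix):]
    # Drop anything after the first '?' or '#'.
    for sep in "?#":
        rest = rest.split(sep, 1)[0]
    authority, _, resource_key = rest.partition("/")
    return authority, resource_key


def is_authority_only(ivorn):
    """True if the IVORN has no resource key (hypothetical helper)."""
    return split_ivorn(ivorn)[1] == ""
```

With this, "ivo://mast.stsci.edu" counts as authority-only, while the
made-up "ivo://mast.stsci.edu/org" carries the resource key "org".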

>     Curation.publisherDID = "ivo://mast.stsci.edu?obsid=1234"  <some 'mast'
> specific ID, (using above for basis?)>

That one would presumably not be based off the authority, either,
though I don't think there are or should be formal rules against
that.  Nevertheless, I'd recommend basing the DIDs on the service
or data collection they're in, so maybe

  ivo://mast.stsci.edu/particular_mission?obsid=1234

>     DataID.datasetID = "ads/sh.hut#ngc4151_141"

I'm not sure if I'd always point at ADS when talking about persistent
ids -- if they don't mind, it might be ok, but frankly these days I'm
thinking more about DOIs minted in any way convenient to the data
provider.  Anyway, my feeling is we should have input from ADS before
committing to descriptive prose for DataID.datasetID.

>     DataID.creatorDID = "ngc4151_141"

The creatorDID should be an IVORN as well, or at least that's how I
read 4.1.2.13 of REC-SSA-1.1, so this might then look like

  ivo://particular_mission.nasa/cubes?ngc4151_141

> Questions..
> 1)  Is it possible for an archive/data center/data provider, to NOT have a
> registered publisherID?

Ah well.  We might want everyone to do the right thing with the
Registry, but I think we should be planning for them not doing it.

>      In other words, NOT be able to assign identifiers.  Instead, it relies
> on an external 'global index service'

As you know, the thing I'm always after is to have standard file
formats for all kinds of data products.  And if we want to establish
our format as "the" standard format, we have to tell them what to
write if they don't want to bother with registering themselves and
their data collections just yet.

So, I don't think we should require, in our file format, any kind of
id, persistent or not (that's different for protocols, which may only
make sense *within* the VO).

>      Curation.publisher = "MAST"
>      Curation.publisherID = <none>
>      Curation.publisherDID = "ads/sh.hut#ngc4151_141"
>      DataID.datasetID = "ads/sh.hut#ngc4151_141"
>      DataID.creatorDID = "ngc4151_141"
> 
>     I'm not sure which location this 'global index id' should go.. so put
> it at both.

I'm against having the same value in two places on principle, but in
particular in this case.  *If* there's a publisher DID, I must be able to
rely on it being an IVORN, meaning I can take the stuff in front of
the ? and resolve it in a registry -- otherwise the whole exercise
becomes a bit moot.

So, if someone isn't part of the VO, and hence the dataset isn't
available through a VO service, Curation.publisherDID should be NULL.
That's fair, and it has clear semantics.
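
The rule "take the stuff in front of the ? and resolve it in a
registry" could be sketched roughly as follows in Python.  The name
registry_part is invented for this illustration, and the actual
registry lookup is left out; the function only extracts the part one
would look up, and returns None for anything that is not an IVORN:

```python
def registry_part(publisher_did):
    """Return the part of a publisher DID to look up in a registry.

    Returns None when the value is missing or is not an IVORN, which
    corresponds to treating Curation.publisherDID as NULL.
    Hypothetical helper; no registry query is performed here.
    """
    if publisher_did is None:
        return None
    if not publisher_did.lower().startswith("ivo://"):
        return None
    # Everything in front of the first '?' identifies the resource.
    return publisher_did.partition("?")[0]
```

For the example above, "ivo://mast.stsci.edu/particular_mission?obsid=1234"
yields "ivo://mast.stsci.edu/particular_mission", whereas
"ads/sh.hut#ngc4151_141" yields None.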

>  2) My ignorance surrounding identifiers may become apparent here, but...
>      I'm not sure if a single dataset can be tagged with >1 identifier from
> any given
>      'global index service', but here are, presumably, multiple 'global

Well, that's a bit like with the cat on your porch.  You'll notice
your neighbours (other managers of dataset ids) will probably have
tagged it with other identifiers than you have, and depending on what
kind of food they're offering, it might accept all of these tags (for
a data product: make the identifier resolve to it).

For animal protection reasons, though, I'm totally against everyone
even trying to add their tags to the poor cat.  It'll only mess up
its collar, it's probably going to try to get rid of them (and trust
me, it will eventually succeed), and tag it all you want, without a
can opener it won't care about your tags anyway (meaning: just
because you add an identifier to the image metadata doesn't mean the
identifier magically resolves to it; the really important part is the
resolution service, and that doesn't depend on the in-dataset
metadata).

There's some point in allowing a single tag, though.  This would be
the tag of the person paying the veterinarian's bill if push comes to
shove, in particular if that tag allows finding her in that case.  In
the case of a data product, that would be exactly one identifier
designated by the dataset's creator; that might help in some weird
corner cases.

In general I'd say the utility of embedding IDs in the data products
is limited, as ids are really for resolving to data and metadata, and
if you have the data product, you already have both.  Thus I'd argue
as long as we clearly say what we want, it doesn't matter too much
what in detail it actually is.

>      If the ADS IDs are publication based, then this would be a
>      growing list, as a dataset is used in various research.
>      Keeping this sort of metadata accurate and current would
>      require frequent updates to the dataset itself.

No, that certainly would not be the case -- as you say, that'd
simply be an unmaintainable mess.


Anyway, my understanding now is:

Curation.publisherDID is assigned by the publisher and used in
protocols like SSA or Datalink.
DataID.datasetID is some sort of global, persistent identifier, for instance,
a DOI, but not specified in detail.

As to making them required, I'd say both should be optional and
atomic (i.e., if they're there, there's only one of them per dataset).
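
As a rough sketch of what "optional and atomic" could mean for
software reading such files, consider the Python below.  The class
and attribute names are assumptions made for this example and are not
taken from any data model document:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetIdentifiers:
    """Hypothetical container for the two identifiers discussed here."""

    # Both default to None: neither identifier is required.
    publisher_did: Optional[str] = None
    dataset_id: Optional[str] = None

    def __post_init__(self):
        # Atomic: each slot holds a single string, never a list of ids.
        for name in ("publisher_did", "dataset_id"):
            value = getattr(self, name)
            if value is not None and not isinstance(value, str):
                raise TypeError("%s must be one string or None" % name)
```

A dataset from a provider outside the VO would then simply be
DatasetIdentifiers(), possibly with only dataset_id filled in.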

Does everyone agree with this?  And if so, can we change the SDM2
document to reflect these semantics?

Cheers,

          Markus


