PubDIDs (and DIDs in general, maybe)

Douglas Tody dtody at nrao.edu
Wed Jan 29 09:20:11 PST 2014


It isn't necessary to expose the inner workings of one's archive to this
extent to provide this capability.  Exposing internal parameters or
long pathnames, especially in what is supposed to be a persistent
dataset identifier, will result in fragile references that either
eventually break, or lock one into a rigid archive structure to avoid
changes that break external references.

Instead one can merely expose a reference to the dataset, and save the
information required to generate the referenced dataset internally in
the archive, e.g., in a table record or file.  Then if we have

     ivo://ADS/Sa.CXO#obs/05285

then "obs/05285" is used to do something archive-side like looking up a
record with ID=05285 in the "obs" table to see how to generate the
virtual data product.  Such a dataset identifier can be persisted
indefinitely, but we have complete flexibility in how to generate the
data product within the archive, and the internal workings can evolve
without breaking the identifier.
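
As an illustration only, the archive-side lookup described above might be
sketched in Python as follows.  This is a minimal sketch under stated
assumptions, not code from any actual archive: the TABLES structure, the
"generator" field, and the function names split_did and lookup_recipe are
all hypothetical.

```python
# Hypothetical archive-side tables; the recipe contents are invented
# for illustration and do not describe any real archive.
TABLES = {
    "obs": {
        "05285": {"generator": "package_observation"},
    },
}


def split_did(did):
    """Split a dataset identifier into (registry part, table name, record ID)."""
    registry_part, sep, local_part = did.partition("#")
    table, slash, record_id = local_part.partition("/")
    if not sep or not slash or not table or not record_id:
        raise ValueError("expected <registry part>#<table>/<id>, got %r" % did)
    return registry_part, table, record_id


def lookup_recipe(did):
    """Return the internal record that says how to generate the data product."""
    _, table, record_id = split_did(did)
    try:
        return TABLES[table][record_id]
    except KeyError:
        raise LookupError("unknown dataset identifier: %r" % did) from None


print(split_did("ivo://ADS/Sa.CXO#obs/05285"))
print(lookup_recipe("ivo://ADS/Sa.CXO#obs/05285")["generator"])
```

Only the identifier string is visible to clients; the record it points to
can be changed or moved inside the archive without touching the identifier.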

Another point is that we often use dataset identifiers as foreign keys
to reference datasets.  The DID is passed back to the client and later used
to access the dataset, and access can take many forms: retrieval of the
dataset yes, but also retrieval of dataset metadata, looking up
dataLinks for the dataset, etc.  If the DID stops being a simple
identifier and instead morphs into a function call, then it loses this
capability.
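
A small sketch of the foreign-key usage, again purely illustrative: the
METADATA and DATALINKS tables and the two accessor functions are
hypothetical stand-ins for separate services, each keyed by the unmodified
DID string.  The parameterized string is the form from the quoted proposal.

```python
DID = "ivo://ADS/Sa.CXO#obs/05285"

# Hypothetical per-service tables, each keyed by the unmodified DID string.
# The values are placeholders, not real metadata or links.
METADATA = {DID: {"table": "obs", "record": "05285"}}
DATALINKS = {DID: ["(link 1)", "(link 2)"]}


def get_metadata(did):
    """Look up dataset metadata using the DID as a plain key."""
    return METADATA.get(did)


def get_datalinks(did):
    """Look up the links recorded for a dataset using the same key."""
    return DATALINKS.get(did, [])


# A parameterized variant is a different string, so it no longer works
# as a key into tables that were populated with the simple identifier.
parameterized = "ivo://ADS/Sa.CXO?obsid=05285&type=event&level=2"

print(get_metadata(DID))
print(get_datalinks(DID))
print(get_metadata(parameterized))  # None
```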

 	- Doug


On Tue, 28 Jan 2014, Arnold Rots wrote:

> The reason I was contemplating, in an earlier post, replacing the # by ?
> was that it would allow parameterization of the identifiers.
> The advantage is that it can implement very flexible drilling into datasets
> without increasing the number of identifiers in the registry.
>
> In the case quoted in earlier posts ivo://ADS/Sa.CXO#obs/05285
> currently translates into (I think):
> http://cda.harvard.edu/chaser/searchOcat.do?obsid=05285
> and the ADS keeps a full lookup table for all dataset identifiers.
> That URL brings the client to a landing page where some tar packages
> can be selected.
>
> If, instead, the identifier were written as:
> ivo://ADS/Sa.CXO?obsid=05285
> then the ADS lookup service would only need to know that ivo://ADS/Sa.CXO
> translates into http://cda.harvard.edu/chaser/searchOcat.do, for all Chandra
> identifiers.
>
> That simplifies matters already, but in addition one can allow extensions
> that drill down directly to individual files in the package:
> ivo://ADS/Sa.CXO?obsid=05285&type=event&level=2
> Everything after the question mark gets passed on to the server.
> If the server is smart, one might even allow drilling down into the file,
> selecting columns:
> ivo://ADS/Sa.CXO?obsid=05285&type=event&level=2&column=Time,pha
> or just particular values:
> ivo://ADS/Sa.CXO?obsid=05285&type=event&level=2&column=Time,pha&tstart=2010-04-15T12:30:36&tstop=2010-04-15T13:23:00
>
> This may not be a high priority for the current use of dataset identifiers
> linking entire datasets to papers, but it would be extremely useful when we
> start using persistent identifiers for published data in data discovery and
> focused data mining.
>
> In short: the persistent identifier registry only needs to be aware of the
> part in front of '?' (%3F), and then it is up to the service to define what
> parameters it allows (and that functionality needs to be queryable, of
> course); potentially that single identifier can stand for an infinite number
> of identifier instances.
> I should add that it does not matter, of course, whether the persistent
> identifier's root is ivo://ADS/<something>.<something> or a DOI.
>
> Cheers,
>
>  - Arnold
>
> -------------------------------------------------------------------------------
> Arnold H. Rots                          Chandra X-ray Science Center
> Smithsonian Astrophysical Observatory   tel:  +1 617 496 7701
> 60 Garden Street, MS 67                 fax:  +1 617 495 7356
> Cambridge, MA 02138                     arots at cfa.harvard.edu
> USA                                     http://hea-www.harvard.edu/~arots/
> -------------------------------------------------------------------------------
>
>
>
> On Thu, Jan 16, 2014 at 7:54 PM, Accomazzi, Alberto
> <aaccomazzi at cfa.harvard.edu> wrote:
>
>> At the danger of stating the obvious: we all know that Norman speaks the
>> truth.
>>
>> Thanks for catching my URN vs. URI mangling -- I admit I hadn't looked up
>> the definition of either one in quite a while.  But despite the misusage of
>> terms, my point was that the ADS persistent ids were not born as IVORNs for
>> both practical and political reasons, and I don't think it's worth
>> agonizing about whether or not they can/should be retrofitted into that
>> scheme now.  However, if agonize we must, one way out of this IMHO is to
>> simply say the following:
>>
>> 1. the resource persistent identifier is: ADS/Sa.CXO#obs/05285
>> 2. its corresponding IVO URI is: ivo://ADS/Sa.CXO%23obs/05285
>> 3. its actionable URL is (as of today):
>> http://vo.ads.harvard.edu/dv/DataResolver.cgi?ADS%2FSa.CXO%23obs%2F05285
>>
>> i.e. there is a URL-encoding step in going from the identifier to the
>> URIs.  Doesn't look as pretty as we might have wanted, but it works.
>>
>> As far as managing these identifiers, let me add a pointer to the EZID
>> system that CDL uses for its datacite DOIs and arks: http://n2t.net/ezid/
>> The resolver and registry that they maintain could easily support the ivo
>> URI scheme if we wanted to, but again no need to go that route unless we
>> need it for something that plain http doesn't already provide.
>>
>> Cheers,
>> -- Alberto
>>
>>
>>
>> On Thu, Jan 16, 2014 at 1:46 PM, Norman Gray <norman at astro.gla.ac.uk> wrote:
>>
>>>
>>> Alberto and all, hello.
>>>
>>> On 2014 Jan 16, at 15:14, Accomazzi, Alberto <aaccomazzi at cfa.harvard.edu>
>>> wrote:
>>>
>>> +1 generally, but...
>>>
>>>> I think a better way to keep this straight is to think of the "ADS"
>>>> identifiers as URNs and the ivo identifiers as URIs.
>>>
>>> Unleashing my inner lawyer: recall that URNs are (according to RFC 2396)
>>> merely one of the two types of URIs, namely "the subset of URI that are
>>> required to remain globally unique and persistent even when the resource
>>> ceases to exist or becomes unavailable."
>>>
>>> RFC 3986 <https://www.ietf.org/rfc/rfc3986.txt> mentions that '[a] URI
>>> can be further classified as a locator, a name, or both', and that '[t]he
>>> term "Uniform Resource Name" (URN) has been used historically to refer to
>>> both URIs under the "urn" scheme', but that 'Future specifications and
>>> related documentation should use the general term "URI" rather than the
>>> more restrictive terms "URL" and "URN".'
>>>
>>> All that said...
>>>
>>>> 6. Having said all of this, I still do have one basic question about
>>>> the ivo identifiers that you want to use in datalink, based on my current
>>>> understanding of them.  Specifically, given that these lack persistence and
>>>> multiple resolution features, why bother at all rather than using plain
>>>> http URIs?  I think this question is worth considering now since the
>>>> experience with the dataset ids has taught me that unless there are
>>>> compelling reasons to go with a discipline-specific, custom solution you may
>>>> be better off using what the web already gives you for free: namely http
>>>> and dns.
>>>
>>> I think this is a really important point, which isn't made often enough
>>> (cue hobbyhorse).  Without _necessarily_ discounting the existence of such
>>> 'compelling reasons', non-standard schemes do come with a cost, and they're
>>> not magic, so that if your resolution mechanism disappears, a URN-named
>>> object is just as lost, and just as nameless, as one named with a 404ed
>>> HTTP URI.
>>>
>>> I remember a workshop on persistent identifiers of a few years ago, where
>>> Stuart Weibel (I think; or it may have been John Kunze) made this point
>>> very convincingly.  Something under purl.org or under id.loc.gov has an
>>> "institutional commitment to persistence" which is worth an awful lot more
>>> than any amount of indirection that you get through a fancy URI scheme.  As
>>> Stuart (or whoever) said, "loc.gov isn't going away any time soon".
>>>
>>> DOIs do, I think, have a pretty compelling reason to be a special URI
>>> scheme, but the thing that's key about DOIs is not the scheme, or the
>>> Handle-based lookup mechanism, but precisely the "institutional commitment
>>> to persistence" that they represent.
>>>
>>> I don't plan to reopen any discussion here about IVORNs -- fear not,
>>> everyone -- but will simply note that, on general principles, obsessing
>>> about the punctuation of URIs is probably a distant second in importance to
>>> developing and planning these sorts of institutional commitments within the
>>> IVOA.
>>>
>>> All the best,
>>>
>>> Norman
>>>
>>>
>>> --
>>> Norman Gray  :  http://nxg.me.uk
>>> SUPA School of Physics and Astronomy, University of Glasgow, UK
>>>
>>>
>>
>>
>> --
>> Dr. Alberto Accomazzi
>> Program Manager
>> NASA Astrophysics Data System - http://ads.harvard.edu
>> Harvard-Smithsonian Center for Astrophysics - http://www.cfa.harvard.edu
>> 60 Garden St, MS 83, Cambridge, MA 02138, USA
>>
>

