PubDIDs (and DIDs in general, maybe)

Arnold Rots arots at cfa.harvard.edu
Fri Jan 31 12:27:04 PST 2014


I think you are missing the point here.
The argument I am making is that extending a persistent identifier
with additional parameters would allow one to drill down into compound
data objects; for instance, to select individual files from multi-file
datasets. Persistence requires that whatever is issued as a persistent
identifier will remain supported in perpetuity by the repository and its
successors. Fragility is simply not allowed.
However, one might envision allowing non-persistent options: these
would be parameters and parameter values that are not included
in the persistent identifier, but can, at any given time, be obtained
from the repository through a metadata query option; so it would
be a dynamic construction. It would require that something akin to:
ivo://ADS/Sa.CXO?mode=optionquery
should be included in the persistent set.
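
A minimal sketch of that split, in Python, assuming a hypothetical
repository. The persistent parameter names, the fetch_current_options
stand-in for the mode=optionquery call, and the option names it returns
are all invented for illustration; nothing here describes an existing
service.

```python
from urllib.parse import parse_qsl

# Hypothetical: parameter names the repository commits to supporting forever.
PERSISTENT_PARAMS = {"obsid", "mode"}


def fetch_current_options(base_id):
    """Stand-in for querying <base_id>?mode=optionquery.

    A real repository would answer this over the network; the set
    returned here is invented for illustration.
    """
    return {"type", "level", "column"}


def classify_params(identifier):
    """Split an identifier's parameters into persistent, optional, unknown."""
    base_id, _, query = identifier.partition("?")
    params = dict(parse_qsl(query))
    dynamic = fetch_current_options(base_id)
    persistent = {k: v for k, v in params.items() if k in PERSISTENT_PARAMS}
    optional = {
        k: v
        for k, v in params.items()
        if k in dynamic and k not in PERSISTENT_PARAMS
    }
    unknown = sorted(set(params) - PERSISTENT_PARAMS - dynamic)
    return base_id, persistent, optional, unknown
```

Only the names in the persistent set would carry a guarantee; the optional
ones are valid for as long as the option query keeps advertising them.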

Cheers,

  - Arnold

-------------------------------------------------------------------------------
Arnold H. Rots                          Chandra X-ray Science Center
Smithsonian Astrophysical Observatory   tel:  +1 617 496 7701
60 Garden Street, MS 67                 fax:  +1 617 495 7356
Cambridge, MA 02138                     arots at cfa.harvard.edu
USA                                     http://hea-www.harvard.edu/~arots/
-------------------------------------------------------------------------------



On Wed, Jan 29, 2014 at 12:20 PM, Douglas Tody <dtody at nrao.edu> wrote:

> It isn't necessary to expose the inner workings of one's archive to this
> extent to provide this capability.  Exposing either internal parameters or
> long pathnames, especially in what is supposed to be a persistent
> dataset identifier, will result in fragile references that either
> eventually break, or lock one into a rigid archive structure to avoid
> changes that break external references.
>
> Instead one can merely expose a reference to the dataset, and save the
> information required to generate the referenced dataset internally in
> the archive, e.g., in a table record or file.  Then if we have
>
>     ivo://ADS/Sa.CXO#obs/05285
>
> then "obs/05285" is used to do something archive-side like looking up a
> record with ID=05285 in the "obs" table to see how to generate the
> virtual data product.  Such a dataset identifier can be persisted
> indefinitely, but we have complete flexibility in how to generate the
> data product within the archive, and the internal workings can evolve
> without breaking the identifier.
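
A minimal sketch of the archive-side lookup described above, assuming a
hypothetical in-memory table. The table layout, the record contents, and
the function name lookup_recipe are invented for illustration and do not
describe any real archive.

```python
# Hypothetical archive-side tables; the record contents are invented.
OBS_TABLE = {
    # IDs are kept as strings so the leading zero in "05285" survives.
    "05285": {"recipe": "package_observation", "inputs": ["evt2", "asol"]},
}
TABLES = {"obs": OBS_TABLE}


def lookup_recipe(dataset_id):
    """Map the opaque fragment of a dataset identifier to an internal record.

    The identifier exposes only "<table>/<id>"; how the data product is
    generated stays inside the archive and can change freely.
    """
    _base, sep, fragment = dataset_id.partition("#")
    if not sep:
        raise ValueError("identifier has no '#' fragment: " + dataset_id)
    table_name, _, record_id = fragment.partition("/")
    return TABLES[table_name][record_id]
```

Because clients only ever see the key, the record behind it can be
rewritten without touching any published identifier.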
>
> Another point is that we often use dataset identifiers as foreign keys
> to reference datasets.  The DID is passed back to the client and later used
> to access the dataset, and access can take many forms: retrieval of the
> dataset yes, but also retrieval of dataset metadata, looking up
> dataLinks for the dataset, etc.  If the DID stops being a simple
> identifier and instead morphs into a function call, then it loses this
> capability.
>
>         - Doug
>
>
>
> On Tue, 28 Jan 2014, Arnold Rots wrote:
>
>> The reason I was contemplating, in an earlier post, replacing the # by ?
>> was that it would allow parameterization of the identifiers.
>> The advantage is that it can implement very flexible drilling into datasets
>> without increasing the number of identifiers in the registry.
>>
>> In the case quoted in earlier posts ivo://ADS/Sa.CXO#obs/05285
>> currently translates into (I think):
>> http://cda.harvard.edu/chaser/searchOcat.do?obsid=05285
>> and the ADS keeps a full lookup table for all dataset identifiers.
>> That URL brings the client to a landing page where some tar packages
>> can be selected.
>>
>> If, instead, the identifier were written as:
>> ivo://ADS/Sa.CXO?obsid=05285
>> then the ADS lookup service would only need to know that ivo://ADS/Sa.CXO
>> translates into http://cda.harvard.edu/chaser/searchOcat.do, for all
>> Chandra identifiers.
>>
>> That simplifies matters already, but in addition one can allow extensions
>> that drill down directly to individual files in the package:
>> ivo://ADS/Sa.CXO?obsid=05285&type=event&level=2
>> Everything after the question mark gets passed on to the server.
>> If the server is smart, one might even allow drilling down into the file,
>> selecting columns:
>> ivo://ADS/Sa.CXO?obsid=05285&type=event&level=2&column=Time,pha
>> or just particular values:
>> ivo://ADS/Sa.CXO?obsid=05285&type=event&level=2&column=Time,pha&tstart=2010-04-15T12:30:36&tstop=2010-04-15T13:23:00
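
The resolution step described above can be sketched as follows, assuming
a resolver that holds one entry per identifier root. The mapping itself
is taken from the message; the dictionary and the function name resolve
are hypothetical, and whether the service would accept parameters beyond
obsid is an assumption.

```python
# One registry entry per identifier root, not one per dataset.
BASE_TO_SERVICE = {
    "ivo://ADS/Sa.CXO": "http://cda.harvard.edu/chaser/searchOcat.do",
}


def resolve(identifier):
    """Translate a parameterized identifier into a service URL.

    Only the part in front of '?' is looked up; everything after it is
    passed on to the service unchanged.
    """
    base, sep, query = identifier.partition("?")
    service = BASE_TO_SERVICE[base]
    return service + sep + query
```

Applied to ivo://ADS/Sa.CXO?obsid=05285, this yields the landing-page URL
quoted earlier in the message.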
>>
>> This may not be a high priority for the current use of dataset identifiers,
>> linking entire datasets to papers, but it would be extremely useful when we
>> start using persistent identifiers for published data in data discovery and
>> focused data mining.
>>
>> In short: the persistent identifier registry only needs to be aware of the
>> part in front of '?' (%3F), and then it is up to the service to define what
>> parameters it allows (and that functionality needs to be queryable, of
>> course); potentially that single identifier can stand for an infinite number
>> of identifier instances.
>> I should add that it does not matter, of course, whether the persistent
>> identifier's root is ivo://ADS/<something>.<something> or a DOI.
>>
>> Cheers,
>>
>>  - Arnold
>>
>>
>>
>>
>> On Thu, Jan 16, 2014 at 7:54 PM, Accomazzi, Alberto <aaccomazzi at cfa.harvard.edu> wrote:
>>
>>> At the risk of stating the obvious: we all know that Norman speaks the
>>> truth.
>>>
>>> Thanks for catching my URN vs. URI mangling -- I admit I hadn't looked up
>>> the definition of either one in quite a while.  But despite the misuse of
>>> terms, my point was that the ADS persistent ids were not born as IVORNs for
>>> both practical and political reasons, and I don't think it's worth
>>> agonizing about whether or not they can/should be retrofitted into that
>>> scheme now.  However, if agonize we must, one way out of this IMHO is to
>>> simply say the following:
>>>
>>> 1. the resource persistent identifier is: ADS/Sa.CXO#obs/05285
>>> 2. its corresponding IVO URI is: ivo://ADS/Sa.CXO%23obs/05285
>>> 3. its actionable URL is (as of today):
>>> http://vo.ads.harvard.edu/dv/DataResolver.cgi?ADS%2FSa.CXO%23obs%2F05285
>>>
>>> i.e. there is a URL-encoding step in going from the identifier to the
>>> URIs.  Doesn't look as pretty as we might have wanted, but it works.
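
The encoding step can be sketched with Python's standard library. The
three strings come from the list above; the function names are
hypothetical, and the resolver URL is only what the message gives "as of
today".

```python
from urllib.parse import quote

RESOLVER = "http://vo.ads.harvard.edu/dv/DataResolver.cgi"


def to_ivo_uri(persistent_id):
    """Form 2: keep '/' literal, percent-encode '#' so it is not read as a
    fragment delimiter."""
    return "ivo://" + quote(persistent_id, safe="/")


def to_resolver_url(persistent_id):
    """Form 3: percent-encode every reserved character, including '/',
    and pass the result as the query string."""
    return RESOLVER + "?" + quote(persistent_id, safe="")
```

The only difference between the two forms is whether '/' is treated as
safe during encoding.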
>>>
>>> As far as managing these identifiers, let me add a pointer to the EZID
>>> system that CDL uses for its datacite DOIs and arks:
>>> http://n2t.net/ezid/
>>> The resolver and registry that they maintain could easily support the ivo
>>> URI scheme if we wanted to, but again no need to go that route unless we
>>> need it for something that plain http doesn't already provide.
>>>
>>> Cheers,
>>> -- Alberto
>>>
>>>
>>>
>>> On Thu, Jan 16, 2014 at 1:46 PM, Norman Gray <norman at astro.gla.ac.uk> wrote:
>>>
>>>
>>>> Alberto and all, hello.
>>>>
>>>> On 2014 Jan 16, at 15:14, Accomazzi, Alberto <aaccomazzi at cfa.harvard.edu> wrote:
>>>>
>>>> +1 generally, but...
>>>>
>>>>> I think a better way to keep this straight is to think of the "ADS"
>>>>> identifiers as URNs and the ivo identifiers as URIs.
>>>>
>>>> Unleashing my inner lawyer: recall that URNs are (according to RFC 2396)
>>>> merely one of the two types of URIs, namely "the subset of URI that are
>>>> required to remain globally unique and persistent even when the resource
>>>> ceases to exist or becomes unavailable."
>>>>
>>>> RFC 3986 <https://www.ietf.org/rfc/rfc3986.txt> mentions that '[a] URI
>>>> can be further classified as a locator, a name, or both', and that '[t]he
>>>> term "Uniform Resource Name" (URN) has been used historically to refer to
>>>> both URIs under the "urn" scheme', but that 'Future specifications and
>>>> related documentation should use the general term "URI" rather than the
>>>> more restrictive terms "URL" and "URN".'
>>>>
>>>> All that said...
>>>>
>>>>> 6. Having said all of this, I still do have one basic question about
>>>>> the ivo identifiers that you want to use in datalink, based on my current
>>>>> understanding of them.  Specifically, given that these lack persistence and
>>>>> multiple resolution features, why bother at all rather than using plain
>>>>> http URIs?  I think this question is worth considering now since the
>>>>> experience with the dataset ids has taught me that unless there are
>>>>> compelling reasons to go with a discipline-specific, custom solution you may
>>>>> be better off using what the web already gives you for free: namely http
>>>>> and dns.
>>>>
>>>> I think this is a really important point, which isn't made often enough
>>>> (cue hobbyhorse).  Without _necessarily_ discounting the existence of such
>>>> 'compelling reasons', non-standard schemes do come with a cost, and they're
>>>> not magic, so that if your resolution mechanism disappears, a URN-named
>>>> object is just as lost, and just as nameless, as one named with a 404ed
>>>> HTTP URI.
>>>>
>>>> I remember a workshop on persistent identifiers of a few years ago, where
>>>> Stuart Weibel (I think; or it may have been John Kunze) made this point
>>>> very convincingly.  Something under purl.org or under id.loc.gov has an
>>>> "institutional commitment to persistence" which is worth an awful lot more
>>>> than any amount of indirection that you get through a fancy URI scheme.  As
>>>> Stuart (or whoever) said, "loc.gov isn't going away any time soon".
>>>>
>>>> DOIs do, I think, have a pretty compelling reason to be a special URI
>>>> scheme, but the thing that's key about DOIs is not the scheme, or the
>>>> Handle-based lookup mechanism, but precisely the "institutional commitment
>>>> to persistence" that they represent.
>>>>
>>>> I don't plan to reopen any discussion here about IVORNs -- fear not,
>>>> everyone -- but will simply note that, on general principles, obsessing
>>>> about the punctuation of URIs is probably a distant second in importance to
>>>> developing and planning these sorts of institutional commitments within the
>>>> IVOA.
>>>>
>>>> All the best,
>>>>
>>>> Norman
>>>>
>>>>
>>>> --
>>>> Norman Gray  :  http://nxg.me.uk
>>>> SUPA School of Physics and Astronomy, University of Glasgow, UK
>>>>
>>>>
>>>>
>>>
>>> --
>>> Dr. Alberto Accomazzi
>>> Program Manager
>>> NASA Astrophysics Data System - http://ads.harvard.edu
>>> Harvard-Smithsonian Center for Astrophysics - http://www.cfa.harvard.edu
>>> 60 Garden St, MS 83, Cambridge, MA 02138, USA
>>>
>>>
>>