Identifiers 2.0 Public RFC results

Mon Oct 5 17:50:21 CEST 2015

Hi Markus,

Thank you for the thorough response on my comments.  I have detailed
replies below but to offer an executive summary:

1. I'm fine with the statement that pubDIDs are neither persistent nor
resolvable per se
2. However, I think that the capability of resolution should be explicitly
exposed and optionally supported through a well-defined mechanism
3. It seems to me that Datalink would be the natural conduit for providing
DID resolution

Cheers,
-- Alberto

On Fri, Oct 2, 2015 at 4:01 AM, Markus Demleitner <
msdemlei at ari.uni-heidelberg.de> wrote:

> Dear Alberto,
>
> On Thu, Oct 01, 2015 at 10:50:04PM -0400, Accomazzi, Alberto wrote:
> > Sorry for coming late to the discussion, but I have some concerns about
> > section 4.1 of the specification (Dataset Identifiers).  What troubles me
> > is the resolution of these identifiers, and the fact that the spec itself
> > states that "This specification does not exhaustively define the resolution
> > of publisher DIDs. Instead, we recommend the following procedure...".
>
> I give you the non-exhaustivity is somewhat unfortunate, but PubDIDs
> weren't really designed to resolve, I believe.  They just turned up
> in various standards (SSAP, Obscore, some data models, then Datalink,
> now SIAv2).  My understanding is that the motivation was to have
> globally unique identifiers so you can combine responses from
> different services and still can group by something (i.e., the DID)
> to tell apart datasets.  Which is a reasonable use case, I'd say.
>

Ok, this is the part where my bias leads me to think that there is no
practical use for an identifier unless it's actionable (and therefore
resolvable).  It seems to me that in practice you are suggesting that the
services that emit these identifiers are able to resolve them at some
level, but there is no general normative resolution strategy defined by VO
standards.  Note that I am ready to accept your argument that this ain't
necessarily so, and if so I will keep my peace, but it would be nice to
have some clarity on this IMHO.

> Now, SSAP said something on how they were to be formed, in a fashion
> that was later criticized by Norman; one reason I went into the
> trouble of revising Identifiers was an attempt to fix what Norman
> criticized.  For the unique-in-union-of-service-responses use case,
> that form may not even matter, so I'm (for a change) not blaming
> SSAP.  I'm just saying that *if* we want to do other things with the
> PubDIDs, and with Datalink we're starting to do that, we're smart if
> we don't bend URI rules.
>

Agreed, I remember that well, and I'm glad to see the realignment of IVOIDs
with URIs.

> Now, if we follow URI rules, we suddenly can do some
> cute^H^H^H^Hpotentially useful things.  Parsing PubDIDs into Registry
> and local part is one of these cute things.  It's not the reason why
> the PubDIDs are there, it's something that happens to become possible
> as we give the URI parts meaning (for perspective: This is where it
> started:
> http://mail.ivoa.net/pipermail/registry/2014-January/004905.html, and
> what moderate response there was to the questionnaire essentially
> advocated something like what's in now, except of course the
> resolution procedure is entirely my fault).

Ah, thanks, I admit I didn't chime in when you asked for input back then.
My bad.
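
For readers following along, the split of a PubDID into a Registry part and
a local part mentioned above can be sketched roughly as below.  This is only
an illustration, assuming the local part is carried in the URI query
component, as in ivo://org.gavo.dc/feros/q/ssa?f04031.bdf.  The function
name split_pubdid is made up for this example and does not come from any VO
library.

```python
from urllib.parse import urlsplit


def split_pubdid(pubdid):
    """Split a PubDID into (registry_part, local_part).

    Assumption for this sketch: the local part is the URI query component.
    """
    parts = urlsplit(pubdid)
    if parts.scheme != "ivo":
        raise ValueError("not an ivo:// identifier: " + pubdid)
    # Everything before the '?' is what one would look up in the Registry.
    registry_part = parts.scheme + "://" + parts.netloc + parts.path
    return registry_part, parts.query


if __name__ == "__main__":
    print(split_pubdid("ivo://org.gavo.dc/feros/q/ssa?f04031.bdf"))
```

The sketch treats identifiers as plain strings; any normalization rules
are left out.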

>
> > Here is why I think this is a problem:
> >
> > - the spec seems to suggest that resolving these is more a matter of
> > heuristics than anything else, so different implementors may choose to
> > tweak the logic in ways that are not totally consistent
>
> True.  Unless we mandate a common interface on all services using
> PubDIDs (either on the service interfaces themselves, or in a
> separate, say, datalink capability), I think there's little we can do
> about this.  Of course it'd have been great if we could just say
> "grab this, this, and this capability and then query
> <endpoint>?ID=pubDID", but I guess we don't want to change the
> respective standards, least of all for something that's probably not
> going to be an important use case in the first place.

Well you could imagine a scenario in which you say "if you are going to
mint pubDIDs, then you must provide a service for resolving them."  The
service could very well be a Datalink endpoint, but in theory it could also
be something else which returns standard metadata, and it would have to be
defined in the Registry.  Based on my quick read of the Datalink standard I
see no reason why it couldn't provide the kind of resolution I'm thinking
about, but I admit that I don't fully understand all the details.
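
To make the "<endpoint>?ID=pubDID" idea concrete, here is a minimal sketch
of how a client might build such a request URL.  The endpoint address is a
placeholder (example.org), and resolution_url is a hypothetical helper;
which endpoint to use, and what it returns, is exactly the part that would
have to be defined in the Registry.

```python
from urllib.parse import urlencode


def resolution_url(endpoint, pubdid):
    """Build an <endpoint>?ID=pubDID style query URL for a PubDID."""
    # urlencode percent-escapes ':', '/' and '?' inside the identifier,
    # so the PubDID survives as a single parameter value.
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode({"ID": pubdid})


if __name__ == "__main__":
    # Placeholder endpoint, not a real service.
    print(resolution_url("http://example.org/datalink",
                         "ivo://org.gavo.dc/feros/q/ssa?f04031.bdf"))
```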

>
> > - even if the recipe were prescriptive, an addition of a new capability
> > (e.g. Datalink in addition to SSA) could potentially change the way a
> > particular DID is resolved, thus yielding a different result at a later time
>
> True.  I'd claim this is an advantage.  I don't mean PubDIDs to
> supplant DOIs, I don't think people should be referencing them, at
> least not in the (annoying) "performance metric" abuse of doing
> citations; as to "here's what you need to reproduce my results", I
> believe letting data providers go with progress is really an
> advantage.  If your result depended on the concrete data format, it
> was probably wrong in the first place...
>

Ok, agree with you.  I wasn't trying to imply that there needed to be
persistence associated with the results returned via the resolution process
(or even persistence of the identifier).  So long as we agree that the
semantics behind an identifier should not change I'm fine (i.e. the "thing"
that ivo://org.gavo.dc/feros/q/ssa?f04031.bdf points to is always the same
entity, although its particular manifestations may change in time).

> For actual work, I claim it's an advantage if the resolution result
> can change over time.  Consider a spectrum you got through SSA.
> After a while, the publication changes, and the thing goes to
> Obscore+Datalink.  What changes now is that you get more metadata
> (from obscore) and you get potentially a lot more data links, which
> might, for instance, tell you that further processing has been done
> or that there's now a flux calibrated version of your dataset.  If
> data providers are really careful, they might still provide the
> "original" dataset alongside the re-reduced ones in the datalink
> result.
>
> Of course, it may be that your legacy client can't deal with datalink
> results or can't speak TAP/Obscore.  Given the choice between this
> and evolvability, I'd go for evolvability: If you will, this is
> intended as "operational", not (really) "curational".
>
> > - there is no hint of what would be returned when a client tries to resolve
> > one of these DIDs, which I think is a problem for any application which
> > wants to do something with them
>
> True.  But that's really true of all those services.  The VO (perhaps
> unfortunately) lets people toss in datasets of all sorts.  The
> protocols mentioned let you discover a media type, so if you really
> wanted you could have additional information (of course, in ways
> depending on the access protocol -- sigh) before actually going for
> it, but that is, I believe, not something any existing VO client
> actually does.  As far as I can tell, all of them jump first and see
> if they fall later.
>
> > Ultimately I am still confused as to the role and usefulness of these DIDs:
> > they are not persistent, are difficult to resolve (it seems), and there is
> > no infrastructure for returning standard metadata about the resource that
> > they point to (is this correct?).  Which makes me wonder why one would not
> > want to use plain http URIs for retrieval, or more durable identifiers if
> > persistence and metadata registration are required, rather than DIDs.
>
> Essentially, there can be a 1:n relationship between PubDIDs and
> access urls, for instance, when there's different formats of a
> dataset, or there's the thing itself and the associated datalink
> document.  This one is a simple example:
>
>
> http://dc.g-vo.org/ivoidval/q/didresolve/form?__nevow_form__=genForm&pub_did=ivo%3A%2F%2Forg.gavo.dc%2Fferos%2Fq%2Fssa%3Ff04031.bdf
>
> PubDIDs let you detect such situations even in large bags of data
> from different services.  That, I think, is the entire reason why they
> were introduced.  And given the requirements on DOI-referenced
> datasets I claim we can't use DOIs for that purpose.
>

Agree.  Just to be clear: I'm not suggesting the use of DOIs in place of
PubDIDs at all.  I'm simply trying to explore if and how we can use
existing VO infrastructure to solve some of the problems related to dataset
publication and preservation.  And none of the use cases I have in mind
include a one-to-one assignment of a DOI to a PubDID.
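
As an illustration of the 1:n case described above, a client combining rows
from several services could group access URLs by PubDID along these lines.
This is a sketch only: the rows and the example.org URLs are invented, and
group_by_pubdid is a hypothetical helper rather than part of any existing
client.

```python
from collections import defaultdict


def group_by_pubdid(rows):
    """Map each PubDID to the list of access URLs seen for it."""
    grouped = defaultdict(list)
    for pubdid, access_url in rows:
        grouped[pubdid].append(access_url)
    return dict(grouped)


if __name__ == "__main__":
    # Invented rows, as if merged from two different service responses.
    rows = [
        ("ivo://org.gavo.dc/feros/q/ssa?f04031.bdf",
         "http://example.org/spectrum.fits"),
        ("ivo://org.gavo.dc/feros/q/ssa?f04031.bdf",
         "http://example.org/spectrum-datalink.xml"),
    ]
    for pubdid, urls in group_by_pubdid(rows).items():
        print(pubdid, len(urls))
```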

> That, of course, doesn't mean we have to pretend PubDIDs are
> resolvable.
>
> The part about PubDID resolution was by far the most contentious one
> of the whole standard.  Since global PubDID resolution is, I believe,
> more a gimmick than something centrally important, I could well
> leave it out.  The procedure described would still work, so there's
> not even any harm done.
>

I would suggest that we should at least consider the scenario where
resolution is assured under certain circumstances (which are under the
control of the data provider).  This could be simply indicated by the
presence of a Datalink endpoint with an optional attribute.  Why bother
with this?  Because if I know that I have a resolution service which emits
standard metadata records then I can at least begin to contemplate
registering collections of such identifiers with a persistent id some day.
If instead these pubDIDs aren't actionable then I'll be looking to build
these collections out of HTTP URIs or something else.

> So, here's my offer: If you want this out and care enough, speak up
> (or say: put it into an appendix and have a fat (red?)
> "non-normative" in its title).  If you think it's cute and it can
> remain in (and care enough), speak up, too.  I'll take private votes
> if you're shy, and will summarise on-list if necessary.
>
> If there's no signal, I'd take the liberty to take PubDID resolution
> into TCG review and let them shoot it down if they want.  If there's
> mainly negative signals, I'll take it out without further griping.
>

Well, I spoke up, so you know my point of view.  Is it silly to think that
the resolution bit belongs in a separate spec? (And is it realistic to think
that that spec will get written anytime soon?)  I note that RFC 3986 does
not discuss the actual resolution mechanism except for the relative
reference within a URI, so I think the document can stand as is without the
section in question.

Cheers,
-- Alberto

>
> Cheers,
>
>            Markus
>
>

-- 
Dr. Alberto Accomazzi
Program Manager
NASA Astrophysics Data System - http://ads.harvard.edu
Harvard-Smithsonian Center for Astrophysics - http://www.cfa.harvard.edu
60 Garden St, MS 83, Cambridge, MA 02138, USA