Identifiers 2.0 Public RFC results

Fri Oct 2 10:01:38 CEST 2015

Dear Alberto,

On Thu, Oct 01, 2015 at 10:50:04PM -0400, Accomazzi, Alberto wrote:
> Sorry for coming late to the discussion, but I have some concerns about
> section 4.1 of the specification (Dataset Identifiers).  What troubles me
> is the resolution of these identifiers, and the fact that the spec itself
> states that "This specification does not exhaustively define the resolution
> of publisher DIDs. Instead, we recommend the following procedure..." .

I grant you the non-exhaustivity is somewhat unfortunate, but PubDIDs
weren't really designed to resolve, I believe.  They just turned up
in various standards (SSAP, Obscore, some data models, then Datalink,
now SIAv2).  My understanding is that the motivation was to have
globally unique identifiers so you can combine responses from
different services and still can group by something (i.e., the DID)
to tell apart datasets.  Which is a reasonable use case, I'd say.

Now, SSAP said something on how they were to be formed, in a fashion
that was later criticized by Norman; one reason I went to the
trouble of revising Identifiers was an attempt to fix what Norman
criticized.  For the unique-in-union-of-service-responses use case,
that form may not even matter, so I'm (for a change) not blaming
SSAP.  I'm just saying that *if* we want to do other things with the
PubDIDs, and with Datalink we're starting to do that, we're smart if
we don't bend URI rules.

Now, if we follow URI rules, we suddenly can do some
cute^H^H^H^Hpotentially useful things.  Parsing PubDIDs into Registry
and local part is one of these cute things.  It's not the reason why
the PubDIDs are there, it's something that happens to become possible
as we give the URI parts meaning (for perspective: This is where it
started:
http://mail.ivoa.net/pipermail/registry/2014-January/004905.html, and
what moderate response there was to the questionnaire essentially
advocated something like what's in now, except of course the
resolution procedure is entirely my fault).
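
To make the "Registry and local part" split concrete, here is a minimal
Python sketch.  It assumes that everything before the first "?" is the
identifier of the Registry record and everything after it is the
dataset-local part, as in the GAVO example PubDID used here.  The
function name split_pubdid is made up for illustration and is not part
of any standard or library.

```python
def split_pubdid(pubdid):
    """Split a PubDID into (registry_part, local_part).

    Assumption: the first '?' separates the Registry identifier from
    the dataset-local part; a PubDID without '?' has an empty local part.
    """
    if not pubdid.lower().startswith("ivo://"):
        raise ValueError("not an ivo:// identifier: %r" % pubdid)
    registry_part, _, local_part = pubdid.partition("?")
    return registry_part, local_part


registry_part, local_part = split_pubdid(
    "ivo://org.gavo.dc/feros/q/ssa?f04031.bdf")
print(registry_part)
print(local_part)
```

The registry part is what one would look up in the Registry; the local
part only has meaning to the service that issued the PubDID.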

> Here is why I think this is a problem:
> 
> - the spec seems to suggest that resolving these is more a matter of
> heuristics than anything else, so different implementors may choose to tweak
> the logic in ways that are not totally consistent

True.  Unless we mandate a common interface on all services using
PubDIDs (either on the service interfaces themselves, or in a
separate, say, datalink capability), I think there's little we can do
about this.  Of course it'd have been great if we could just say
"grab this, this, and this capability and then query
<endpoint>?ID=pubDID", but I guess we don't want to change the
respective standards, least of all for something that's probably not
going to be an important use case in the first place.
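
Purely as an illustration of what such a uniform "<endpoint>?ID=pubDID"
query would look like, here is a small Python sketch.  No current
standard mandates this interface for all services; the endpoint URL
and the function name build_id_query are hypothetical.

```python
from urllib.parse import urlencode


def build_id_query(endpoint, pubdid):
    """Return endpoint?ID=<percent-encoded PubDID>.

    Assumption: the service accepts an ID parameter carrying the
    PubDID; this is the hypothetical common interface, not an
    existing requirement.
    """
    return endpoint + "?" + urlencode({"ID": pubdid})


# http://example.org/svc is a placeholder endpoint.
print(build_id_query("http://example.org/svc",
                     "ivo://org.gavo.dc/feros/q/ssa?f04031.bdf"))
```

The PubDID has to be percent-encoded because it contains ":", "/" and
"?" itself.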

> - even if the recipe were prescriptive, an addition of a new capability
> (e.g. Datalink in addition to SSA) could potentially change the way a
> particular DID is resolved, thus yielding a different result at a later time

True.  I'd claim this is an advantage.  I don't mean PubDIDs to
supplant DOIs, I don't think people should be referencing them, at
least not in the (annoying) "performance metric" abuse of doing
citations; as to "here's what you need to reproduce my results", I
believe letting data providers go with progress is really an
advantage.  If your result depended on the concrete data format, it
was probably wrong in the first place...

For actual work, I claim it's an advantage if the resolution result
can change over time.  Consider a spectrum you got through SSA.
After a while, the publication changes, and the thing goes to
Obscore+Datalink.  What changes now is that you get more metadata
(from obscore) and you get potentially a lot more data links, which
might, for instance, tell you that further processing has been done
or that there's now a flux-calibrated version of your dataset.  If
data providers are really careful, they might still provide the
"original" dataset alongside the re-reduced ones in the datalink
result.

Of course, it may be that your legacy client can't deal with datalink
results or can't speak TAP/Obscore.  Given the choice between this
and evolvability, I'd go for evolvability: If you will, this is
intended as "operational", not (really) "curational".

> - there is no hint of what would be returned when a client tries to resolve
> one of these DIDs, which I think is a problem for any application which
> wants to do something with them

True.  But that's really true of all those services.  The VO (perhaps
unfortunately) lets people toss in datasets of all sorts.  The
protocols mentioned let you discover a media type, so if you really
wanted you could have additional information (of course, in ways
depending on the access protocol -- sigh) before actually going for
it, but that is, I believe, not something any existing VO client
actually does.  As far as I can tell, all of them jump first and see
if they fall later.

> Ultimately I am still confused as to the role and usefulness of these DIDs:
> they are not persistent, are difficult to resolve (it seems), and there is
> no infrastructure for returning standard metadata about the resource that
> they point to (is this correct?).  Which makes me wonder why one would not
> want to use DIDs rather than plain http URIs for retrieval or more durable
> identifiers if persistence and metadata registration is required.

Essentially, there can be a 1:n relationship between PubDIDs and
access URLs, for instance, when there are different formats of a
dataset, or there's the thing itself and the associated datalink
document.  This one is a simple example:

http://dc.g-vo.org/ivoidval/q/didresolve/form?__nevow_form__=genForm&pub_did=ivo%3A%2F%2Forg.gavo.dc%2Fferos%2Fq%2Fssa%3Ff04031.bdf

PubDIDs let you detect such situations even in large bags of data
from different services.  That, I think, is the entire reason why they
were introduced.  And given the requirements on DOI-referenced
datasets I claim we can't use DOIs for that purpose.
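
The grouping use case can be sketched in a few lines of Python.  The
rows below are invented sample data: the column names pubdid and
access_url and the example.org URLs are assumptions for illustration,
not actual service output.

```python
from collections import defaultdict


def group_by_pubdid(rows):
    """Collect the access URLs belonging to each PubDID (1:n)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["pubdid"]].append(row["access_url"])
    return dict(groups)


# Hypothetical merged responses from two different services.
rows = [
    {"pubdid": "ivo://org.gavo.dc/feros/q/ssa?f04031.bdf",
     "access_url": "http://example.org/ssa/f04031.fits"},
    {"pubdid": "ivo://org.gavo.dc/feros/q/ssa?f04031.bdf",
     "access_url": "http://example.org/datalink?ID=f04031"},
    {"pubdid": "ivo://org.gavo.dc/feros/q/ssa?f04032.bdf",
     "access_url": "http://example.org/ssa/f04032.fits"},
]

for pubdid, urls in group_by_pubdid(rows).items():
    print(pubdid, len(urls))
```

Rows sharing a PubDID are treated as the same dataset no matter which
service returned them, which only works if the PubDIDs are globally
unique.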

That, of course, doesn't mean we have to pretend PubDIDs are
resolvable.

The part about PubDID resolution was by far the most contentious one
of the whole standard.  Since global PubDID resolution is, I believe,
more a gimmick than something centrally important, I could well
leave it out.  The procedure described would still work, so there's
not even any harm done.

So, here's my offer: If you want this out and care enough, speak up
(or say: put it into an appendix and have a fat (red?)
"non-normative" in its title).  If you think it's cute and it can
remain in (and care enough), speak up, too.  I'll take private votes
if you're shy, and will summarise on-list if necessary.

If there's no signal, I'd take the liberty to take PubDID resolution
into TCG review and let them shoot it down if they want.  If there are
mainly negative signals, I'll take it out without further griping.

Cheers,

           Markus