Resource Identifiers: discussion synthesis

Tue Jun 3 14:51:52 PDT 2003

Hi all,

Thanks to everyone who provided comments on the ID proposal.  As I've 
mentioned, I felt like it was in need of wider discussion.  And while the 
discussion may seem to have gotten a bit chaotic, I think we can distill 
some cogent issues.  From my reading, I believe the comments that have 
been raised address the following questions:

  1. Should IDs carry any semantic information?

  2. Who chooses/controls what is contained in the components of a 
     specific ID?  We specifically discussed who chooses the AuthorityID.

  3. How do IDs address the problem of transience and replication?  Do 
     they need to do more?  

In this message I will address each in turn.  While it is true that these 
questions are interrelated, I think conflating them makes it harder to make 
progress.  I encourage others to keep them separate as far as possible.  

Here's my punchline ahead of time:  I think all of the concerns raised can 
be addressed with a minor adjustment to the original proposal regarding 
who controls AuthorityIDs.  

--------------------------------------------------------------------------
1. Should IDs carry any semantic information?
--------------------------------------------------------------------------

Personally, I'm finding the myriad analogies to the book industry, email,
and stock tickers of limited value as they too quickly veer off the mark
and confuse the issue.  The best analogy for a VO identifier (which is so
close, it could cease to be an analogy) is one we all understand: URLs.  
Do URLs carry semantic content?  Sure they do: from a URL, we can often
deduce all sorts of things about what it points to.  Is there a standard
for how semantic meaning is encoded?  Absolutely not.  Do machines
universally rely on interpreting the semantic content?  No.  (In general,
the programs that do "micro-parse" the URLs are necessarily controlled by
the same people that control the content.)  The ID proposal intends no
more than this.

Whether or not the URL characters contain anything meaningful to anyone 
does not affect the ability of browsers and servers to talk to each other.  
Nevertheless, I think we can say that it is incredibly helpful that we can 
put little messages into them that help humans remember them, copy them 
without error, and debug the systems that use them.  This brings up a 
related question: are URLs intended for human consumption?  The answer is, 
no, normally not.  When hidden behind highlighted text, they can usually 
be ignored.  Nevertheless, humans do occasionally handle them directly.  

An advantage of adopting a URI-based identifier is that it allows this same 
flexibility, in a manner that people are used to from URLs.  (The XML version 
is equivalent in composition; however, the parseable components are tagged 
individually to allow easier handling through XML parsers.)  Where the URL 
analogy *potentially* breaks down is addressed in the next section.
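
As a minimal sketch of that point (not part of the proposal), the following
Python splits a URI-form identifier into its AuthorityID and ResourceKey and
compares identifiers without interpreting what the characters mean.  The
`ivo` scheme name, the `example.org` authority, and the names `VOIdentifier`,
`parse_id`, and `same_resource` are assumptions made purely for illustration.

```python
from typing import NamedTuple

# Assumed scheme name; the discussion above does not fix one.
SCHEME = "ivo"


class VOIdentifier(NamedTuple):
    authority_id: str   # names the namespace, typically DNS-like
    resource_key: str   # unique only within that namespace


def parse_id(uri: str) -> VOIdentifier:
    """Split a URI-form ID into its two components.

    The components are treated as opaque strings: nothing here tries to
    read meaning out of them.
    """
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme != SCHEME:
        raise ValueError(f"not a {SCHEME}:// identifier: {uri!r}")
    authority_id, _, resource_key = rest.partition("/")
    if not authority_id:
        raise ValueError(f"missing AuthorityID in {uri!r}")
    return VOIdentifier(authority_id, resource_key)


def same_resource(a: str, b: str) -> bool:
    """Two IDs match only if both components are identical strings."""
    return parse_id(a) == parse_id(b)
```

A human may read `surveys/deep-field` in `ivo://example.org/surveys/deep-field`
as a hint about the content, but the code above would behave identically if
the ResourceKey were `x7f3`.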

 Q:  Should IDs carry any semantic information?
 A:  They can, but they are not required to.  More precisely, the ID 
     standard and its use in standard registry interfaces should not rely 
     on it.  

Is this acceptable?

---------------------------------------------------------------------------
2.  Who controls the components of an ID? 
---------------------------------------------------------------------------

Back in February, the NVO project generated a set of requirements for IDs; 
one of them stated that the framework should maximize the freedom of data 
providers to choose identifiers for resources under their control.  This 
was the major point of discussion at the NVO telecon.  

The ID proposal intended that the AuthorityID (which would typically look 
like a DNS name) would be strictly associated with a standard registry 
interface.  In my mind, this was simply a mechanism to help ensure that 
IDs in total are globally unique: once a registry's AuthorityID is 
determined unique, the registry need only ensure that all its ResourceKeys 
are locally unique.  Thus, the AuthorityID establishes a namespace that 
the registry ultimately controls.  Consequently, the data provider does not 
control the namespace *unless* they decide to run their own registry.  It 
was assumed that most providers would run their own, so this restriction 
would only affect a few (?) smaller providers.  

This is where the URL analogy breaks down.  A URL assumes that there is a
service running on the machine with the DNS name matching the URL's
host-id component.  The intention of the VO ID proposal was similar but a
bit more vague: there would be a registry interface running on the registry
machine that, given the ID, could return a resource description.  However,
that interface is not yet defined, and it was not determined if the ID
should be automatically convertible to a service interface URL/handle.  

Critics of the proposal suggested that the choice of an AuthorityID, 
which establishes a namespace, should be controlled by the registrant.
This would allow organizations to have complete control over their own 
namespace without having to implement any standard registry service.  If 
we have full registries that really do contain all registered resources, 
then we do not need the AuthorityID to be tied to the registry where the 
resource was first registered.  

It is worth noting that regardless of who controls the AuthorityID,
introducing a new one will always require that it be checked against a
VO-wide registry of namespaces to determine if it has been used
before.  Thus, revising the proposal to tie the AuthorityID to an
organization does not change how we determine if the AuthorityID is
already in use.  It is harder, though, to ensure that the "owner" of
the namespace retains sole control over its use:  if a publisher
registers some resources in a namespace with one registry and some
with another, both registries need to know that the publisher truly
"owns" the namespace it is attempting to refer to.  It can be done
(e.g. with grid-based certificates).  
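
One way to picture the bookkeeping described above is the following Python
sketch.  It assumes a single VO-wide table of namespaces that every registry
can consult; the class and method names (`NamespaceTable`, `claim`,
`is_owner`, `Registry.register`, `Registry.has`) are hypothetical, and the
plain `publisher` string merely stands in for whatever authentication
(certificates, for instance) would establish ownership in practice.

```python
from typing import Dict, Tuple


class NamespaceTable:
    """Hypothetical VO-wide record of which publisher owns each AuthorityID."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}

    def claim(self, authority_id: str, publisher: str) -> None:
        # A new AuthorityID must be checked against all existing ones.
        owner = self._owners.get(authority_id)
        if owner is not None and owner != publisher:
            raise ValueError(f"{authority_id!r} is already owned by {owner!r}")
        self._owners[authority_id] = publisher

    def is_owner(self, authority_id: str, publisher: str) -> bool:
        return self._owners.get(authority_id) == publisher


class Registry:
    """One of possibly many registries sharing the same NamespaceTable."""

    def __init__(self, namespaces: NamespaceTable) -> None:
        self._namespaces = namespaces
        self._resources: Dict[Tuple[str, str], str] = {}

    def register(self, publisher: str, authority_id: str,
                 resource_key: str, description: str) -> None:
        # Every registry verifies ownership before accepting a resource.
        if not self._namespaces.is_owner(authority_id, publisher):
            raise PermissionError(
                f"{publisher!r} does not own namespace {authority_id!r}")
        key = (authority_id, resource_key)
        if key in self._resources:
            raise ValueError(f"{key} is already registered here")
        self._resources[key] = description

    def has(self, authority_id: str, resource_key: str) -> bool:
        return (authority_id, resource_key) in self._resources
```

With this arrangement a publisher can claim `example.org` once and then
register resources under it at two different registries, while a different
publisher attempting to use `example.org` is refused by both.  The sketch
does not try to keep ResourceKeys unique across registries.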

The fundamental question, though, is: does the ID specification need
to be locked into the registry infrastructure?  At most, all the ID
framework needs is a way to determine who owns an AuthorityID.  If the
standard does not lock IDs into the registry infrastructure, then we
can potentially allow a number of implementations, either
simultaneously or as a sequence that evolves over time, that encourage
or enforce ID uniqueness.  This could include an implementation that
requires the publisher run a particular registry service.

  Q: Who controls the components of an ID?
  Original A:  the registry
  Revised A:  the registering data provider.  Thus, the AuthorityID no 
    longer implies the existence of any registry service specific to the 
    AuthorityID.  The specification merely requires that AuthorityIDs
    are uniquely associated with the organizations (or individuals) that
    own them. 

------------------------------------------------------------------------
3.  How do IDs address Transience and Replication? 
------------------------------------------------------------------------

This issue, as Arnold points out, touches on the need for having
persistent names that can refer to a resource in perpetuity even when
support of the resource changes over time or is replicated across
multiple locations (See http://archives.us-vo.org/metadata/0762.html).  
This is exactly what a URN (a type of URI) does.  

In my mind, the ID proposal does *not* address the use case Arnold
described; that is, VO identifiers are not URNs.  In particular, if a
data collection is mirrored at two different locations and thus
accessible through interfaces with different URLs/handles, then the
two mirrors are considered distinct and therefore have different
resource identifiers.  VO identifiers are tied, via the AuthorityID, to
the organization that maintains the resource they identify.  If
the access to a resource moves to a different machine, its ID need not
change; the resource description it points to can be updated to the
new location.  However, if curation is transferred to a new
organization, the ID cannot persist unless ownership of the original
namespace is transferred in whole as well.  

A URN scheme is certainly needed; however, we also need a way of
distinguishing mirrors.  Thus, VO identifiers should not be URNs.  

A URN system will necessarily need to build on top of both the ID
standard as well as registry interfaces.  In particular (as Arnold
explains), registries should be able to map a URN to a set of matching
identifiers that are mirrors of the same resource. 
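
A rough sketch of such a mapping, under the assumption that each mirror is
registered together with the URN of the logical resource it replicates,
might look like this in Python.  The `MirrorIndex` class, the `urn:example:`
names, and the `ivo://` identifiers are all hypothetical; nothing here is
defined by the ID proposal.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


class MirrorIndex:
    """Hypothetical layer above the ID standard: URN -> VO identifiers."""

    def __init__(self) -> None:
        self._ids_by_urn: DefaultDict[str, Set[str]] = defaultdict(set)

    def add(self, urn: str, vo_id: str) -> None:
        # Each mirror keeps its own distinct VO identifier.
        self._ids_by_urn[urn].add(vo_id)

    def resolve(self, urn: str) -> List[str]:
        # Return every registered identifier for the URN, sorted for
        # stable output; an unknown URN yields an empty list.
        return sorted(self._ids_by_urn.get(urn, set()))


index = MirrorIndex()
index.add("urn:example:sky-survey", "ivo://example.org/sky-survey")
index.add("urn:example:sky-survey", "ivo://mirror.example.net/sky-survey")
mirrors = index.resolve("urn:example:sky-survey")
```

The two mirrors remain distinguishable by ID, while the URN gives a
persistent name for the resource as a whole.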

 Q: How do IDs address the problem of transience and replication?  
 A: They do not.  Replicated resources have different IDs; one must
    consult the resources' metadata to know that they are mirrors.
    IDs may persist when the resource moves around within its own
    namespace; however, they cannot persist when the resource is
    curated by a new organization with a different namespace.

 Q: Do they need to do more?
 A: No.  A URN system should be built on top of the ID standard and
    registry interfaces.  

---------------------------------------------------------------------
In conclusion, I am recommending that an AuthorityID be "owned" and 
controlled by a registering organization, but that the mechanism for 
encouraging or enforcing that control not be part of the ID specification.  

My apologies for the length of this installment, but I hope it will help 
focus our discussion. 

cheers,
Ray