Resource Identifiers: discussion synthesis

Tue Jun 3 16:42:42 PDT 2003

I think I buy all that ..
wil
On Tue, Jun 03, 2003 at 04:51:52PM -0500, Ray Plante wrote:
> Hi all,
> 
> Thanks to everyone who provided comments on the ID proposal.  As I've 
> mentioned, I felt like it was in need of wider discussion.  And while the 
> discussion may seem to have gotten a bit chaotic, I think we can distill 
> some cogent issues.  From my reading, I believe the comments that have 
> been raised address the following questions:
> 
>   1. Should IDs carry any semantic information?
> 
>   2. Who chooses/controls what is contained in the components of a 
>      specific ID?  We specifically discussed who chooses the AuthorityID.
> 
>   3. How do IDs address the problem of transience and replication?  Do 
>      they need to do more?  
> 
> In this message I will address each in turn.  While it is true that these 
> are interrelated, I think conflating these makes it harder to make 
> progress.  I encourage others to separate these as much as possible.  
> 
> Here's my punchline ahead of time:  I think all of the concerns raised can 
> be addressed with a minor adjustment to the original proposal regarding 
> who controls AuthorityIDs.  
> 
> --------------------------------------------------------------------------
> 1. Should IDs carry any semantic information?
> --------------------------------------------------------------------------
> 
> Personally, I'm finding the myriad analogies to the book industry, email,
> and stock tickers of limited value as they too quickly veer off the mark
> and confuse the issue.  The best analogy for a VO identifier (which is so
> close, it could cease to be an analogy) is one we all understand: URLs.  
> Do URLs carry semantic content?  Sure they do: from a URL, we can often
> deduce all sorts of things about what it points to.  Is there a standard
> for how semantic meaning is encoded?  Absolutely not.  Do machines
> universally rely on interpreting the semantic content?  No.  (In general,
> the programs that do "micro-parse" the URLs are necessarily controlled by
> the same people that control the content.)  The ID proposal intends no
> more than this.
> 
> Whether or not the URL characters contain anything meaningful to anyone 
> does not affect the ability of browsers and servers to talk to each other.  
> Nevertheless, I think we can say that it is incredibly helpful that we can 
> put little messages into them that help humans remember them, copy them 
> without error, and debug the systems that use them.  This brings up a 
> related question: are URLs intended for human consumption?  The answer is, 
> no, normally not.  When hidden behind highlighted text, they can usually 
> be ignored.  Nevertheless, humans do occasionally handle them directly.  
> 
> An advantage of adopting a URI-based identifier is that it allows this same 
> flexibility in a manner that people are used to in URLs.  (The XML version 
> is equivalent in composition; however, the parseable components are tagged 
> individually to allow easier handling through XML parsers.)  Where the URL 
> analogy *potentially* breaks down is addressed in the next section.
> 
>  Q:  Should IDs carry any semantic information?
>  A:  They can, but they are not required to.  More precisely, the ID 
>      standard and its use in standard registry interfaces should not rely 
>      on it.  
> 
> Is this acceptable?
> 
> ---------------------------------------------------------------------------
> 2.  Who controls the components of an ID? 
> ---------------------------------------------------------------------------
> 
> Back in February, the NVO project generated a set of requirements for IDs; 
> one of them stated that the framework should maximize the freedom of data 
> providers to choose identifiers for resources under their control.  This 
> was the major point of discussion of the NVO telecon.  
> 
> The ID proposal intended that the AuthorityID (which would typically look 
> like a DNS name) would be strictly associated with a standard registry 
> interface.  In my mind, this was simply a mechanism to help ensure that 
> IDs in total are globally unique: once a registry's AuthorityID is 
> determined unique, the registry need only ensure that all its ResourceKeys 
> are locally unique.  Thus, the AuthorityID establishes a namespace that 
> the registry ultimately controls.  Consequently, the data provider does not 
> control the namespace *unless* they decide to run their own registry.  It 
> was assumed that most providers would run their own, so this restriction 
> would only affect a few (?) smaller providers.  
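
The uniqueness argument above can be sketched in a few lines of Python.
This is only an illustration under assumptions: the names Identifier,
Namespace and mint, and the example.org/example.net authorities, are
hypothetical and do not come from the ID proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifier:
    """Hypothetical two-part ID: a namespace plus a key inside it."""
    authority_id: str   # typically looks like a DNS name
    resource_key: str   # only needs to be unique within the namespace


class Namespace:
    """Hypothetical holder of all ResourceKeys under one AuthorityID."""

    def __init__(self, authority_id):
        self.authority_id = authority_id
        self._keys = set()

    def mint(self, resource_key):
        # Local check only: no other namespace has to be consulted.
        if resource_key in self._keys:
            raise ValueError(
                f"{resource_key!r} already used in {self.authority_id!r}")
        self._keys.add(resource_key)
        return Identifier(self.authority_id, resource_key)
```

Two namespaces can hand out the same ResourceKey and the resulting IDs
still differ, because the AuthorityID is part of the identifier.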
> 
> This is where the URL analogy breaks down.  A URL assumes that there is a
> service running on the machine with the DNS name matching the URL's
> host-id component.  The intention of the VO ID proposal was similar but a
> bit more vague: there would be a registry interface running on the
> registry machine that, given the ID, could return a resource description.
> However, that interface is not yet defined, and it was not determined if
> the ID should be automatically convertible to a service interface
> URL/handle.  
> 
> Critics of the proposal suggested that the choice of an AuthorityID, 
> which establishes a namespace, should be controlled by the registrant.
> This would allow organizations to have complete control over their own 
> namespace without having to implement any standard registry service.  If 
> we have full registries that really do contain all registered resources, 
> then we do not need the AuthorityID to be tied to the registry where the 
> resource was first registered.  
> 
> It is worth noting that regardless of who controls the AuthorityID,
> introducing a new one will always require that it be checked against a
> VO-wide registry of namespaces to determine if it has been used
> before.  Thus, revising the proposal to tie the AuthorityID to an
> organization does not change how we determine if the AuthorityID is
> already in use.  It is harder, though, to ensure that the "owner" of
> the namespace retains sole control over its use:  if a publisher
> registers some resources in a namespace with one registry and some
> with another, both registries need to know that the publisher truly
> "owns" the namespace it is attempting to refer to.  It can be done
> (e.g. with grid-based certificates).  
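
A rough Python sketch of that ownership check follows, assuming a single
VO-wide table of namespaces that every registry consults.  All names here
(NamespaceRegistry, ResourceRegistry, claim, register) are hypothetical,
and the plain owner string merely stands in for an authenticated identity
such as a certificate subject.

```python
class NamespaceRegistry:
    """Hypothetical VO-wide table mapping AuthorityID -> owner."""

    def __init__(self):
        self._owners = {}

    def claim(self, authority_id, owner):
        # A new AuthorityID is checked against all existing ones.
        current = self._owners.setdefault(authority_id, owner)
        if current != owner:
            raise ValueError(
                f"{authority_id!r} is already owned by {current!r}")

    def owner_of(self, authority_id):
        return self._owners.get(authority_id)


class ResourceRegistry:
    """Hypothetical registry that accepts records in any owned namespace."""

    def __init__(self, namespaces):
        self._namespaces = namespaces
        self._records = {}

    def register(self, publisher, authority_id, resource_key, description):
        # Assumption: 'publisher' has already been authenticated.
        if self._namespaces.owner_of(authority_id) != publisher:
            raise PermissionError(
                f"{publisher!r} does not own {authority_id!r}")
        self._records[(authority_id, resource_key)] = description
```

Because both registries ask the same table, a publisher can spread its
resources across them while nobody else can register into its namespace.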
> 
> The fundamental question, though, is: does the ID specification need
> to be locked into the registry infrastructure?  At most, all the ID
> framework needs is a way to determine who owns an AuthorityID.  If the
> standard does not lock IDs into the registry infrastructure, then we
> can potentially allow a number of implementations--either
> simultaneously or in a sequence that evolves over time--that encourage or
> enforce ID uniqueness.  This could include an implementation that
> requires the publisher run a particular registry service.
> 
>   Q: Who controls the components of an ID?
>   Original A:  the registry
>   Revised A:  the registering data provider.  Thus, the AuthorityID no 
>     longer implies the existence of any registry service specific to the 
>     AuthorityID.  The specification merely requires that AuthorityIDs
>     are uniquely associated with organizations (or individuals) that
>     own them. 
> 
> ------------------------------------------------------------------------
> 3.  How do IDs address Transience and Replication? 
> ------------------------------------------------------------------------
> 
> This issue, as Arnold points out, touches on the need for having
> persistent names that can refer to a resource in perpetuity even when
> support of the resource changes over time or the resource is replicated
> across multiple locations (see
> http://archives.us-vo.org/metadata/0762.html).  
> This is exactly what a URN (a type of URI) does.  
> 
> In my mind, the ID proposal does *not* address the use case Arnold
> described; that is, VO identifiers are not URNs.  In particular, if a
> data collection is mirrored at two different locations and thus
> accessible through interfaces with different URLs/handles, then the
> two mirrors are considered distinct and therefore have different
> resource identifiers.  VO identifiers are tied, via the AuthorityID, to
> the organization that maintains the resource they identify.  If
> the access to a resource moves to a different machine, its ID need not
> change; the resource description it points to can be updated to the
> new location.  However, if curation is transferred to a new
> organization, the ID cannot persist unless ownership of the original
> namespace is transferred in whole as well.  
> 
> A URN scheme is certainly needed; however, we also need a way of
> distinguishing mirrors.  Thus, VO identifiers should not be URNs.  
> 
> A URN system will necessarily need to build on top of both the ID
> standard as well as registry interfaces.  In particular (as Arnold
> explains), registries should be able to map a URN to a set of matching
> identifiers that are mirrors of the same resource. 
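
One way to picture that layering is a lookup table kept by a registry,
sketched below in Python.  This is an assumption-laden illustration, not a
proposed interface: MirrorIndex, add and resolve are hypothetical names,
and the URN and ID strings are made-up placeholders.

```python
from collections import defaultdict


class MirrorIndex:
    """Hypothetical registry-side map: one URN -> many VO identifiers."""

    def __init__(self):
        self._by_urn = defaultdict(set)

    def add(self, urn, identifier):
        # Each mirror keeps its own distinct ID; the URN groups them.
        self._by_urn[urn].add(identifier)

    def resolve(self, urn):
        # Unknown URNs resolve to an empty set rather than raising.
        return frozenset(self._by_urn.get(urn, ()))
```

The IDs themselves stay unaware of the URN; only the index knows which
identifiers are mirrors of the same resource.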
> 
>  Q: How do IDs address the problem of transience and replication?  
>  A: They do not.  Replicated resources have different IDs; one must
>     consult the resources' metadata to know that they are mirrors.
>     IDs may persist when the resource moves around within its own
>     namespace; however, they cannot persist when the resource is
>     curated by a new organization with a different namespace.
> 
>  Q: Do they need to do more?
>  A: No.  A URN system should be built on top of the ID standard and
>     registry interfaces.  
> 
> ---------------------------------------------------------------------
> In conclusion, I am recommending that an AuthorityID be "owned" and 
> controlled by a registering organization, but that the mechanism for 
> encouraging or enforcing that control not be part of the ID specification.  
> 
> My apologies for the length of this installment, but I hope it will help 
> focus our discussion. 
> 
> cheers,
> Ray