Discovering Data Collections Within Services Note version 1.0

Fri Feb 12 11:53:04 CET 2016

Good morning Registry,

On Mon, Feb 08, 2016 at 02:53:20PM +0100, Kristin Riebe wrote:
> > Markus Demleitner wrote:
> > with just a single record still being preferred.
> 
> What happens if someone published a, say, SIA service for one dataset,
> using one combined service+dataset record for the registry. And later
> on, another two or three datasets are added for the same service --
> wouldn't that be the point when one goes back and rather splits the
> original service+dataset record into 2 records, in order to be able to
> just make a link to the service for each new data collection?

IVOA Identifiers has this to say on that:

  Furthermore, the identifier SHOULD refer to at most one resource over
  all time; that is, IVOIDs should not be reused for unrelated
  resources.  Note that a resource may potentially be dynamic (such as
  'weather at telescope' or 'current version of the standard') -- here,
  there is a conceptually unique resource, even though the content of
  it may change in time.

meaning -- there's some leeway.  But really, if you change a service
to move from a single-instrument archive (with the respective metadata
like "observed at instrument", "PI is XY", and a matching
description) to a thematic archive (with the respective
metadata like "multiple instruments", "publisher is creator", and a
description of the scope of the collection archive), I'd say it's a
different resource in almost all cases, so it should get a new
registry record and hence a new identifier.

The old record probably still wouldn't go; its SIA capability would
just be changed to an aux capability, and there would be a new
relationship to the thematic archive.
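
To make that transition concrete, here is a minimal sketch of the
record manipulation, assuming records are modelled as plain Python
dicts.  The dict layout, the function name demote_to_auxiliary, the
example.org identifiers, the "#aux" fragment and the "served-by"
relationship type are illustrative assumptions, not something a
registry toolkit prescribes.

```python
import copy

# Assumption: auxiliary capabilities are flagged by appending this
# fragment to the standardID of the protocol in question.
AUX_FRAGMENT = "#aux"


def demote_to_auxiliary(record, standard_id, main_ivoid, main_name):
    """Return a copy of record whose capability for standard_id has
    become auxiliary and which points to the new main record."""
    demoted = copy.deepcopy(record)  # leave the caller's record untouched
    for cap in demoted["capabilities"]:
        # IVOIDs compare case-insensitively
        if cap.get("standardID", "").lower() == standard_id.lower():
            cap["standardID"] = cap["standardID"] + AUX_FRAGMENT
    # Assumption: "served-by" is the relationship type used to link a
    # data collection to the record of the service exposing it.
    demoted.setdefault("relationships", []).append({
        "type": "served-by",
        "related_id": main_ivoid,
        "related_name": main_name,
    })
    return demoted


# Hypothetical single-instrument record that used to carry the main
# SIA capability itself.
old_record = {
    "ivoid": "ivo://example.org/instrument/sia",
    "capabilities": [{
        "standardID": "ivo://ivoa.net/std/SIA",
        "accessURL": "http://example.org/thematic/sia?",
    }],
}

new_record = demote_to_auxiliary(
    old_record,
    "ivo://ivoa.net/std/SIA",
    "ivo://example.org/thematic/sia",  # hypothetical thematic archive
    "Thematic archive SIA",
)
```

The identifier of the old record stays the same; only its capability
and its relationships change.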

> So wouldn't that mean: Unless I am really sure that there will be just
> one dataset for one service, I should rather make separate records, one
> for the service and one for the dataset (with aux-cap.), in case I need
> to add additional datasets later on?

No, because it's fairly easy to turn the old "primary" record into an
auxiliary one if that's necessary.

> That second or third dataset could also be a new release/updated
> version. Or is there some other infrastructure already set up for
> versions of datasets?

Well, that's a bit orthogonal to the question here.  In principle,
it's possible to register each release separately and then switch the
associated discovery service (the main capability) between, say, a
"current" and a "known-broken-archived" release, so this would help
flexibly support all kinds of schemes that keep multiple versions of
data collections alive.  But exactly because I think the proposed
discovery scheme can accommodate almost all ways to do this, this is
the wrong place to figure out what's a good idea in that particular
business and what is not.

> > transitions are a major concern.  Anything that changes the
> > behaviour of Registry components towards existing clients carries a
> > massive price tag.
> 
> Hm, still, from my (admittedly probably very naive) point of view it
> still looks like a "hack" to me. I understand the wish to not break
> existing services or validators, but would you really want to have that
> aux-thing inside the standardId in the long run?

Yes -- I maintain this is a completely valid, and indeed intended,
use of what StandardsRegExt introduced the standard keys for.
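
As a rough illustration of why the key in the standardID is gentle on
existing clients, here is a small sketch.  It assumes the auxiliary
marker is the fragment "aux" on the protocol's standardID and that a
legacy client selects capabilities by comparing the complete
standardID; the function names are made up for this example.

```python
def split_standard_id(standard_id):
    """Split a standardID into (base identifier, standard key).

    The key is whatever follows the first '#'.  Everything is
    lowercased because IVOIDs compare case-insensitively.
    """
    base, _, key = standard_id.lower().partition("#")
    return base, key


def is_auxiliary(standard_id):
    # Assumption: auxiliary capabilities carry the key "aux".
    return split_standard_id(standard_id)[1] == "aux"


def legacy_matches(standard_id, wanted="ivo://ivoa.net/std/sia"):
    """What a hypothetical legacy client does: compare the full
    standardID, so anything with a key is simply not selected."""
    return standard_id.lower() == wanted


main_cap = "ivo://ivoa.net/std/SIA"
aux_cap = "ivo://ivoa.net/std/SIA#aux"
```

Under these assumptions a legacy client keeps finding main_cap and
never sees aux_cap, while a newer client can pick out aux_cap via
is_auxiliary.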

> About how many legacy clients (validators etc.) are we talking here?

Hard to guess.  A handful within applications, another handful in
infrastructure, perhaps?  But of course that stuff is deployed across
thousands of machines, and getting all these installations upgraded
in a fairly short time is something I can't really see happening.

> >> * multiple auxiliary capabilities [...]
> > Yes, it is  a bit ugly that clients need to dereference the
> > references to the related resources to figure out which of them is
> > the main record, but the RegTAP query patterns are reasonable,
> > whereas adding something to relatedResource would again be a problem
> > in terms of migrating existing infrastructure.
> 
> What about adding the ivo-Id for the services (e.g.
> ivo://org.gavo.dc/lensunion/q/im) to the aux-capabilities instead? In
> addition to accessURL?
> Then this could give a direct link and no dereferencing is needed.

You're right; this was the solution we floated in Sesto, and
essentially decided for in Sydney.  Now, when I started writing the
schema in November, something immediately started shouting "don't do
this" somewhere in my head.  From a mail I sent to Mark back then:

  But now that I'm actually writing the schema, I wonder if I'm really
  overengineering the thing.  It simply doesn't feel right to define a
  new type to do something that is *almost* already done by
  relationship and actually re-uses the vr:ResourceName type used
  there, too.

There's also a minor technical point: What we would *really* want
here is relationships between *capabilities*, i.e., not only should
the source be a capability element, so should, in order to be true to
the theory, the target.  We don't really have a precedent for
referencing into resource records (except for StandardsRegExt keys,
which are in a whole different ballpark in that respect), and
building one for something that in my view is definitely among the 20%
of functionality that takes 80% of the work seemed unwise to me.

So, yes, collection discovery as proposed here is an 80% solution.
But it does solve these 80% with 20% of the effort, and my feeling so
far is that the 80% solved probably are all anyone is ever going to
want to use.
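
For what it's worth, the dereferencing this scheme asks of clients
can be sketched in a few lines.  This is an in-memory toy, not a
RegTAP query: it assumes records are plain dicts, that the registry is
a dict keyed by IVOID, that the link is a "served-by" relationship,
and that the main capability shares the access URL of the auxiliary
one.  All identifiers below are hypothetical.

```python
def find_main_record(collection, registry):
    """Return the IVOID of the record carrying the main capability
    for one of collection's auxiliary capabilities, or None."""
    # (base standardID, access URL) pairs of the aux capabilities
    wanted = set()
    for cap in collection["capabilities"]:
        base, _, key = cap["standardID"].lower().partition("#")
        if key == "aux":
            wanted.add((base, cap["accessURL"]))

    for rel in collection.get("relationships", []):
        if rel["type"] != "served-by":
            continue
        target = registry.get(rel["related_id"])  # the dereferencing step
        if target is None:
            continue
        for cap in target["capabilities"]:
            base, _, key = cap["standardID"].lower().partition("#")
            if key == "" and (base, cap["accessURL"]) in wanted:
                return target["ivoid"]
    return None


service = {
    "ivoid": "ivo://example.org/archive/sia",
    "capabilities": [{
        "standardID": "ivo://ivoa.net/std/SIA",
        "accessURL": "http://example.org/sia?",
    }],
}
collection = {
    "ivoid": "ivo://example.org/survey-a",
    "capabilities": [{
        "standardID": "ivo://ivoa.net/std/SIA#aux",
        "accessURL": "http://example.org/sia?",
    }],
    "relationships": [
        {"type": "related-to", "related_id": "ivo://example.org/other"},
        {"type": "served-by", "related_id": "ivo://example.org/archive/sia"},
    ],
}
registry = {r["ivoid"]: r for r in (service, collection)}

main_ivoid = find_main_record(collection, registry)
```

Relationships pointing to records the registry does not know, or to
records without a matching main capability, are skipped.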

My proposal at this point is: The current Note is easy and fairly
cheap to try out, and I hope the major TAP operators will push out
such records fairly soon (I'll push out some more too, soon).  If we
really see that some important use case's requirements are clumsy to
satisfy with this, there's no big damage done if a (on the VOResource
level) minor correction becomes necessary later.

> > As to cleanliness and elegance -- well, that's for a good part in the
> > eye of the beholder.  To me, cleanliness and elegance in the Registry
> > by now are largely measured in "how hard is it to get the registry
> > operators to actually do it?"
> 
> It's a pity that it has to be reduced to that. But if that's the case,
> then I can see no alternatives to your approach.

"For a good part" isn't necessarily "has to be reduced to that".  But
well, certainly efficiency is an unashamed element of elegance, no?

Cheers,

         Markus