Discovering Data Collections Within Services Note version 1.0

Mon Feb 1 15:02:37 CET 2016

Hi Kristin,

On Thu, Jan 28, 2016 at 01:05:15PM +0100, Kristin Riebe wrote:
> * aux in full-capability approach:
> Did I get it right that you propose the full-capability approach with
> modifications, meaning that all the properties of a service (capability)
> will be replicated for each record of a dataset that uses this service?

Not necessarily, and at least for the TAP case I propose untyped
capabilities, as given in the examples.

> So if I have a service A that serves datasets 1 and 2, then I will have
> 3 records in the registry:
> 
> 1) serviceA and its full description

...containing metadata about the service (e.g., "GAVO TAP service",
operated by me, at ARI...) and the full TAPRegExt capability
(declaring user defined functions, etc).

> 2) dataset1 with its description
> 	+ full description of serviceA (with aux added)

...describing the dataset (e.g., "PPMXL catalog", created by Röser, S.,
et al., having columns ra, dec, pmra, pmdec...), plus a relationship
to serviceA, but no, not the full capability (see also the examples
in the appendix).

> 3) dataset2 with its description
> 	+ full description of serviceA (with aux added)
> 
> That looks like a lot of duplicate information to me.

No, there shouldn't be any, because the records really describe
different things.  Well, the publisher should typically be identical,
of course.
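
To make the three-record case concrete, here is a rough Python sketch.
The dict layout, the field names and the ivo://example.org identifiers
are invented for illustration; real records are VOResource XML, and
the examples in the Note's appendix are authoritative.

```python
# Hypothetical, simplified stand-ins for registry records.
# All ivo://example.org identifiers and all field names are made up.
TAP = "ivo://ivoa.net/std/TAP"

service_a = {
    "ivoid": "ivo://example.org/serviceA",
    "title": "Example TAP service",
    "capabilities": [
        # Full capability: this is where the TAPRegExt metadata
        # (user defined functions, etc.) would live.
        {"standard_id": TAP, "full_metadata": True},
    ],
    "served_by": [],
}


def make_dataset(ivoid, title, service):
    """Build a dataset record that points to its service."""
    return {
        "ivoid": ivoid,
        "title": title,
        "capabilities": [
            # Aux capability: only says "there is a TAP endpoint";
            # the details stay in the service record.
            {"standard_id": TAP + "#aux", "full_metadata": False},
        ],
        # The relationship to the record carrying the full capability.
        "served_by": [service["ivoid"]],
    }


dataset1 = make_dataset("ivo://example.org/dataset1", "Dataset 1", service_a)
dataset2 = make_dataset("ivo://example.org/dataset2", "Dataset 2", service_a)

for rec in (service_a, dataset1, dataset2):
    print(rec["ivoid"], [c["standard_id"] for c in rec["capabilities"]])
```

Only service_a carries the full capability; the two dataset records
just carry the aux marker and the served-by link.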

> And if I have a service B that only serves one dataset 3, then I could
> have two records:
> 
> 1) serviceB and its full description
> 2) dataset3 with its description
> 	+ full description of serviceB (with aux added)
> 
> OR just one record:
> 
> 1) dataset3 with its description
> 	+ full description of serviceB (and NO aux)

Right (except for the full description of serviceB in (2) of the
first alternative; that's of course again just an aux capability) --
with just a single record still being preferred.

> So for the benefit of not necessarily having extra service-records for
> those that have only one dataset (last example), you would be willing to
> duplicate service information within each dataset description?
> That doesn't sound like a good bargain to me.

It wouldn't be, if it weren't that 99% of the records in the current
registry follow the second pattern, so that switching to the first
would double the entries in the current registry (not to speak of
takeup problems).

> * aux in standardId (section 2.1):
> Does it have to be inserted into the standardId? It somewhat obscures
> the standardId and looks to me like "misusing" it.
> Couldn't we just add another attribute to capability (e.g. "priority" or
> "rank" or so) with values "main" or "aux"; or an "aux"-attribute (flag)
> with values 1 or 0?
> Yes, that would mean that clients querying for the main TAP-records
> cannot only rely on the new standardId (as in sec. 2.2 of the note).
> They would have to check the additional attribute instead.
> But it would keep the standardId clean.

We-ell, I don't agree that this is somehow "soiling" the standard id.
Technically, a standard can define any number of "terms" (that's the
"key" element from StandardsRegExt).  These terms can refer to all
kinds of things: endpoint types, output formats, whatever.

What we do here is define an endpoint type, i.e., a TAP endpoint
whose metadata are defined somewhere else.  I'd claim this is fairly
well along the lines of both StandardsRegExt and the usage that
capability/@standardID has found in VO practice.

But even if it were a minor bending of the rules (it definitely is
for the transition-phase identifiers in section 3), adding another
attribute has the big disadvantage that legacy clients will ignore
this -- this means that a TAP validator might re-validate VizieR
15000 times.  Well, this particular service would get fixed (or
blocked by VizieR) fairly quickly, but don't forget that there's
quite a bit of infrastructure using the Registry, and so smooth
transitions are a major concern.  Anything that changes the
behaviour of Registry components towards existing clients carries a
massive price tag.
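
The legacy-client argument can be sketched in a few lines of Python.
Everything here is a toy: the record layout, the "aux" flag name and
the ivo://example.org identifiers are assumptions, and a real legacy
client would run a RegTAP or similar query rather than filter a list.

```python
TAP = "ivo://ivoa.net/std/TAP"


def records_aux_in_standard_id(n_datasets):
    """One main TAP record plus n dataset records marked via '#aux'."""
    main = [{"ivoid": "ivo://example.org/tap", "standard_id": TAP}]
    aux = [
        {"ivoid": f"ivo://example.org/ds{i}", "standard_id": TAP + "#aux"}
        for i in range(n_datasets)
    ]
    return main + aux


def records_aux_as_attribute(n_datasets):
    """Same records, but aux-ness kept in a separate (hypothetical) flag."""
    main = [{"ivoid": "ivo://example.org/tap", "standard_id": TAP, "aux": False}]
    aux = [
        {"ivoid": f"ivo://example.org/ds{i}", "standard_id": TAP, "aux": True}
        for i in range(n_datasets)
    ]
    return main + aux


def legacy_tap_discovery(records):
    """A client written before the Note: it only looks at the standard id."""
    return [r["ivoid"] for r in records if r["standard_id"] == TAP]


print(len(legacy_tap_discovery(records_aux_in_standard_id(15000))))
print(len(legacy_tap_discovery(records_aux_as_attribute(15000))))
```

With the marker in the standard id the unmodified client finds 1
service; with a separate flag it finds 15001 apparent services, all
pointing to the same endpoint.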

> * multiple auxiliary capabilities - nested??
> Sec. 2.1, p. 6, bottom: "records may have multiple auxiliary
> capabilities, and therefore not every served-by record declared by a
> resource necessarily corresponds to the main service for a given
> auxiliary capability."
> I have trouble understanding this.
> So there may be multiple aux-capabilities given, then shouldn't there be
> multiple relatedResource-entries for the servedBy-relationship as well?

Sure there should, and as far as I can see, that's what the Note
says, and that's what the examples show, no?

> So that I can find the main-entry for each aux-capability based on that?
> Taking your example from
> http://dc.zah.uni-heidelberg.de/oai.xml?verb=GetRecord&metadataPrefix=ivo_vor&identifier=ivo://org.gavo.dc/apo/res/apo/frames
> there exist two aux. capabilities:
> - ivo://ivoa.net/std/SIA#aux
> - ivo://ivoa.net/std/TAP#aux
> So I would expect a relatedResource entry at servedBy for the
> corresponding TAP service (which is already there) and for the SIA
> service (which is not given).

But it is (there is a problem in that the relationship to the TAP service
is given twice, which is because my machinery doesn't realise the
ObsCore and the TAP services are the same; I'll probably fix that).
But the record correctly declares a served-by each to

ivo://org.gavo.dc/tap and ivo://org.gavo.dc/lensunion/q/im

> If you can omit links to main services, isn't that contrary to the
> purpose of the servedBy-relationships?

It would be, yes.

> Maybe you just wanted to state the point, that for multiple
> aux-capabilities given, you have multiple relatedResources and it is not
> immediately clear, which related Resource belongs to which
> aux-capability. I guess one could infer that from the name of the link

Right.

> or (even better?) one should add an attribute (e.g. the standardId?) to
> each relatedResource that makes it clearer if and which type of

Yes, it is a bit ugly that clients need to dereference the
references to the related resources to figure out which of them is
the main record, but the RegTAP query patterns are reasonable,
whereas adding something to relatedResource would again be a problem
in terms of migrating existing infrastructure.
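
For illustration, the dereferencing step could look roughly like this
in plain Python (in practice this would be a RegTAP query).  The
identifiers are the ones from the example above; the dict layout is
invented, and I am assuming here that ivo://org.gavo.dc/lensunion/q/im
carries the SIA capability.

```python
# Toy in-memory "registry"; the layout is hypothetical.
REGISTRY = {
    "ivo://org.gavo.dc/apo/res/apo/frames": {
        "capabilities": [
            "ivo://ivoa.net/std/SIA#aux",
            "ivo://ivoa.net/std/TAP#aux",
        ],
        "served_by": [
            "ivo://org.gavo.dc/tap",
            "ivo://org.gavo.dc/lensunion/q/im",
        ],
    },
    "ivo://org.gavo.dc/tap": {
        "capabilities": ["ivo://ivoa.net/std/TAP"],
        "served_by": [],
    },
    # Assumption: this is the record with the full SIA capability.
    "ivo://org.gavo.dc/lensunion/q/im": {
        "capabilities": ["ivo://ivoa.net/std/SIA"],
        "served_by": [],
    },
}


def main_services(ivoid, registry=REGISTRY):
    """Map each aux capability of a record to its main service record(s).

    A served-by target counts as "main" for an aux capability if it
    declares the same standard id without the '#aux' suffix.
    """
    record = registry[ivoid]
    result = {}
    for std_id in record["capabilities"]:
        if not std_id.endswith("#aux"):
            continue
        main_id = std_id[: -len("#aux")]
        result[std_id] = [
            related
            for related in record["served_by"]
            if main_id in registry.get(related, {}).get("capabilities", [])
        ]
    return result


print(main_services("ivo://org.gavo.dc/apo/res/apo/frames"))
```

The client has to look into each related record once, but nothing in
relatedResource itself needs to change.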

> * Service enumeration constraining the version (Section 2.2)
> I would not say that it is very unlikely to query for all available TAP
> or SIA services, all versions of it. I expect that clients try to be
> downward compatible, supporting as many versions as possible. So if I
> have a TAP-client, I expect that client would want to show all the
> TAP-services (regardless of the version or including all the versions it
> supports, e.g. TAP-1.0, TAP-1.1. TAP-2.0).

Well, so they'd match against cap-_._ -- I'd say that's ok.

> * Arguments against split-metadata approach (Sec. 1.2.2)
> As you state there, it would be rather elegant to keep data collections
> and services separated, in different registry records, and link them
> with each other. I very much like that idea. But you have some arguments
> against that:
> 
> 1. hard to migrate
> => Whatever the new approach will be, the records will have to be
> updated. Maybe this approach is harder, but I think it could be worth
> the effort.

Well, updating 15000 records is not as bad as having to create
another 15000.  Even worse, it'd have to be more or less
instantaneous to give the client writers a chance to maintain their
sanity.

And then all legacy clients would immediately break.

I'm not saying that's totally out of the question forever -- perhaps
one could keep some "legacy" searchable registries at the state
before the flag day for a couple of years.  But I think we'd have to
have *very* tangible and substantial benefits to make that
worthwhile, and I cannot see them in this case.

> 2. duplication of entries for services with one data set (1:1)
> Couldn't one make a script that goes over all registry records of
> datasets with just one service and duplicate them, turning one of the
> records into a service-record and the other into the dataset-record,
> with a link to the service?

Well, if there were just one registry, yes, that might be possible
(although separating service from resource metadata automatically is
at least error-prone).  But we have lots of publishing registries out
there, some of which aren't terribly well maintained, and they are
based on several different architectures -- so, no, it's not that
simple in a distributed environment.

> Yes, that would double the entries for these services, but what's a few
> ten-thousand lines more in a registry table compared to being cleaner
> and more elegant?

Ah, the registry is not just one table, it's currently 13 tables, so
you underestimate the effort.  But that part would still be largely
done by the computer, so it's not the real problem.

As to cleanliness and elegance -- well, that's for a good part in the
eye of the beholder.  To me, cleanliness and elegance in the Registry
by now are largely measured in "how hard is it to get the registry
operators to actually do it?"

> 3. more complicated queries
> There's usually a relational databases behind the registry, is there
> not? So joins of dataset and service records shouldn't be that hard.
> One could even create a database view of the joined records, which in a
> sense could imitate the full-capability approach.

No, this is actually a very conceptual concern, and *that* was what
finally convinced me that even migrating in that direction wouldn't
fly.  I was sure I had nicely laid that out at a recent interop talk,
but I can't seem to find that now.  Well, Fig. 4 from
http://ads.ari.uni-heidelberg.de/abs/2015A%26C....11...91D will do,
too: The problem when you do this is that there are essentially four
classes of tables (or equivalently, metadata) when it comes to joins
with that scheme.  This is what made me decide the split-metadata
approach won't fly -- there's too much to explain before people can
write queries.

Truth be told, my instinct was a bit like yours for a long while --
let's go to a Registry with a clear separation of data collections
and services.  After I discovered how ugly the queries become unless
we totally rebuild the Registry, I now have my doubts.  Perhaps it's
for the better that our forefathers built the Registry the way they
did.

Cheers,

          Markus