Discovering Data Collections Within Services Note version 1.0

Mon Feb 8 14:53:20 CET 2016

Hi Markus & the registry,

>> * aux in full-capability approach:
>> Did I get it right that you propose the full-capability approach with
>> modifications, meaning that all the properties of a service (capability)
>> will be replicated for each record of a dataset that uses this service?
> 
> Not necessarily, and at least for the TAP case I propose untyped
> capabilities, as given in the examples.

Okay, so it won't be the full service record that shall be given with
the data, but only its standardId, a link (URL) to its endpoint and a
publisher? That sounds much better.

> with just a single record still being preferred.

What happens if someone publishes, say, an SIA service for one dataset,
using one combined service+dataset record for the registry, and later
on another two or three datasets are added for the same service?
Wouldn't that be the point when one goes back and splits the
original service+dataset record into two records, in order to be able to
just link to the service from each new data collection?

So wouldn't that mean: Unless I am really sure that there will be just
one dataset for one service, I should rather make separate records, one
for the service and one for the dataset (with aux-cap.), in case I need
to add additional datasets later on?

That second or third dataset could also be a new release/updated
version. Or is there some other infrastructure already set up for
versions of datasets?

>> * aux in standardId (section 2.1):
>> Does it have to be inserted into the standardId? It somewhat obscures
>> the standardId and looks to me like "misusing" it.
> 
> We-ell, I don't agree that this is somehow "soiling" the standard id.
> Technically, a standard can define any number of "terms" (that's the
> "key" element from StandardsRegExt).  These terms can refer to all
> kind of things: endpoint types, output formats, whatever.
> 
> What we do here is define an endpoint type, i.e., a TAP endpoint
> whose metadata are defined somewhere else.  I'd claim this is fairly
> well along the lines of both StandardsRegExt and the usage that
> capability/@standardID has found in VO practice.
> 
> But even if it were a minor bending of the rules (it definitely is
> for the transition-phase identifiers in section 3), adding another
> attribute has the big disadvantage that legacy clients will ignore
> this -- this means that a TAP validator might re-validate VizieR
> 15000 times.  Well, this particular service would get fixed (or
> blocked by VizieR) fairly quickly, but don't forget that there's
> quite a bit of infrastructure using the Registry, and so smooth
> transitions are a major concern.  Anything that changes the
> behaviour of Registry components towards existing clients carries a
> massive price tag.

Hm, from my (admittedly probably very naive) point of view it
still looks like a "hack". I understand the wish not to break
existing services or validators, but would you really want to have that
aux thing inside the standardId in the long run?
Roughly how many legacy clients (validators etc.) are we talking about here?
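
To make the argument quoted above concrete, here is a minimal Python
sketch of why a legacy client that compares standardIds as whole strings
would skip an auxiliary capability. It assumes the auxiliary form is the
standard's identifier with an "#aux" key appended; the function names are
hypothetical and not taken from any existing client.

```python
# Assumption: the auxiliary capability uses the standard's id plus "#aux".
TAP_STD = "ivo://ivoa.net/std/TAP"


def split_standard_id(standard_id):
    """Split a standardId into (base identifier, key after '#')."""
    base, _, key = standard_id.partition("#")
    return base, key


def legacy_is_tap(standard_id):
    """How a legacy client might select TAP services: exact match only."""
    return standard_id.lower() == TAP_STD.lower()


def is_auxiliary(standard_id):
    """How an aux-aware client could recognise an auxiliary capability."""
    return split_standard_id(standard_id)[1].lower() == "aux"


main_id = TAP_STD
aux_id = TAP_STD + "#aux"
```

With these definitions a legacy validator would pick up `main_id` but not
`aux_id`, which is the behaviour a separate new attribute would not give.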

>> * multiple auxiliary capabilities [...]
>> So I would expect a relatedResource entry at servedBy for the
>> corresponding TAP service (which is already there) and for the SIA
>> service (which is not given).
> 
> But it is (there is a problem in that the relationship to the TAP service
> is given twice, which is because my machinery doesn't realise the
> ObsCore and the TAP services are the same; I'll probably fix that).
> But the record correctly declares a served-by each to
> 
> ivo://org.gavo.dc/tap and ivo://org.gavo.dc/lensunion/q/im

Oh, I see, I just expected something explicitly containing "SIA"
somewhere, and so I missed that.

>> or (even better?) one should add an attribute (e.g. the standardId?) to
>> each relatedResource that makes it clearer if and which type of
> 
> Yes, it is a bit ugly that clients need to dereference the
> references to the related resources to figure out which of them is
> the main record, but the RegTAP query patterns are reasonable,
> whereas adding something to relatedResource would again be a problem
> in terms of migrating existing infrastructure.

What about adding the ivo-Id of the service (e.g.
ivo://org.gavo.dc/lensunion/q/im) to the aux-capabilities instead, in
addition to the accessURL?
That would give a direct link, and no dereferencing would be needed.
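
As an illustration of the dereferencing step discussed here, a minimal
Python sketch follows. The dictionary is a toy stand-in for a registry
lookup, not real registry content: the two identifiers are the ones
mentioned above, while the standardIds assigned to them and the function
name are assumptions made for the example.

```python
SIA_STD = "ivo://ivoa.net/std/SIA"
TAP_STD = "ivo://ivoa.net/std/TAP"

# Toy registry: service identifier -> standardIds of its capabilities.
# (Illustrative assumption, not fetched from any real registry.)
REGISTRY = {
    "ivo://org.gavo.dc/tap": [TAP_STD],
    "ivo://org.gavo.dc/lensunion/q/im": [SIA_STD],
}


def services_by_standard(served_by, registry):
    """Dereference each served-by target and group targets by standardId."""
    result = {}
    for ivoid in served_by:
        for std in registry.get(ivoid, []):
            result.setdefault(std, []).append(ivoid)
    return result


# The served-by relations declared by the data collection record.
served_by = ["ivo://org.gavo.dc/tap", "ivo://org.gavo.dc/lensunion/q/im"]
found = services_by_standard(served_by, REGISTRY)
```

Only after the lookup does the client learn which of the related
resources is the SIA service; carrying the service identifier in the
auxiliary capability itself would make that lookup unnecessary.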

> Well, updating 15000 records is not as bad as having to create
> another 15000.  Even worse, it'd have to be more or less
> instantaneous to give the client writers a chance to maintain their
> sanity.
> 
> And then all legacy clients would immediately break.
> 
> I'm not saying that's totally out of the question forever -- perhaps
> one could keep some "legacy" searchable registries at the state
> before the flag day for a couple of years.  But I think we'd have to
> have *very* tangible and substantial benefits to make that
> worthwhile, and I cannot see them in this case.

Hm, I would say: better clean up now than later on. Later we would have
even more records to repair.
(But maybe "later" never happens or a completely different approach with
a lot of "substantial benefits" will come along. Who knows. :-)
So maybe then it is better to not break existing clients etc.)

> As to cleanliness and elegance -- well, that's for a good part in the
> eye of the beholder.  To me, cleanliness and elegance in the Registry
> by now are largely measured in "how hard is it to get the registry
> operators to actually do it?"

It's a pity that it has to be reduced to that. But if that's the case,
then I can see no alternatives to your approach.

>> 3. more complicated queries
> 
> No, this is actually a very conceptual concern, and *that* was what
> finally convinced me that even migrating in that direction wouldn't
> fly.  I was sure I had nicely laid that out at a recent interop talk,
> but I can't seem to find that now.  Well, Fig. 4 from
> http://ads.ari.uni-heidelberg.de/abs/2015A%26C....11...91D will do,
> too: The problem when you do this is that there are essentially four
> classes of tables (or equivalently, metadata) when it comes to joins
> with that scheme.  This is what made me decide the split-metadata
> approach won't fly -- there's too much to explain before people can
> write queries.

It seems to me that there is still much to explain with this approach.
It took me quite some time to go through your note ...

> Truth be told, my instinct was a bit like yours for a long while --
> let's go to a Registry with a clear separation of data collections
> and services?  After I discovered how ugly the queries become unless
> we totally re-built the Registry, I now have my doubts.  Perhaps it's
> for the better that our forefathers built the Registry the way they
> did.

Not really knowing how much effort that would be, I would even vote for
rebuilding the registry, and making a clean separation between datasets
and services. It sounds like this would be a lot of effort now, invested
in a cleaner system for the future.
But of course people would have to be willing to invest in this
effort, and rewrite their systems and clients where necessary.

Cheers,

Kristin

-- 
-------------------------------------------------------
Dr. Kristin Riebe
E-Science & GAVO

Leibniz-Institut für Astrophysik Potsdam (AIP)
An der Sternwarte 16, D-14482 Potsdam
-------------------------------------------------------