Discovering Data Collections Within Services Note version 1.0

Thu Jan 28 13:05:15 CET 2016

Hi Markus,

I don't have much experience with the registry, so I have a few
questions/remarks on your note. I apologize for the length of this mail
- it got much longer than initially intended.

* aux in full-capability approach:
Did I get it right that you propose the full-capability approach with
modifications, meaning that all the properties of a service (capability)
will be replicated for each record of a dataset that uses this service?
With the addition that capability-descriptions within data records will
get the "aux" added, but only if there is a separate record for that
service?
(Except for TAP, for which the full table description shall only be
available via the service-record.)

So if I have a service A that serves datasets 1 and 2, then I will have
3 records in the registry:

1) serviceA and its full description
2) dataset1 with its description
	+ full description of serviceA (with aux added)
3) dataset2 with its description
	+ full description of serviceA (with aux added)

That looks like a lot of duplicate information to me.

And if I have a service B that only serves one dataset 3, then I could
have two records:

1) serviceB and its full description
2) dataset3 with its description
	+ full description of serviceB (with aux added)

OR just one record:

1) dataset3 with its description
	+ full description of serviceB (and NO aux)

Is that correct?
So for the benefit of not necessarily having extra service-records for
those that have only one dataset (last example), you would be willing to
duplicate service information within each dataset description?
That doesn't sound like a good bargain to me.
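
To make the counting concrete, here is a minimal Python sketch of how I
read the proposal. All names in it (the `served` mapping, the function,
the "#aux" suffix used as a marker) are my own assumptions for
illustration, not anything taken from the note or a real registry.

```python
# Hypothetical model: which datasets each service serves.
served = {"serviceA": ["dataset1", "dataset2"], "serviceB": ["dataset3"]}

def full_capability_records(served, with_own_record):
    """Return (record, embedded capability) pairs under my reading of the
    full-capability approach. `with_own_record` is the set of services
    that get a separate service record."""
    records = []
    for service, datasets in served.items():
        has_record = service in with_own_record
        if has_record:
            # The service record carries the main capability.
            records.append((service, service))
        for dataset in datasets:
            # Each data record repeats the capability; it is marked
            # "aux" only if a separate service record exists.
            marker = "#aux" if has_record else ""
            records.append((dataset, service + marker))
    return records

all_separate = full_capability_records(served, {"serviceA", "serviceB"})
only_a = full_capability_records(served, {"serviceA"})

aux_copies = sum(1 for _, cap in all_separate if cap.endswith("#aux"))
print(len(all_separate), "records,", aux_copies, "repeated capability descriptions")
print(len(only_a), "records if serviceB has no record of its own")
```

With separate records for both services this gives five records, three
of which repeat a service description; dropping the serviceB record
saves exactly one record.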

I would very much favour the split-data approach described in 1.2.2. But
more on that below.

* aux in standardId (section 2.1):
Does it have to be inserted into the standardId? It somewhat obscures
the standardId and looks to me like "misusing" it.
Couldn't we just add another attribute to capability (e.g. "priority" or
"rank" or so) with values "main" or "aux"; or an "aux"-attribute (flag)
with values 1 or 0?
Yes, that would mean that clients querying for the main TAP records
can no longer rely on the new standardId alone (as in sec. 2.2 of the
note). They would have to check the additional attribute as well.
But it would keep the standardId clean.
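
A small sketch of the difference from the client's point of view,
assuming capabilities are available as simple Python dictionaries. The
`ivo://example/...` identifiers and the `aux` key are hypothetical; only
the TAP standardId and the "#aux" fragment come from the note.

```python
TAP = "ivo://ivoa.net/std/TAP"

# Variant from the note: the marker is part of the standardId.
caps_fragment = [
    {"record": "ivo://example/tap", "standardId": TAP},
    {"record": "ivo://example/dataset1", "standardId": TAP + "#aux"},
]

# Variant suggested here: clean standardId plus a separate flag.
caps_attribute = [
    {"record": "ivo://example/tap", "standardId": TAP, "aux": False},
    {"record": "ivo://example/dataset1", "standardId": TAP, "aux": True},
]

def main_records_by_fragment(caps, standard_id):
    # One condition: the standardId alone tells main from aux.
    return [c["record"] for c in caps if c["standardId"] == standard_id]

def main_records_by_attribute(caps, standard_id):
    # Two conditions: the client must also look at the flag.
    return [c["record"] for c in caps
            if c["standardId"] == standard_id and not c["aux"]]

print(main_records_by_fragment(caps_fragment, TAP))
print(main_records_by_attribute(caps_attribute, TAP))
```

Both variants return the same main record; the attribute version costs
the client one extra condition per query.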

* multiple auxiliary capabilities - nested??
Sec. 2.1, p. 6, bottom: "records may have multiple auxiliary
capabilities, and therefore not every served-by record declared by a
resource necessarily corresponds to the main service for a given
auxiliary capability."
I have trouble understanding this.
So there may be multiple aux-capabilities given, then shouldn't there be
multiple relatedResource-entries for the servedBy-relationship as well?
So that I can find the main-entry for each aux-capability based on that?
Taking your example from
http://dc.zah.uni-heidelberg.de/oai.xml?verb=GetRecord&metadataPrefix=ivo_vor&identifier=ivo://org.gavo.dc/apo/res/apo/frames
there exist two aux. capabilities:
- ivo://ivoa.net/std/SIA#aux
- ivo://ivoa.net/std/TAP#aux
So I would expect a relatedResource entry at servedBy for the
corresponding TAP service (which is already there) and for the SIA
service (which is not given).

If you can omit links to main services, isn't that contrary to the
purpose of the servedBy-relationships?
Maybe you just wanted to make the point that when multiple
aux-capabilities are given, there are multiple relatedResources, and it
is not immediately clear which relatedResource belongs to which
aux-capability. I guess one could infer that from the name of the link,
or (even better?) one could add an attribute (e.g. the standardId?) to
each relatedResource that makes clear whether a capability is linked
here, and of which type.
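
As a rough sketch of what such an attribute would buy us: if each
servedBy entry carried a standardId, a client could pair every
aux-capability with its main service and notice missing links. The
`standardId` key on the related resource and the `ivo://example/tap`
identifier are assumptions of mine; the two aux standardIds are the ones
from your example record.

```python
aux_capabilities = [
    "ivo://ivoa.net/std/SIA#aux",
    "ivo://ivoa.net/std/TAP#aux",
]

# Hypothetical servedBy entries, each tagged with the standard it serves.
served_by = [
    {"ivoid": "ivo://example/tap", "standardId": "ivo://ivoa.net/std/TAP"},
]

def match_main_services(aux_caps, related):
    """Map each aux capability to the ivoid of its main service,
    or to None if no servedBy entry declares that standard."""
    by_standard = {r["standardId"]: r["ivoid"] for r in related}
    return {cap: by_standard.get(cap.split("#")[0]) for cap in aux_caps}

matches = match_main_services(aux_capabilities, served_by)
for cap, main in matches.items():
    print(cap, "->", main or "no main service declared")
```

In this toy input the TAP capability resolves to its main service, while
the SIA capability has no counterpart, which is the situation I was
puzzled by above.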

* Service enumeration constraining the version (Section 2.2)
I would not say that querying for all available TAP or SIA services, in
all their versions, is very unlikely. I expect that clients try to be
backward compatible, supporting as many versions as possible. So if I
have a TAP client, I expect that client would want to show all the
TAP services (regardless of the version, or including all the versions
it supports, e.g. TAP-1.0, TAP-1.1, TAP-2.0).
Okay, maybe you don't want all the versions listed, because you can
never be sure if the client can deal with TAP-5.0 or so out of the box...
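
What I have in mind on the client side is roughly the following. This is
only a sketch: the rows, the identifiers and the version numbers are
made up, and I am not assuming anything about how the registry actually
encodes versions.

```python
# Hypothetical registry rows: (ivoid, protocol, version).
services = [
    ("ivo://example/tap-a", "TAP", "1.0"),
    ("ivo://example/tap-b", "TAP", "1.1"),
    ("ivo://example/tap-c", "TAP", "5.0"),
    ("ivo://example/sia-a", "SIA", "1.0"),
]

# Versions this imaginary client knows how to talk to.
SUPPORTED_TAP_VERSIONS = {"1.0", "1.1", "2.0"}

def usable_services(rows, protocol, supported_versions):
    """Keep services of the given protocol in any supported version."""
    return [ivoid for ivoid, proto, version in rows
            if proto == protocol and version in supported_versions]

tap_services = usable_services(services, "TAP", SUPPORTED_TAP_VERSIONS)
print(tap_services)
```

The client lists every service in a version it supports and silently
skips the TAP-5.0 one it cannot be sure about.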

* Arguments against split-metadata approach (Sec. 1.2.2)
As you state there, it would be rather elegant to keep data collections
and services separated, in different registry records, and link them
with each other. I very much like that idea. But you have some arguments
against that:

1. hard to migrate
=> Whatever the new approach will be, the records will have to be
updated. Maybe this approach is harder, but I think it could be worth
the effort.

2. duplication of entries for services with one data set (1:1)
Couldn't one write a script that goes over all registry records of
datasets with just one service and duplicates them, turning one of the
copies into a service record and the other into the dataset record,
with a link to the service?
Yes, that would double the entries for these services, but what are a
few tens of thousands of extra rows in a registry table compared to a
cleaner and more elegant structure?

3. more complicated queries
There's usually a relational database behind the registry, is there
not? So joins of dataset and service records shouldn't be that hard.
One could even create a database view of the joined records, which in a
sense could imitate the full-capability approach.
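
To illustrate point 3, here is a minimal sketch using Python's built-in
sqlite3 module. The two-table schema, the column names and the
`ivo://example/...` identifiers are invented for this mail and are much
simpler than any real registry schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE services (
    ivoid TEXT PRIMARY KEY,
    standard_id TEXT,
    access_url TEXT
);
CREATE TABLE datasets (
    ivoid TEXT PRIMARY KEY,
    title TEXT,
    served_by TEXT REFERENCES services(ivoid)
);
INSERT INTO services VALUES
    ('ivo://example/serviceA', 'ivo://ivoa.net/std/TAP',
     'http://example.org/tap');
INSERT INTO datasets VALUES
    ('ivo://example/dataset1', 'Dataset 1', 'ivo://example/serviceA'),
    ('ivo://example/dataset2', 'Dataset 2', 'ivo://example/serviceA');

-- The view presents each dataset together with its service's
-- capability, although the service is stored only once.
CREATE VIEW dataset_capabilities AS
    SELECT d.ivoid AS dataset, d.title, s.standard_id, s.access_url
    FROM datasets d
    JOIN services s ON s.ivoid = d.served_by;
""")

rows = con.execute(
    "SELECT dataset, standard_id, access_url "
    "FROM dataset_capabilities ORDER BY dataset"
).fetchall()
for row in rows:
    print(row)
```

Clients that want the "one record with everything" picture could query
the view, while the underlying tables keep the service description in a
single place.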

So I still favour the split-metadata approach. It may cause a lot of
trouble right now, but I think it would be a better investment for
the future.

Cheers,

Kristin

P.S.: some style/typo remarks:
* Sec. 1.1, p. 3, paragraph with "CADC's Obscore table ..."
	Last sentence: "they should see ..."
	=> Maybe add "... but they don't."
	(Or even add, what users are getting instead.)
* Sec. 2.1, p.6, 3rd-last paragraph:
	"allow relationships between between"
	=> remove one "between"
* Sec. 2.3, p.8, "metadata of [...] collections themselves already has
registry records"
	=> "has" doesn't sound right, maybe "have"?

* Sec. 3, p.9, bottom: "Standard-sRegExt"
	=> the word separation at the line break before the 's' looks awkward.

On 01/14/2016 05:00 PM, Markus Demleitner wrote:
> Dear Colleagues,
> 
> I have just published a note on Discovering Data Collections Within
> Services in the IVOA document repository --
> http://ivoa.net/documents/Notes/DataCollect
> 
> What this is about, in brief: How do VO users find all the TAP tables
> we have out there, e.g., in VizieR or at IRSA or all those other
> wonderful data centers?  This particular rabbit hole goes a bit
> deeper, but the TAP issue is the most pressing use case.
> 
> This note does a few possibly objectionable things, in particular
> defining ivoids over which it really has no control; it also covers a
> few important data discovery use cases.
> 
> For both reasons, I would appreciate if people outside of the usual
> Registry gang could spare 30 minutes or so to try and understand what
> this is about and comment (even if slightly unsure), preferably *not*
> here but on registry at ivoa.net.
> 
> In case you're still hesitating: If there is no major outcry, I would
> like to propose this for an endorsed note.  Meaning: you should make
> up your mind about it fairly soon anyway...
> 
> Cheers,
> 
>          Markus
> 

-- 
-------------------------------------------------------
Dr. Kristin Riebe
E-Science & GAVO

Email: kriebe at aip.de
Phone: +49 331 7499-377
Room:  B6/25
-------------------------------------------------------
Leibniz-Institut für Astrophysik Potsdam (AIP)
An der Sternwarte 16, D-14482 Potsdam
Vorstand: Prof. Dr. Matthias Steinmetz, Matthias Winker
Stiftung bürgerlichen Rechts
Stiftungsverzeichnis Brandenburg: 26 742-00/7026
-------------------------------------------------------