Registering data collections

Mon Sep 1 14:21:24 CEST 2014

Dear Registry WG,

Warning: this is a fairly long piece and a subtle matter, but this stuff
is important if we want to register complex-ish TAP services, keep
all-VO SIAP and SSAP queries feasible in the future while allowing
meaningful discovery, etc.  So, your attention is highly appreciated,
but it'll also be strained.  Only continue if you're in a serene mood
right now.

In Registry, we have been struggling with the problem of registering
data in a way that lets clients easily locate data access services for
that data.  One of the major reasons this is hard is that "federated"
services (think obscore or SIA services serving data from multiple
sources) are a good thing as they keep the number of services to hit in
an all-VO query low, while registry queries should still see the
metadata of the individual sources ("find quasar images from Mariana
Trench observatory").

So, the situation we have is a bit like this (TAP serving as an example):

<Table 1> ----------       --------------- <relational registry>
                    \     /
    <Table 2> ---- [TAP Service] --------- <EPN-TAP table>
                    /     \
<Table 3> ----------       ----------------- <obscore table>

-- and people should, with simple registry queries, be able to find
Table 1 or the obscore table via their metadata *and* figure out the
access URL of the TAP Service where you can actually work with them.
Plus the TAP service itself should be discoverable, of course, by a
query like "give me all TAP services" (that would not return the Tables
and the other resources, if at all possible, to avoid confusion when
asking people "Which TAP service should I use?").

The obvious idea was to use relationships.  Essentially, a
DataCollection (Table 1, obscore, etc) would say it is servedBy the data
service; this plan was discussed at the Urbana interop
(http://wiki.ivoa.net/internal/IVOA/InterOpMay2012Registry/dem-vods.pdf).

However, this plan has the drawback that clients will have to query
through whatever is used to expose relationships.  This means either one
registry query per record or fairly messy queries, as discussed for
RegTAP at the Hawaii interop
(http://wiki.ivoa.net/internal/IVOA/InterOpSep2013Registry/regtap.pdf,
section "uneasy relationships").

So, an alternative approach was proposed at the Madrid interop, in which
essentially DataCollections would grow capabilities
(http://wiki.ivoa.net/internal/IVOA/InterOpMay2014Registry/Plante-RWGMay2014.pdf);
instead of (or in addition to) the relationship links, the resource
records would simply contain the relevant capabilities (e.g., SIA, TAP)
of the services serving them.  To avoid a schema change, we agreed to
simply use CatalogService (or DataService) records to register
DataCollections and probably phase out DataCollections, but that's just
an implementation detail (I guess).

I've tried that, and immediately I got complaints.  The main reason is
that registry queries get "poisoned".  Imagine a TAP service publishing
20 data collections which are individually registered -- which is not at
all unreasonable.  When a client now asks for "all TAP services defined
in the VO" (as TOPCAT does), it gets 20 identical access URLs.  That's
obviously not intended.

This effect goes from annoying to outright harmful with registry queries
like "give me all ObsCore services" (which asks for certain values of
dataModel *within capability*) -- an all-VO query would then hit the
service 20 times.  Well, clients could uniq on access_url, but it doesn't
feel right to require a DISTINCT in queries or tell clients to do
significant post-processing of what they get back from the registry.
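
For illustration, the client-side workaround mentioned above (uniq on
access_url) could look like the following Python sketch.  The rows and
the ivoids in them are made up; a real client would get them from a
registry query.

```python
def unique_by_access_url(rows):
    """Keep the first row seen for each access URL, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        url = row["access_url"]
        if url not in seen:
            seen.add(url)
            unique.append(row)
    return unique


# Hypothetical registry result: three records sharing one TAP endpoint.
rows = [
    {"ivoid": "ivo://example/tap", "access_url": "http://example.org/tap"},
    {"ivoid": "ivo://example/table1", "access_url": "http://example.org/tap"},
    {"ivoid": "ivo://example/obscore", "access_url": "http://example.org/tap"},
    {"ivoid": "ivo://other/tap", "access_url": "http://other.example/tap"},
]
services = unique_by_access_url(rows)
```

Note that which record survives depends on row order, so the metadata
kept for a shared endpoint is somewhat arbitrary; that is part of why
this kind of post-processing feels unsatisfying.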

Such discovery queries typically check for a certain standard_id on a
capability, maybe like this:

  SELECT ivoid, access_url 
  FROM rr.capability 
    NATURAL JOIN rr.resource
    NATURAL JOIN rr.interface
  WHERE standard_id='ivo://ivoa.net/std/sia'
    AND intf_type='vs:paramhttp'
    AND 1=ivo_hashlist_has('infrared', waveband)

-- so, one possibility would be to have only the actual service carry
the "full" capability and thus turn up in searches like that, whereas
all the data collections would have an "auxiliary" standard_id; in this
case, maybe ivo://ivoa.net/std/sia#aux.  So, when someone looks for
"all SIA resources having some physics" they would say

  ...
  WHERE standard_id like 'ivo://ivoa.net/std/sia%'
  ...

-- and then be aware that duplicate services are likely in the result.
Those hopefully wouldn't hurt as these lists would typically be shown to
the user to inspect their metadata (titles, authors, etc -- those would
all be different between the various records) rather than use their
access URLs directly.

So, essentially, there would be "discover for all-VO query"-type queries
using the "primary" standardID and "discover for particular
resource"-type queries that would allow both primary and secondary
standardIDs.  I'll admit I'm not sure if these two cases are always
terribly clear-cut; it'd really be up to the client authors to figure
that out.
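
To see the two query types side by side, here is a small Python sketch
that uses an in-memory SQLite table as a toy stand-in for the joined
rr.capability and rr.interface tables.  The table layout, the ivoids and
the URLs are assumptions made for the demonstration, and the #aux
standard_id is only the proposal from this mail, not an adopted
convention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-in for rr.capability joined with rr.interface; rows are made up.
conn.execute(
    "CREATE TABLE capability (ivoid TEXT, standard_id TEXT, access_url TEXT)"
)
conn.executemany(
    "INSERT INTO capability VALUES (?, ?, ?)",
    [
        ("ivo://example/sia", "ivo://ivoa.net/std/sia",
         "http://example.org/sia"),
        ("ivo://example/coll1", "ivo://ivoa.net/std/sia#aux",
         "http://example.org/sia"),
        ("ivo://example/coll2", "ivo://ivoa.net/std/sia#aux",
         "http://example.org/sia"),
    ],
)

# "Discover for all-VO query": exact match on the primary standard_id.
primary = conn.execute(
    "SELECT ivoid, access_url FROM capability"
    " WHERE standard_id = ? ORDER BY ivoid",
    ("ivo://ivoa.net/std/sia",),
).fetchall()

# "Discover for particular resource": the prefix match also returns the
# records carrying the proposed #aux standard_id.
everything = conn.execute(
    "SELECT ivoid, access_url FROM capability"
    " WHERE standard_id LIKE ? ORDER BY ivoid",
    ("ivo://ivoa.net/std/sia%",),
).fetchall()
```

With these rows, the exact match yields just the service record, while
the prefix match yields all three records, every one of them pointing at
the same access URL.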

One field where I struggle is discovery of TAP services implementing
data models -- which is actually one of the drivers of this.  The scheme
is that, within capability, there is a data model element saying things
like "there's an obscore table in here" or "we have the relational
registry".

The question is: would the secondary capabilities have these, too?  If
not, that's an implementation liability (the capability element would
need to know where it is) as well as conceptually difficult (so, where
*do* the data model elements turn up?  In the individual data
collections registering the individual tables perhaps?).  If they do,
then "give me all obscore services"-type queries would have to check the
standardId -- which may not be a big deal.

There's also the issue of the VOSI capabilities (tableMetadata,
capabilities, availability).  In normal registry records, these are (or
should be) present.  Repeating them for the subordinate service seems
excessive, even more so since for TAP, where existing clients use them,
their URLs are computable from the access URL.  But of course that's not
always true -- there could be subordinate datalink services, for
instance.  I guess I'll draw cloudy shapes into the air here and mutter
"relationship".
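
As a sketch of what "computable from the access URL" means for TAP: the
VOSI endpoints sit directly below the service's base URL, so a client
can derive them roughly as follows.  The function name and the example
URL are made up for illustration.

```python
def vosi_urls(tap_access_url):
    """Derive the VOSI endpoint URLs from a TAP base URL."""
    base = tap_access_url.rstrip("/")
    return {
        name: f"{base}/{name}"
        for name in ("availability", "capabilities", "tables")
    }


# Hypothetical TAP access URL; a trailing slash is tolerated.
endpoints = vosi_urls("http://example.org/tap/")
```

This only works because TAP fixes where these endpoints live; for other
protocols there is no such rule, which is the problem with leaving the
VOSI capabilities out of the subordinate records.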

Well -- thanks for making it here.

If you have good ideas how to tackle this mess, please do speak up.
Even if your idea may seem bad at first.

Me, I'm leaning towards trying the "auxiliary" capabilities -- these at
least shouldn't break anything.  I'd then start to push capabilities
with standardIds

ivo://ivoa.net/std/ssa#aux
ivo://ivoa.net/std/tap#aux
ivo://ivoa.net/std/sia#aux

soon -- future standardIds for these might then be 
ivo://ivoa.net/std/tap#1.1-aux.

But convincing me that's hare-brained shouldn't be hard.  After all, my
previous two proposals didn't turn out all that great, so my
self-confidence in this matter has taken a hit or two.

Cheers,

        Markus