Registering data collections

Mon Sep 8 12:18:48 CEST 2014

Markus,

it's obvious, since the first couple of attempts at this have had
their problems, that the answer is not obvious.  I wonder if that
is partly because it's not clear what problem we're trying to solve.

One way to tackle it would be to make sure we've got a clear idea
of the queries we are trying to support: assemble a list of use
cases that might relate to federated services etc., and work
backwards from there to the design of the relevant parts
of the model/interface.  Not all use cases need to be supported
with equal ease: the RegTAP model should be capable of supporting
all kinds of rich queries about relationships between resources,
but common ones, and especially those which might be issued
by humans writing ADQL, ought to be reasonably easy to do.
To some extent I'm stating the obvious, but the fact that
the previous attempts have presented nasty surprises in practice
suggests that the thinking has been top-down rather than bottom-up
before now.

From that point of view the registry client paper you're currently
preparing, in particular the "Common Registry Queries" section,
could be a good tool for thinking about this.

The auxiliary capabilities thing sounds like it might do the job
(though you couldn't call it especially elegant), but without
looking at how it would work for the particular use cases we care
about, it's hard to be sure.

Just my 0.02.

Mark

On Mon, 1 Sep 2014, Markus Demleitner wrote:

> Dear Registry WG,
> 
> Warning: this is a fairly long piece and a subtle matter, but this stuff
> is important if we want to register complex-ish TAP services, keep
> all-VO SIAP and SSAP queries feasible in the future while allowing
> meaningful discovery, etc.  So, your attention is highly appreciated,
> but it'll also be strained.  Only continue if you're in a serene mood
> right now.
> 
> In Registry, we have been struggling with the problem of registering
> data in a way that lets clients easily locate data access services for
> that data.  One of the major reasons is that "federated" services (think
> obscore or SIA services serving data from multiple sources) are a good
> thing as they keep the number of services to hit in an all-VO query low,
> but that registry queries should see the metadata of the individual
> sources ("find quasar images from Mariana Trench observatory").
> 
> So, the situation we have is a bit like this (TAP serving as an example):
> 
> <Table 1> ----------       --------------- <relational registry>
>                     \     /
>     <Table 2> ---- [TAP Service] --------- <EPN-TAP table>
>                     /     \
> <Table 3> ----------       ----------------- <obscore table>
> 
> -- and people should, with simple registry queries, be able to find
> Table 1 or the obscore table via their metadata *and* figure out the
> access URL of the TAP Service where you can actually work with them.
> Plus the TAP service itself should be discoverable, of course, by a
> query like "give me all TAP services" (that would not return the Tables
> and the other resources, if at all possible, to avoid confusion when
> asking people "Which TAP service should I use?").
> 
> The obvious idea was to use relationships.  Essentially, a
> DataCollection (Table 1, obscore, etc) would say it is servedBy the data
> service; this plan was discussed at the Urbana interop
> (http://wiki.ivoa.net/internal/IVOA/InterOpMay2012Registry/dem-vods.pdf)
> 
> However, this plan has the drawback that clients will have to query
> through whatever is used to expose relationships.  This means either one
> registry query per record or fairly messy queries, as discussed for
> RegTAP at the Hawaii interop
> (http://wiki.ivoa.net/internal/IVOA/InterOpSep2013Registry/regtap.pdf,
> section "uneasy relationships").
> 
> So, an alternative approach was proposed at the Madrid interop, in which
> essentially DataCollections would grow capabilities
> (http://wiki.ivoa.net/internal/IVOA/InterOpMay2014Registry/Plante-RWGMay2014.pdf);
> instead of (or in addition to) the relationship links, the resource
> records would simply contain the relevant capabilities (e.g., SIA, TAP)
> of the services serving them.  To avoid a schema change, we agreed to
> simply use CatalogService (or DataService) records to register
> DataCollections and probably phase out DataCollections, but that's just
> an implementation detail (I guess).
> 
> I've tried that, and immediately I got complaints.  The main reason is
> that registry queries get "poisoned".  Imagine a TAP service publishing
> 20 data collections which are individually registered -- which is not at
> all unreasonable.  When a client now asks for "all TAP services defined
> in the VO" (as TOPCAT does), it gets 20 identical access URLs.  That's
> obviously not intended.
> 
> This effect goes from annoying to outright harmful with registry queries
> like "give me all ObsCore services" (which asks for certain values of
> dataModel *within capability*) -- an all-VO query would then hit the
> service 20 times.  Well, clients could uniq on access_url, but it doesn't
> feel right to require a DISTINCT in queries or tell clients to do
> significant post-processing of what they get back from the registry.
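
For what it's worth, the client-side "uniq on access_url" mentioned above
could look roughly like the sketch below.  It assumes the registry result
has already been parsed into dictionaries with "ivoid" and "access_url"
keys; the function name and the example records are made up for
illustration and are not part of any existing client.

```python
def unique_by_access_url(rows):
    """Keep the first row seen for each access URL, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        url = row["access_url"]
        if url not in seen:
            seen.add(url)
            unique.append(row)
    return unique


# Hypothetical registry result: one service record plus two data
# collection records that all point at the same TAP endpoint.
rows = [
    {"ivoid": "ivo://example/tap", "access_url": "http://example.org/tap"},
    {"ivoid": "ivo://example/table1", "access_url": "http://example.org/tap"},
    {"ivoid": "ivo://example/obscore", "access_url": "http://example.org/tap"},
    {"ivoid": "ivo://other/tap", "access_url": "http://other.example/tap"},
]
services = unique_by_access_url(rows)
```

Which record survives depends purely on the order of the result, so the
metadata shown for the service may come from a data collection record
rather than from the service itself.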
> 
> Such discovery queries typically check for a certain standard_id on a
> capability, maybe like this:
> 
>   SELECT ivoid, access_url 
>   FROM rr.capability 
>     NATURAL JOIN rr.resource
>     NATURAL JOIN rr.interface
>   WHERE standard_id='ivo://ivoa.net/std/sia'
>     AND intf_type='vs:paramhttp'
>     AND 1=ivo_hashlist_has('infrared', waveband)
> 
> -- so, one possibility would be to have only the actual service carry the
> "full" capability, so that only it turns up in searches like that, whereas
> all the data collections would have an "auxiliary" standard_id; in this
> case, maybe ivo://ivoa.net/std/sia#aux.  So, if someone were to look
> for "all SIA resources having some physics" they would say
> 
>   ...
>   WHERE standard_id like 'ivo://ivoa.net/std/sia%'
>   ...
> 
> -- and then be aware that duplicate services are likely in the result.
> Those hopefully wouldn't hurt as these lists would typically be shown to
> the user to inspect their metadata (titles, authors, etc -- those would
> all be different between the various records) rather than use their
> access URLs directly.
> 
> So, essentially, there would be "discover for all-VO query"-type queries
> using the "primary" standardID and "discover for particular
> resource"-type queries that would allow both primary and secondary
> standardIDs.  I'll admit I'm not sure if these two cases are always
> terribly clear-cut; it'd really be up to the client authors to figure
> that out.
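
To make the two query types concrete, here is a minimal sketch of how a
client might tell them apart.  It assumes the proposed "#aux" suffix is
adopted exactly as written above, and it compares case-insensitively as
a precaution; the function name and the example records are hypothetical.
It is also stricter than the LIKE pattern above, since it accepts only
the plain standard ID and its "#aux" variant.

```python
AUX_SUFFIX = "#aux"  # suffix proposed in the message; not an adopted standard


def matches_standard(standard_id, base, include_aux=False):
    """True if standard_id is base, or base + '#aux' when include_aux is set."""
    sid = standard_id.lower()
    base = base.lower()
    if sid == base:
        return True
    return include_aux and sid == base + AUX_SUFFIX


SIA = "ivo://ivoa.net/std/sia"

# Hypothetical (ivoid, standard_id) pairs as a registry might return them.
caps = [
    ("ivo://example/sia", "ivo://ivoa.net/std/sia"),
    ("ivo://example/coll1", "ivo://ivoa.net/std/sia#aux"),
    ("ivo://example/tap", "ivo://ivoa.net/std/tap"),
]

# "Discover for all-VO query": primary capabilities only.
all_vo = [ivoid for ivoid, sid in caps if matches_standard(sid, SIA)]

# "Discover for particular resource": primary and auxiliary capabilities.
per_resource = [
    ivoid for ivoid, sid in caps if matches_standard(sid, SIA, include_aux=True)
]
```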
> 
> One field where I struggle is discovery of TAP services implementing
> data models -- which is actually one of the drivers of this.  The scheme
> is that, within capability, there is a data model element saying things
> like "there's an obscore table in here" or "we have the relational
> registry".
> 
> The question is: would the secondary capabilities have these, too?  If
> not, that's an implementation liability (the capability element would
> need to know where it is) as well as conceptually difficult (so, where
> *do* the data model elements turn up?  In the individual data
> collections registering the individual tables perhaps?).  If they do,
> then "give me all obscore services"-type queries would have to check the
> standardId -- which may not be a big deal.
> 
> There's also the issue of the VOSI capabilities (tableMetadata,
> capabilities, availability).  In normal registry records, these are (or
> should be) present.  Repeating them for the subordinate service seems
> excessive, even more so since for TAP, where existing clients use them,
> their URLs are computable from the access URL.  But of course that's not
> always true -- there could be subordinate datalink services, for
> instance.  I guess I'll draw cloudy shapes into the air here and mutter
> "relationship".
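
As an illustration of "computable from the access URL" for TAP: the
sketch below assumes the usual TAP layout in which the VOSI endpoints
are child resources of the base URL named capabilities, availability
and tables.  The function name and the example URL are made up, and as
noted above none of this carries over to services such as datalink.

```python
def tap_vosi_urls(access_url):
    """Derive the VOSI endpoint URLs from a TAP base access URL."""
    base = access_url.rstrip("/")  # tolerate a trailing slash
    return {
        "capabilities": base + "/capabilities",
        "availability": base + "/availability",
        "tableMetadata": base + "/tables",
    }


urls = tap_vosi_urls("http://example.org/tap/")
```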
> 
> 
> Well -- thanks for making it here.
> 
> If you have good ideas how to tackle this mess, please do speak up.
> Even if your idea may seem bad at first.
> 
> Me, I'm leaning towards trying the "auxiliary" capabilities -- these at
> least shouldn't break anything.  I'd then start to push capabilities
> with standardIds
> 
> ivo://ivoa.net/std/ssa#aux
> ivo://ivoa.net/std/tap#aux
> ivo://ivoa.net/std/sia#aux
> 
> soon -- future standardIds for these might then be 
> ivo://ivoa.net/std/tap#1.1-aux.
> 
> But convincing me that's hare-brained shouldn't be hard.  After all,
> my previous two proposals didn't turn out all that great, so my
> self-confidence in this matter has taken a hit or two.
> 
> Cheers,
> 
>         Markus
> 
> 

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776  http://www.star.bris.ac.uk/~mbt/