Registering data collections

Mon Sep 8 14:50:19 CEST 2014

> 
> On Mon, 1 Sep 2014, Markus Demleitner wrote:
> 
>> Dear Registry WG,
>> 
>> Warning: this is a fairly long piece and a subtle matter, but this stuff
>> is important if we want to register complex-ish TAP services, keep
>> all-VO SIAP and SSAP queries feasible in the future while allowing
>> meaningful discovery, etc.  So, your attention is highly appreciated,
>> but it'll also be strained.  Only continue if you're in a serene mood
>> right now.
>> 
>> In Registry, we have been struggling with the problem of registering
>> data in a way that lets clients easily locate data access services for
>> that data.  One of the major reasons is that "federated" services (think
>> obscore or SIA services serving data from multiple sources) are a good
>> thing as they keep the number of services to hit in an all-VO query low,
>> but that registry queries should see the metadata of the individual
>> sources ("find quasar images from Mariana Trench observatory").
>> 
>> So, the situation we have is a bit like this (TAP serving as an example):
>> 
>> <Table 1> ----------       --------------- <relational registry>
>>                    \     /
>>    <Table 2> ---- [TAP Service] --------- <EPN-TAP table>
>>                    /     \
>> <Table 3> ----------       ----------------- <obscore table>
>> 
>> -- and people should, with simple registry queries, be able to find
>> Table 1 or the obscore table via their metadata *and* figure out the
>> access URL of the TAP Service where you can actually work with them.
>> Plus the TAP service itself should be discoverable, of course, by a
>> query like "give me all TAP services" (that would not return the Tables
>> and the other resources, if at all possible, to avoid confusion when
>> asking people "Which TAP service should I use?").
>> 
>> The obvious idea was to use relationships.  Essentially, a
>> DataCollection (Table 1, obscore, etc) would say it is servedBy the data
>> service; this plan was discussed at the Urbana interop
>> (http://wiki.ivoa.net/internal/IVOA/InterOpMay2012Registry/dem-vods.pdf)
>> 
>> However, this plan has the drawback that clients will have to query
>> through whatever is used to expose relationships.  This means either one
>> registry query per record or fairly messy queries, as discussed for
>> RegTAP at the Hawaii interop
>> (http://wiki.ivoa.net/internal/IVOA/InterOpSep2013Registry/regtap.pdf,
>> section "uneasy relationships").
>> 
>> So, an alternative approach was proposed at the Madrid interop, in which
>> essentially DataCollections would grow capabilities
>> (http://wiki.ivoa.net/internal/IVOA/InterOpMay2014Registry/Plante-RWGMay2014.pdf);
>> instead of (or in addition to) the relationship links, the resource
>> records would simply contain the relevant capabilities (e.g., SIA, TAP)
>> of the services serving them.  To avoid a schema change, we agreed to
>> simply use CatalogService (or DataService) records to register
>> DataCollections and probably phase out DataCollections, but that's just
>> an implementation detail (I guess).
>> 
>> I've tried that, and immediately I got complaints.  The main reason is
>> that registry queries get "poisoned".  Imagine a TAP service publishing
>> 20 data collections which are individually registered -- which is not at
>> all unreasonable.  When a client now asks for "all TAP services defined
>> in the VO" (as TOPCAT does), it gets 20 identical access URLs.  That's
>> obviously not intended.
>> 
>> This effect goes from annoying to outright harmful with registry queries
>> like "give me all ObsCore services" (which asks for certain values of
>> dataModel *within capability*) -- an all-VO query would then hit the
>> service 20 times.  Well, clients could uniq on access_url, but it doesn't
>> feel right to require a DISTINCT in queries or tell clients to do
>> significant post-processing of what they get back from the registry.
>> 
>> Such discovery queries typically check for a certain standard_id on a
>> capability, maybe like this:
>> 
>>  SELECT ivoid, access_url 
>>  FROM rr.capability 
>>    NATURAL JOIN rr.resource
>>    NATURAL JOIN rr.interface
>>  WHERE standard_id='ivo://ivoa.net/std/sia'
>>    AND intf_type='vs:paramhttp'
>>    AND 1=ivo_hashlist_has('infrared', waveband)
>> 
>> -- so, one possibility would be that only the actual service has the
>> "full" capability and turns up in searches like that, whereas all
>> the data collections would have an "auxiliary" standard_id; in this
>> case, maybe ivo://ivoa.net/std/sia#aux.  So, when someone were to look
>> for "all SIA resources having some physics" they would say
>> 
>>  ...
>>  WHERE standard_id like 'ivo://ivoa.net/std/sia%'
>>  ...
>> 
>> -- and then be aware that duplicate services are likely in the result.
>> Those hopefully wouldn't hurt as these lists would typically be shown to
>> the user to inspect their metadata (titles, authors, etc -- those would
>> all be different between the various records) rather than use their
>> access URLs directly.
>> 
>> So, essentially, there would be "discover for all-VO query"-type queries
>> using the "primary" standardID and "discover for particular
>> resource"-type queries that would allow both primary and secondary
>> standardIDs.  I'll admit I'm not sure if these two cases are always
>> terribly clear-cut; it'd really be up to the client authors to figure
>> that out.
>> 
>> One field where I struggle is discovery of TAP services implementing
>> data models -- which is actually one of the drivers of this.  The scheme
>> is that, within capability, there is a data model element saying things
>> like "there's an obscore table in here" or "we have the relational
>> registry".
>> 
>> The question is: would the secondary capabilities have these, too?  If
>> not, that's an implementation liability (the capability element would
>> need to know where it is) as well as conceptually difficult (so, where
>> *do* the data model elements turn up?  In the individual data
>> collections registering the individual tables perhaps?).  If they do,
>> then "give me all obscore services"-type queries would have to check the
>> standardId -- which may not be a big deal.
>> 
>> There's also the issue of the VOSI capabilities (tableMetadata,
>> capabilities, availability).  In normal registry records, these are (or
>> should be) present.  Repeating them for the subordinate service seems
>> excessive, even more so since for TAP, where existing clients use them,
>> their URLs are computable from the access URL.  But of course that's not
>> always true -- there could be subordinate datalink services, for
>> instance.  I guess I'll draw cloudy shapes into the air here and mutter
>> "relationship".
>> 
>> 
>> Well -- thanks for making it here.
>> 
>> If you have good ideas how to tackle this mess, please do speak up.
>> Even if your idea may seem bad at first.
>> 
>> Me, I'm leaning towards trying the "auxiliary" capabilities -- these at
>> least shouldn't break anything.  I'd then start to push capabilities
>> with standardIds
>> 
>> ivo://ivoa.net/std/ssa#aux
>> ivo://ivoa.net/std/tap#aux
>> ivo://ivoa.net/std/sia#aux
>> 
>> soon -- future standardIds for these might then be 
>> ivo://ivoa.net/std/tap#1.1-aux.
>> 
>> But convincing me that's hare-brained shouldn't be hard.  After all,
>> my previous two proposals didn't turn out all that great, so my
>> self-confidence in this matter has taken a hit or two.
>> 
>> Cheers,
>> 
>>        Markus
>> 
>> 

Hi,

I am not sure that I have a solution to this, but I would observe that one of the design goals of the original relationships idea was that the registry should be normalised, so that there was only one record for, e.g., SDSS (a Data Collection), and in principle it was easy to query for services related to that Data Collection. There is also a practical resource curation reason for multiple (diversely owned) Data Services to point to the single Data Collection record rather than try to add anything to the existing record: the Data Service owner might not have permission to edit records in the owning Authority. I think that in principle the data model of the resource registry does allow for the expression of all the necessary relationships, whilst keeping this curatorial convenience.

I think that I follow your argument that it is difficult/impossible to form queries with ADQL that can do the sorts of “find me the services that give me column x from data collection y” queries, but that seems to be the fault of ADQL not being at least as expressive as SQL (e.g. missing unions) - so it would seem better to me to add facilities to ADQL to solve this problem rather than corrupting the registry data model (or at the very least introducing curatorial inconveniences).

Another design conflict that has long troubled the registry (and other IVOA services for that matter) is the tension between the “high level” and the “low level” view of what the service/protocol should be able to do. I believe that the protocols should be designed with the “low level” mindset, by which I mean that the protocol/query language should be maximally expressive, which might mean that only “experts” can form queries. I do not think that this difficulty should concern us too much, as a “non-expert” will mostly interact via specialised client software which can hide much of the complexity from the end user. A good client could package the “find me the services that give me column x from data collection y” query, behind which there might actually be multiple “low level” queries to the registry.

The most important quality of the “low level” queries is that they give consistent answers against the registry model, and this is what RegTAP regularizes compared to the previous definition of the querying interfaces in Registry Interface 1.0. It should be remembered that the whole of the registry metadata is not that large, so if the only way to do some complex “high level” queries is by multiple passes, then I think that is a better solution than allowing various forms of denormalisation to make the data model a bit blurry.
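
To make the multiple-pass idea concrete, here is a minimal sketch of how a client might wrap two "low level" lookups behind one "high level" call. It is only an illustration under stated assumptions: the records are hypothetical in-memory stand-ins rather than real registry entries, the `served_by` field stands in for whatever mechanism exposes relationships, and the function names are invented for this example. Pass 1 finds data collections by their own metadata; pass 2 follows the relationship to the services and keeps each access URL once.

```python
# Hypothetical, in-memory stand-ins for registry records.  None of these
# identifiers or URLs are real registry entries; a real client would obtain
# the same information through registry queries.
COLLECTIONS = [
    {"ivoid": "ivo://example/coll/a", "title": "Quasar images",
     "served_by": ["ivo://example/tap", "ivo://example/sia"]},
    {"ivoid": "ivo://example/coll/b", "title": "Quasar spectra catalogue",
     "served_by": ["ivo://example/tap"]},
    {"ivoid": "ivo://example/coll/c", "title": "Stellar photometry",
     "served_by": ["ivo://example/tap"]},
]

SERVICES = {
    "ivo://example/tap": {"standard_id": "ivo://ivoa.net/std/tap",
                          "access_url": "http://example.org/tap"},
    "ivo://example/sia": {"standard_id": "ivo://ivoa.net/std/sia",
                          "access_url": "http://example.org/sia"},
}


def find_collections(keyword):
    """Pass 1: data collections whose title mentions the keyword."""
    keyword = keyword.lower()
    return [c for c in COLLECTIONS if keyword in c["title"].lower()]


def services_for(collections, standard_id):
    """Pass 2: follow the served-by links of the given collections.

    Returns a dict mapping each matching access URL to the list of
    collection identifiers reachable through it, so every service
    appears only once however many collections it serves.
    """
    by_url = {}
    for coll in collections:
        for service_id in coll["served_by"]:
            service = SERVICES.get(service_id)
            if service and service["standard_id"] == standard_id:
                by_url.setdefault(service["access_url"], []).append(coll["ivoid"])
    return by_url


def discover(keyword, standard_id):
    """The "high level" query: two "low level" passes hidden from the user."""
    return services_for(find_collections(keyword), standard_id)


if __name__ == "__main__":
    for url, ivoids in discover("quasar", "ivo://ivoa.net/std/tap").items():
        print(url, ivoids)
```

In this sketch the data collection records stay normalised (one record per collection, no copied capabilities), and the de-duplication of access URLs happens in the client rather than in the registry content.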

Cheers,
	Paul.

p.s. As a historical note, the old AstroGrid Desktop (which used XQuery against the XML version of the registry to get the full expressiveness that it needed in RegInterface 1.0) had several built-in high level queries (for which you only supplied the arguments) as well as its own simple “high level” query language (which got translated to XQuery) and the possibility of entering XQuery directly.