Registering data collections

Mon Sep 15 11:43:00 CEST 2014

Hi Paul,

On Mon, Sep 08, 2014 at 12:50:19PM +0000, Paul Harrison wrote:
> I am not sure that I have a solution to this, but I make the
> observation that amongst the design goals of using the original
> relationships idea, was that the registry should be normalised so
> that there was only one e.g. SDSS (Data collection) record, so that
> in principle it was easy to query for services related to that Data
> Collection. There is a practical resource curation reason for
> multiple (diversely owned) Data Services to point to the single
> Data Collection record rather than try to add anything to the
> existing record in that the Data Service owner might not be able to
> edit the Data Collection record if they do not have permission to
> edit in the owning Authority. I think that in principle the data
> model of the resource registry does allow for the expression of all
> the necessary relationships, whilst keeping this curatorial
> convenience.

I think it is important to keep this model in mind, as it certainly
would be appealing; however, given that CatalogServices have tablesets
and essentially no existing registry records work like this, I think
it's too late to establish this as the canonical model.

Now, is it preferable to have a part of the Registry "normalised"
like that, while a substantial rest would remain data+service in one?
A while ago, I'd have said "let's see", but my current "no" is, I
think, based not so much on technology as on a general skepticism
towards mixing paradigms.

> I think that I follow your argument that it is difficult/impossible
> to form queries with ADQL that can do the sorts of "find me the
> services that give me column x from data collection y" queries, but
> that seems to be the fault of ADQL not being at least as expressive
> as SQL (e.g. missing unions) - so it would seem better to me to add
> facilities to ADQL to solve this problem rather than corrupting the
> registry data model (or at the very least introducing curatorial
> inconveniences).

The lack of a UNION operator in ADQL plays a role (with it, queries
like "direct services UNION indirect services" would be much more
straightforward), but I don't think it's the one that seals the deal.
What seals the deal is that regardless of query technology, once it's
powerful enough to be interesting, people would have to, in effect,
write two queries, one direct, one going through relationship.  I
cannot see a way around this (it's a bit like Gödel's theorem, that
one).

So, I am by now convinced it has to be one or the other, if at all
possible.

"Indirect" registration everywhere would be preferable, but it
doesn't seem to be feasible.  So, adding the small "auxiliary"
capabilities to the data collection records would be next on my list
in order of preferability.

> designed with the "low level" mindset, by which I mean that the
> protocol/query language should be capable of being maximally
> expressive which might mean that only "experts" can form queries. I

True -- I take no issue with that.  However, designing such that it's
easier to do the right thing than to do the wrong thing is, I think,
an important goal, too.  Here, the limitation *may* be a bit specific
to RegTAP, but again I have a feeling the intellectual effort
required to figure out what to join on what else when joining through
relationship is indicative of a deeper property of the data model
(N.B. I'm not suggesting we do away with relationship -- I'm just
saying it's comparatively tricky when used in somewhat more complex
queries against the Registry schema).

> do not think that this difficulty should concern us too much, as
> most of the interaction by a "non-expert" will most often be via
> specialised client software which can hide much of the complexity
> from the end user. A good client could package the "find me the
> services that give me column x from data collection y" query behind
> which might actually be multiple "low level" queries to the
> registry.

My concern is rather that "find all SIA services talking about
galaxies" should also find an SIA service that merely exposes a data
collection talking about galaxies (as well as SIA services that have
"galaxies" in their own description).  And having to make two
distinct queries for that would really complicate things.
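
To make the two-query problem concrete, here is a minimal sketch in
Python using an in-memory SQLite database.  The table and column names
are only loosely modelled on the RegTAP schema and heavily simplified,
and all records and ivoids are made up for illustration; SQLite
stands in for a real registry service and, unlike ADQL, offers UNION.

```python
import sqlite3

SIA = "ivo://ivoa.net/std/sia"  # standard id used in the toy data


def build_toy_registry():
    """Create an in-memory toy registry; all records are hypothetical."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE capability (ivoid TEXT, standard_id TEXT);
        CREATE TABLE res_subject (ivoid TEXT, res_subject TEXT);
        CREATE TABLE relationship (
            ivoid TEXT, relationship_type TEXT, related_id TEXT);
    """)
    con.executemany("INSERT INTO capability VALUES (?, ?)", [
        ("ivo://example/sia-direct", SIA),
        ("ivo://example/sia-indirect", SIA),
        ("ivo://example/sia-stars", SIA),
    ])
    con.executemany("INSERT INTO res_subject VALUES (?, ?)", [
        ("ivo://example/sia-direct", "galaxies"),
        ("ivo://example/collection", "galaxies"),  # data collection only
        ("ivo://example/sia-stars", "stars"),
    ])
    con.executemany("INSERT INTO relationship VALUES (?, ?, ?)", [
        ("ivo://example/sia-indirect", "service-for",
         "ivo://example/collection"),
    ])
    return con


# Services whose own record carries the subject.
DIRECT = """
    SELECT c.ivoid FROM capability c
    JOIN res_subject s ON c.ivoid = s.ivoid
    WHERE c.standard_id = ? AND s.res_subject = 'galaxies'
"""

# Services that only point to a data collection carrying the subject.
INDIRECT = """
    SELECT c.ivoid FROM capability c
    JOIN relationship r ON c.ivoid = r.ivoid
    JOIN res_subject s ON r.related_id = s.ivoid
    WHERE c.standard_id = ? AND r.relationship_type = 'service-for'
      AND s.res_subject = 'galaxies'
"""


def find_direct(con):
    return sorted(row[0] for row in con.execute(DIRECT, (SIA,)))


def find_indirect(con):
    return sorted(row[0] for row in con.execute(INDIRECT, (SIA,)))


def find_all(con):
    # Both patterns are needed to get the complete answer.
    query = DIRECT + " UNION " + INDIRECT
    return sorted(row[0] for row in con.execute(query, (SIA, SIA)))


if __name__ == "__main__":
    registry = build_toy_registry()
    print("direct:  ", find_direct(registry))
    print("indirect:", find_indirect(registry))
    print("all:     ", find_all(registry))
```

Even with UNION available, whoever writes the query has to know about
both registration patterns and spell out both branches; neither branch
alone returns the complete list.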

But Mark's observation was of course right: We should be writing down
the use cases for registering DataCollections and using these records
more explicitly.  Or rather, non-use cases, as this is mainly about
the Registry hopefully doing the right thing magically.

I'll try and give these at the interop -- if you have use cases to
contribute, let me know.

Cheers,

            Markus