linking capabilities with tablesets

Fri Mar 22 09:32:34 CET 2024

Dear Registry,

On Thu, Mar 21, 2024 at 04:53:36PM +0100, gilles landais via registry wrote:
> <capability xsi:type="cs:ConeSearch"
> standardID="ivo://ivoa.net/std/ConeSearch">
>     <serves>
>         <reftable name='@table1' />
>         <reftable name='@table2' />
>     </serves>
> </capability>
>
> ...
>
> <tableset>
>     <schema>
>        <table>
>            <name>table1</name>
> ....

Hmha... this has come up now and then before, and I cannot say I'm
*particularly* keen on this kind of thing.  The main reason is that
inter-branch references in XML have a way of getting out of hand.  In
this particular case, also think about the discovery pattern once you
map this into RegTAP; everything I can think of off-hand looks fairly
ugly.

Still, *if* you want to go for it, the thing to touch is the SCS
standard, which should specify its Registry schema (taking over from
SimpleDALRegExt).  In there, for simplicity I'd just define an
element <queriesTable> (say), which would just contain the table name
without further syntax (did you have a special reason to add the "@"
in your example?).  Cone search only queries one table, so this could
be maxOccurs="1", which further simplifies usage.

In RegTAP, this would be mapped into rr.res_detail, the xpath would be
"/queriesTable".

But as I said: the ugly part of this is the client work; try an
implementation of the discovery of this before you start writing the
actual specification.
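
To get a feel for that client work, here is a minimal sketch of what the
discovery query might look like.  It assumes the hypothetical
<queriesTable> element exists and is mapped to rr.res_detail with the
xpath "/queriesTable" as suggested above, and that the detail row carries
the cap_index of the capability it belongs to.  The function name is
made up for illustration; the rr tables and columns are the standard
RegTAP ones.

```python
def cone_search_for_table_query(table_name: str) -> str:
    """Build an ADQL query finding cone search access URLs for one table.

    Assumes a hypothetical <queriesTable> capability child mapped to
    rr.res_detail with detail_xpath '/queriesTable'.
    """
    # Double single quotes so the name is safe inside an ADQL string literal.
    safe_name = table_name.replace("'", "''")
    return (
        "SELECT cap.ivoid, intf.access_url\n"
        "FROM rr.capability AS cap\n"
        "JOIN rr.interface AS intf\n"
        "  ON cap.ivoid = intf.ivoid AND cap.cap_index = intf.cap_index\n"
        "JOIN rr.res_detail AS det\n"
        "  ON cap.ivoid = det.ivoid AND cap.cap_index = det.cap_index\n"
        "WHERE cap.standard_id = 'ivo://ivoa.net/std/conesearch'\n"
        "  AND det.detail_xpath = '/queriesTable'\n"
        f"  AND det.detail_value = '{safe_name}'"
    )


if __name__ == "__main__":
    print(cone_search_for_table_query("table1"))
```

Even in this simple form, that is a three-way join that comes on top of
whatever subject or free-text constraints the client already applies.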

Me, I'd prefer a different way to clean this up.  Part of it is
metadata work, the rest may be protocol work.

Metadata work
-------------

Part of this problem is the VizieR policy to group all tables
belonging to one publication into one VO resource.  Admittedly, this
is the right thing to do in several contexts, in particular if
multiple tables primarily work as one unit.  A simple test would be:
Does it make sense to JOIN these tables?  If it does (classic example:
RegTAP), it should probably be a single resource.

In other important cases, and I believe many of the multiple-SCS
resources are of that sort, the publication-induces-resource policy
leads to clumsy discovery.

Let me invent a paper "Recent discoveries with the Volute Radio
Dish", which contains three tables, "Cataclysmic Stars", "Radio
Galaxies and QSOs", and "Solar System Objects".  I think these should
become three resources, two of which would have SCS services, and one
probably an EPN-TAP one.

Sure, you *could* make a resource that has all the relevant
capabilities, all the necessary subject keywords, and has a
description with three sections that discuss the various collections
in turn.

But that would be painful for a machine that, say, iterates over all
cone searches associated to records giving
#cataclysmic-variable-stars (or wider) as a subject; I don't think
there is a way for them to avoid hitting the QSO table and perhaps
even puzzle about the EPN-TAP service.
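
The effect can be illustrated with a toy model, which is only a sketch:
the Resource class, the resource titles, and all subject keywords other
than cataclysmic-variable-stars are invented for the example, and real
Registry records carry far more metadata.  It compares a subject-based
selection of cone searches for the combined record against the same
selection for three split records.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A toy stand-in for a Registry resource record."""
    title: str
    subjects: set[str]
    cone_searches: list[str] = field(default_factory=list)


def cone_searches_for_subject(resources, subject):
    """Return the cone search tables of all resources carrying the subject."""
    return [
        table
        for res in resources
        if subject in res.subjects
        for table in res.cone_searches
    ]


# One resource per publication: subjects and services are lumped together.
combined = [
    Resource(
        "Recent discoveries with the Volute Radio Dish",
        {"cataclysmic-variable-stars", "radio-galaxies", "quasars",
         "solar-system"},
        ["Cataclysmic Stars", "Radio Galaxies and QSOs"],
    )
]

# One resource per unit of discovery.
split = [
    Resource("Volute: Cataclysmic Stars",
             {"cataclysmic-variable-stars"}, ["Cataclysmic Stars"]),
    Resource("Volute: Radio Galaxies and QSOs",
             {"radio-galaxies", "quasars"}, ["Radio Galaxies and QSOs"]),
    Resource("Volute: Solar System Objects", {"solar-system"}),
]

print(cone_searches_for_subject(combined, "cataclysmic-variable-stars"))
print(cone_searches_for_subject(split, "cataclysmic-variable-stars"))
```

With the combined record, the selection also returns the QSO table; with
the split records, only the cataclysmic stars table comes back.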

I always liked the term "unit of discovery"; what that is keeps
requiring thought and may even change over time as use cases change.
But I'm pretty sure at this time you should not, say, stick both
gaia_source and the light curves into one resource record (as in
I/355).  Think of metadata like "which product type do you serve?" as
in https://github.com/ivoa-std/VODataService/pull/1.

Punting such decisions down to the capability level is a recipe for
constant grief and feature creep on the levels of tables and
capabilities, which would keep growing metadata we already have
mapped at the resource level.  Properly dissecting tables and
services into well-fitting units of discovery saves that grief and
keeps the whole system manageable, with reasonable, at least
potentially expectable queries.

Of course I realise that even figuring out which of the existing
multi-SCS resources "should" (in my reasoning) be split up is a
Herculean task that's not easily tackled.  But it's probably not
orders of magnitude more complicated than teaching the SCS clients
the discovery patterns you need with the table references, not to
mention the effort of repeating resource-level metadata in tablesets
and capabilities.

By the way, that latter way would also include making the table
descriptions hit by the standard freetext queries (which neither
TOPCAT nor pyVO nor, to my knowledge, anyone else does these days).
That's another aspect of this that I don't like.

For me (who's not VizieR), it's easy to say that I'd rather spend
work on aligning the metadata models than on doing client work, but
I'll say it anyway :-)

Protocol work
-------------

For the other cases, where there is, in essence, a single resource
that has multiple cone searches (perhaps: multiple epochs of the same
things or so), I believe the right way to deal with them is fixing
SCS.

That fix would be to let you declare just one SCS capability, but
the service then requires passing in a table name.  I distinctly
remember that SCS at one point already had a facility for passing in
table names, but I don't find it any more.

Let's do this again, and in the cases when there really *are*
multiple SCS-able tables in a resource, the associated SCS service
will send back a nice error message when clients don't pass in a table
name.  This then fixes the problem of people blindly querying a
random table when there is more than one -- they get an error message
if they're not explicit about what exactly they want.
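
On the service side, the behaviour could look roughly like the following
sketch.  The parameter name TABLE, the function, and the exception class
are assumptions made for illustration; SCS currently defines none of
them.

```python
class TableSelectionError(ValueError):
    """Raised when a request does not identify exactly one queryable table."""


def select_table(params: dict[str, str], tables: list[str]) -> str:
    """Pick the table a cone search request should run against.

    params: request parameters; "TABLE" is a hypothetical parameter name.
    tables: non-empty list of the SCS-able tables of the resource.
    """
    requested = params.get("TABLE")
    if requested is None:
        if len(tables) == 1:
            # Single-table services keep working for existing clients.
            return tables[0]
        raise TableSelectionError(
            "This service queries several tables; pass TABLE as one of: "
            + ", ".join(tables)
        )
    if requested not in tables:
        raise TableSelectionError(
            f"Unknown table {requested!r}; choose one of: " + ", ".join(tables)
        )
    return requested
```

In this sketch a resource with a single table behaves as before, so
existing clients only notice a difference where the choice of table was
ambiguous anyway.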

Giving users some UI to choose the tables then actually *is* client
work, but it's client work that is in line with our current (and IMHO
reasonable) discovery practices.

Sorry for that sermon; but talking about what to discover when and how
never is simple, and we've gotten it wrong several times in the
past.  Fixing things after the fact is even harder (cf.
discovering data collections...).

Thanks,

          Markus