SODA gripes (1): The Big One

Fri Jan 22 16:14:18 CET 2016

Dear DAL,

On Fri, Jan 22, 2016 at 03:19:54PM +0100, Laurent Michel wrote:
> As understand the problem is to know how to get the meta-data of a given
> dataset. These meta-data are requested to get the value ranges of the SODA
> query parameters.

At the risk of sounding like a broken record: No, the metadata of a
given dataset are *not* enough to figure out the parameter metadata.
The latter are a function of the *service* *and* the dataset.  Think,
for instance, re-calibration, rebinning, etc, which are almost
exclusively dependent on the service (but may be dependent on the
dataset, too).  More cases in point, also for things within the
current CSP requirements, are in

http://mail.ivoa.net/pipermail/dal/2016-January/007265.html
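
To make the "function of the service *and* the dataset" point
concrete, here is a toy sketch.  Everything in it is made up for
illustration: the Dataset and Service classes, their fields, and the
REBIN parameter are hypothetical and not taken from any SODA draft;
only BAND is meant to resemble a real cutout parameter.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    # Hypothetical dataset metadata: spectral coverage in metres.
    band_min: float
    band_max: float


@dataclass
class Service:
    # Hypothetical service properties, unknown to the dataset itself.
    can_rebin: bool
    max_rebin_factor: int


def parameter_metadata(ds: Dataset, svc: Service) -> dict:
    """Return {parameter name: (min, max)} for one dataset on one service."""
    # The BAND limits come from the dataset ...
    params = {"BAND": (ds.band_min, ds.band_max)}
    # ... whereas REBIN (a made-up parameter) exists only if the
    # service offers rebinning, whatever the dataset looks like.
    if svc.can_rebin:
        params["REBIN"] = (1, svc.max_rebin_factor)
    return params


cube = Dataset(band_min=0.21, band_max=0.22)
print(parameter_metadata(cube, Service(can_rebin=False, max_rebin_factor=1)))
print(parameter_metadata(cube, Service(can_rebin=True, max_rebin_factor=4)))
```

The same dataset yields different parameter metadata on the two
services, which is why dataset metadata alone cannot be enough.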

> That is unfortunately not possible with the actual DL service descriptors where only one range can be given for one parameter.
> There are 3 possibilities to sort this out
> A) Restricting DL response to one single dataset:
>     TO LATE

I don't think I understand the point you're trying to make here --
the situation with SODA is: you've discovered a dataset and now want
to slice and dice it.  So -- you already have either a PubDID and a
datalink service to turn it into a datalink document, or you have the
datalink document itself.  No?

What would you want to further restrict -- and why?

> B) Changing the the schema of the service descriptors to support
> multiple ranges for each parameter with a ref mechanism

What would a use case for multiple ranges be?  If this is about the
cursed thing with the DAL-embedded SODA descriptor, see below.

> C) Duplicating the service descriptors and declaring one resource per dataset
>     Could work with but messy VOTables.

Why would we want to do that?

I wanted to wait a bit before revisiting this whole "Big One" thing,
but perhaps waiting longer won't help much; also, I think I've
understood where much of the antipathy to proper parameter metadata
comes from.

It's the cursed DAL-embedded SODA descriptor.  So, I'd ask you all to
forget for a moment that it might exist at all.  I invented it for a
very specific use case (mass cutouts around a spectral line), and
while it may be useful for other things, too, I now believe it has
done much more harm than good.

For what we are talking about here, the slicing and dicing of data
cubes, it is largely useless, exactly because it doesn't contain
enough metadata.  And, as James pointed out, it can very well be an
implementation headache.

So, let's pretend it didn't exist; note that it was never required to
exist, and SODA doesn't need it.

SODA then is simply about defining semantics of certain parameters
you have in a datalink service descriptor.

Further, forget that datalink *services* admit passing in multiple
IDs.  That's a further optimisation that's not important for SODA.

Then, you can imagine the following situation, which I think is what
we should recommend to people that don't want to have lots of
complicated software:

  A DAL/Obscore request returns a result set with links to the
  datalink documents (you can generate those statically if you want;
  a few lines of Python, Perl, or perhaps STILTS+shell will do).

  The user selects a dataset they want to slice and dice, the client
  retrieves the datalink document and builds the UI from the service
  descriptor.  *Only here* does SODA become relevant.

This little (but all-important) last step is what SODA is (or at
least IMSNHO should be) about.  
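
For illustration, here is a minimal sketch of that last step in
Python, using only the standard library.  The sample document is made
up (example.org URLs, the BAND limits, and the standardID value are
assumptions, not copied from any real service), and the function name
input_params is hypothetical.  It only shows that the input
parameters and their ranges can be read straight from the service
descriptor in the datalink document.

```python
import xml.etree.ElementTree as ET

# VOTable 1.3 namespace, used to qualify element names in the paths below.
NS = {"v": "http://www.ivoa.net/xml/VOTable/v1.3"}

# Made-up datalink document, reduced to the service descriptor.
SAMPLE = """<VOTABLE xmlns="http://www.ivoa.net/xml/VOTable/v1.3" version="1.3">
  <RESOURCE type="meta" utype="adhoc:service">
    <PARAM name="standardID" datatype="char" arraysize="*"
           value="ivo://ivoa.net/std/SODA#sync-1.0"/>
    <PARAM name="accessURL" datatype="char" arraysize="*"
           value="http://example.org/soda/sync"/>
    <GROUP name="inputParams">
      <PARAM name="ID" datatype="char" arraysize="*"
             value="ivo://example.org/cubes?cube1"/>
      <PARAM name="BAND" datatype="double" arraysize="2" unit="m" value="">
        <VALUES>
          <MIN value="0.21"/>
          <MAX value="0.22"/>
        </VALUES>
      </PARAM>
    </GROUP>
  </RESOURCE>
</VOTABLE>"""


def input_params(datalink_xml):
    """Collect the input parameters of the first service descriptor."""
    root = ET.fromstring(datalink_xml)
    # Service descriptors are RESOURCEs with utype adhoc:service; for
    # simplicity this sketch looks at the first one only.
    service = root.find(".//v:RESOURCE[@utype='adhoc:service']", NS)
    params = {}
    if service is None:
        return params
    for par in service.iterfind("v:GROUP[@name='inputParams']/v:PARAM", NS):
        vmin = par.find("v:VALUES/v:MIN", NS)
        vmax = par.find("v:VALUES/v:MAX", NS)
        params[par.get("name")] = {
            "value": par.get("value"),
            "unit": par.get("unit"),
            "min": vmin.get("value") if vmin is not None else None,
            "max": vmax.get("value") if vmax is not None else None,
        }
    return params


for name, meta in input_params(SAMPLE).items():
    print(name, meta)
```

A client could build its UI from the resulting dictionary, for
instance a prefilled ID field and a BAND control limited to the
declared MIN and MAX.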

The mess with the service descriptor for a datalink service (i.e.,
turning complicated PubDIDs into Datalink documents) in the
DAL/TAP response, and the Datalink service itself, are just an
(optional!) implementation detail.  They let services deliver direct
access URLs to legacy clients while still enabling smart clients to
use Datalink.  I'd expect that won't be a big use case with the
large-ish cubes that you want to hide from legacy clients in the
first place, so it's just a confounder at this point.

The worse mess with a SODA descriptor directly embedded in the
DAL/TAP response is even more optional and, as I said, probably even
less useful with cubes.  Spectra and time series, yes, they can
profit from this; cubes, no, I think not.  And we should make clear
that you therefore won't have that descriptor in your average cube
or ObsTAP service.  Because, and I won't tire of pointing this out,
it wouldn't work.

> These thoughts lead me to support the James's proposal: Adding to
> SODA an endpoint returning the meta-data of a given dataset.

As I said, the *dataset* metadata is not enough, so that alone
wouldn't help -- you'd still need a complicated scheme to explain to
a machine how to turn dataset metadata and service properties given
in some as-yet undefined way into the parameter metadata.  

However, if you're simply replacing "dataset metadata" with
"parameter metadata", we suddenly have perfect consensus: The
endpoint you're talking about is the datalink service, which contains
exactly the parameter metadata relevant to the dataset in question
(plus a bit more, if you so wish, but that's up to you).

> That makes SODA working in any context as a self-described service.

Exactly.  Which is what I've been talking about all the time.  And
the best thing is: The relevant standards (VOTable and Datalink) are
already there!  We're done!  Yay!

> One can object that this requires a 2 steps query but that is not a matter.

Right.  For cubes it certainly doesn't.  Let's keep forgetting the
silly shortcut with the embedded SODA descriptor.

> It look natural to first know about a dataset before to run some processing
> on it. That is reachable for client not that smart.

Absolutely.

I'll try to write some prose, in a branch of the document, that lays
out that reasoning and in particular makes clear that the in-DAL SODA
descriptor is a shortcut that's probably not terribly useful for
cubes.  I'd put that in as the next gripe after the xtype one --
any more opinions on that, by the way?

Cheers,

      Markus