SODA gripes (1): The Big One

Tue Jan 5 08:51:06 CET 2016

Dear Colleagues,

On Thu, Dec 24, 2015 at 03:40:21PM +0100, François Bonnarel wrote:
>      This close-to-christmas email to announce that SODA1.0 (previously
> known as AccessData1.0) WD has been released last monday. See :
> http://www.ivoa.net/documents/SODA/20151221/index.html
>       There has been very long discussions among authors and we made some
> progress in convergence. However there is still points hardly debated. This
> is my responsability of editor as well as DAL chair to provide now a version
> which is regarded as insufficient according to some of us but is nonetheless
> fulfilling the CSP and community basic requirements according to me.
> Probably the discussion will start very soon on the DAL mailing list.

Indeed -- there are of order 10 topics on which I'd like to see
discussion on this draft (there's a list of them at the end of this
mail).  

I'd like to start with the Big One (this is probably also going to be
the longest mail in this series; please indulge me).  In one
sentence, it's

  The protocol must be written such that clients can work out what
  parameter values will probably yield useful results.

This, in my opinion, is really the make-or-break thing, i.e., what
decides whether what we write will actually be useful as a generic
access protocol, or whether it will be a source of constant annoyance
all around[1].

So -- even if you have only marginal interest in SODA, and even if this
is a long mail, please take a few 10s of minutes to try and make up your
mind based on the two drafts mentioned below.  You'd have my blessing to
ignore the remaining SODA discussions if you are so inclined.

The premise above applied to SODA becomes: All parameters (except for
the oddball POS, which really has a special position; but I'll revisit
that in a later mail) must be fully declared by the service (including
VALUES and OPTION elements as appropriate) and be systematically
discovered for UI/API generation by the client in a SODA exchange.
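
To make this concrete, here is a minimal sketch of what "systematically discovering" declared parameters could look like on the client side. The descriptor fragment is hypothetical and heavily abbreviated (no VOTable namespace, made-up limits, and FORMAT is an invented custom parameter); only the PARAM/VALUES/MIN/MAX/OPTION element names are taken from VOTable. The function name parameter_domains is likewise just an illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated service descriptor; all values are made up.
DESCRIPTOR = """
<RESOURCE type="meta" utype="adhoc:service">
  <GROUP name="inputParams">
    <PARAM name="ID" datatype="char" arraysize="*" value="ivo://example/cube1"/>
    <PARAM name="BAND" datatype="double" arraysize="2" unit="m" value="">
      <VALUES>
        <MIN value="4.0e-7"/>
        <MAX value="7.0e-7"/>
      </VALUES>
    </PARAM>
    <PARAM name="FORMAT" datatype="char" arraysize="*" value="">
      <VALUES>
        <OPTION value="image/fits"/>
        <OPTION value="image/png"/>
      </VALUES>
    </PARAM>
  </GROUP>
</RESOURCE>
"""


def parameter_domains(descriptor_xml):
    """Map each parameter that declares a VALUES child to its domain."""
    root = ET.fromstring(descriptor_xml)
    domains = {}
    for param in root.iter("PARAM"):
        values = param.find("VALUES")
        if values is None:
            # Nothing declared: the client has no idea what to offer here.
            continue
        domain = {}
        min_el, max_el = values.find("MIN"), values.find("MAX")
        if min_el is not None and max_el is not None:
            domain["range"] = (float(min_el.get("value")),
                               float(max_el.get("value")))
        options = [opt.get("value") for opt in values.findall("OPTION")]
        if options:
            domain["options"] = options
        domains[param.get("name")] = domain
    return domains


domains = parameter_domains(DESCRIPTOR)
```

With such a mapping in hand, a client can pre-fill a wavelength interval for BAND and offer a choice list for FORMAT instead of presenting empty input fields.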

I've written standards prose for that already that I think is about what
a standards document can do to mandate such practices (of course, this
is largely a matter of implementation style, which is hard to regulate).
It's been in the text in volute rev. 3192 [2]; for your convenience, I
have built the document as of that revision and put it on
http://docs.g-vo.org/SODA-r3192.pdf.  The contentious prose starts at
page 8 -- if you'd be so kind as to read sect. 2.6 ("three-factor
semantics", 4 pages).

You can compare with sect. 2.6 as published (the published version is
in effect volute rev. 3200, in case you'd like to see a diff).  Let me
again bambi-eye all around and ask everyone with even a remote
involvement in the cube thing to try and make up their minds and speak
up, even if this thing appears a bit complicated at first, in particular
because, in a way, it's really part of datalink and cannot be understood
without it (I've argued it should really have been part of datalink in
the first place).

If there's anything we can do to help comprehensibility, let us know,
too.

Meanwhile, allow me to once more try to argue why it is so important to
urge services to provide consistent, dataset-specific metadata and the
clients to use it in SODA.

SODA is designed to operate on concrete datasets -- you've discovered
something that looks like it might be interesting, but you're only
interested in a small part or a particular mogrification of the dataset,
so your client gets information on the dataset and then figures out what
to do to retrieve the information relevant to you.  This means that you
cannot just put some value into a service parameter and watch what's
coming out -- you'll almost always get nothing back because the coverage
of a typical dataset is small and not easily predictable.

The "horror vacui", the dreaded moment in GUIs when an input field is
displayed and users have no idea what to put there, with SODA therefore
isn't a minor usability issue, it's a protocol killer.

It has been put forward that clients could infer the domains of the
parameters (the "good" values) from a previous discovery query (e.g.,
from SIAv2, they'd know the spatial and spectral coverage).
Unfortunately, this line of reasoning is flawed in at least two respects:

(1) The results of the discovery query might not be available to the
client dealing with the SODA descriptor.

(2) This technique breaks down with the first custom parameter (is the
corresponding item in the discovered metadata?  And what does the
parameter correspond to in the first place?), and that would, again, be
a killer for SODA's usefulness.

Let me dwell on both points for a little while.

Ad (1).   I expect the most common source for SODA descriptors will be
Obscore (and it's a CSP-official usecase in case you don't agree).
There, the access URL for cubes and other large datasets won't be the
dataset itself, because you don't want people blindly pulling several
100s of gigabytes (or just one gigabyte, really).  Instead, you return a
datalink document, which contains the SODA descriptor.  We at Heidelberg
already occasionally do that; the CADC has datalink documents throughout,
IIRC (although I think they don't have custom SODA descriptors yet).

To query Obscore, people typically use TAP, and their queries will
quite often not be just "select * from" but very possibly rather
something like "select access_url, target_name from ivoa.obscore
join...."  Hence, a client doesn't have access to the obscore metadata,
and even if it had, it might have a hard time recognising it in the
possibly wide result tuples coming back from the database.
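
As a small illustration of the point, assuming the client would want the ObsCore coverage columns s_ra, s_dec, s_fov, em_min and em_max to guess sensible cutout limits (that selection of columns is my assumption, as is the helper name missing_coverage), the following sketch checks which of them a result actually carries:

```python
# ObsCore columns a client might want for guessing cutout limits
# (which ones are really needed is an assumption made for this sketch).
COVERAGE_COLUMNS = {"s_ra", "s_dec", "s_fov", "em_min", "em_max"}


def missing_coverage(result_columns):
    """Return the coverage columns that are absent from a TAP result."""
    return sorted(COVERAGE_COLUMNS - set(result_columns))


# Columns returned by "select access_url, target_name from ivoa.obscore ..."
missing = missing_coverage(["access_url", "target_name"])
```

For the query quoted above, every coverage column is missing, so the datalink document behind access_url is the only place the client can learn the limits from.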

Another scenario in which dataset metadata possibly obtained during
discovery would get lost is when sending the datalink document (URL)
through SAMP.  Whether we like it or not, our users love SAMP more than
anything else we've come up with so far, and telling them SODA doesn't
play with SAMP isn't going to make SODA popular.

Ad (2).  The dataset operations that data providers will want to enable
through SODA are essentially endless -- rebinning, renormalisation,
format conversion, "logical" cutouts (e.g., on selected extensions
only), etc.  Making SODA something that (to some extent) works with a
select set of standard parameters but fails (in the sense of: client
behaviour is unpredictable) as soon as a service needs a bit more is
going to render it almost useless, and data providers will keep doing
things through custom web pages.  It's the situation we have with SSAP;
although that, as a discovery protocol, at least can limp along to some
extent.  SODA, as an access protocol, wouldn't even limp.
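
A client that builds its interface purely from what the service declares treats custom parameters exactly like standard ones. The following is a minimal sketch of that idea; the dictionary layout, the function name build_form, and the parameters REBIN and COMMENT are hypothetical and not taken from any draft.

```python
def build_form(declared_params):
    """Return one input-field description per declared parameter.

    No parameter is skipped: unknown names simply get a widget chosen
    from their declared domain (or a plain text field as a fallback).
    """
    fields = []
    for param in declared_params:
        if "options" in param:
            widget, hint = "choice", list(param["options"])
        elif "range" in param:
            widget, hint = "interval", tuple(param["range"])
        else:
            widget, hint = "text", None
        fields.append({
            "name": param["name"],
            "widget": widget,
            "hint": hint,
            "unit": param.get("unit"),
        })
    return fields


# Hypothetical declarations as a client might hold them after parsing
# a descriptor; REBIN and COMMENT stand in for custom parameters.
PARAMS = [
    {"name": "BAND", "unit": "m", "range": (4.0e-7, 7.0e-7)},
    {"name": "REBIN", "options": ["1", "2", "4"]},
    {"name": "COMMENT"},
]

form = build_form(PARAMS)
```

The point of the design is that nothing in build_form knows about BAND specifically, so a service adding REBIN needs no cooperation from client authors.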

So, we need to say: "A well-behaved SODA client will do X and Y and
*not* ignore Z" to give data providers the confidence that, independent
of the client their users choose, they still see whatever operations the
providers consider important.  That's what I've tried in rev. 3192
section 2.6.

As an additional indication that full metadata in the SODA descriptor is
a very good idea, let me mention in passing that 

(3) it would enable usable interfaces in stop-gap XSLT-based datalink
interfaces (as discussed in Sydney,
http://wiki.ivoa.net/internal/IVOA/InteropOct2015DAL/datalink-xslt.pdf).

Just so nobody can say later I didn't warn them: Yes, this means that
the datalink document that contains the SODA descriptor has to be
tailored for each dataset.  But that's really not a big deal, because
the datalink documents themselves vary with dataset (well, typically) --
previews, plots, provenance, whatever all depend on the dataset.
Dropping the limits into the SODA descriptor in addition hasn't, at
least for me, been a major additional implementation burden.

That's it for my first SODA gripe, and thanks for making it here.  I
plan to have, roughly weekly, additional SODA gripes, one after the
other to allow productive discussions on each point.  To give you an
idea of what I have up my sleeve, here's a tentative programme:

(2) Spatial coverage discovery and the RA and DEC parameters
(3) Pixel cutouts: PIXEL_n
(4) Mandated multiplicities considered harmful
(5) Behaviour for no-ID queries?  For queries with only ID?
(6) No gratuitous xtypes
(7) POS doesn't have an xtype
(8) Examples stuff: example example, and perhaps a dl-id term?

If this sounds scary, don't worry -- this kind of thing has IMHO worked
great for datalink.

Cheers,

            Markus

[1] Incidentally, it also coincides with my conviction that in protocol
development in the VO, we should be thinking much more than in the past
from the client perspective, even if most of the protocol developers sit
on the server side.

[2] To get the source from the repository, use something like

svn co -r 3192 https://volute.g-vo.org/svn/trunk/projects/dal/SODA