SODA gripes (1): The Big One

James.Dempsey at csiro.au James.Dempsey at csiro.au
Tue Jan 19 11:35:19 CET 2016


Hi Markus,

I have a suggestion, driven by the idea that SODA params may include some not represented in the obscore data model.

Could a ranges endpoint be added to the SODA interface (perhaps within an {async} resource) which could provide the valid ranges for each param for the IDs provided so far?

This would likely work better if the service provides the optional UWS ability to add parameters after the initial POST call to construct an async job. 

An alternative would be a separate ranges resource which takes one or more IDs and returns the parameter ranges.

The advantage I see with this is that it doesn't push knowledge of how SODA works with specific data products outside the SODA service itself, but rather encapsulates that knowledge in the service.
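To make the idea concrete, here is a rough client-side sketch. Note that everything in it is hypothetical: neither the {ranges} endpoint, nor its response format, nor the parameter names exist in any current draft; they are invented purely to illustrate the flow.

```python
# Hypothetical flow for the proposed per-job ranges endpoint:
#
#   POST {async}                      -> create a UWS job
#   POST {async}/{jobid}/parameters   -> add ID=... (optional UWS feature)
#   GET  {async}/{jobid}/ranges       -> valid ranges per param for those IDs
#
# A helper that parses such a (made-up) ranges response into a dict a
# client could use to build sliders or validate input:

import xml.etree.ElementTree as ET

RANGES_RESPONSE = """\
<ranges>
  <param name="BAND">
    <min>3.5e-7</min><max>7.2e-7</max>
  </param>
  <param name="TIME">
    <min>55000.0</min><max>55010.5</max>
  </param>
</ranges>
"""

def parse_ranges(xml_text):
    """Return {param_name: (min, max)} from the hypothetical response."""
    root = ET.fromstring(xml_text)
    out = {}
    for p in root.findall("param"):
        out[p.get("name")] = (float(p.findtext("min")),
                              float(p.findtext("max")))
    return out

ranges = parse_ranges(RANGES_RESPONSE)
print(ranges["BAND"])  # the interval valid for the IDs posted so far
```

The point of the sketch is only that the *service* computes the ranges; the client never needs to know which data model, if any, they came from.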

Cheers,
James Dempsey

________________________________________
From: dal-bounces at ivoa.net [dal-bounces at ivoa.net] on behalf of Markus Demleitner [msdemlei at ari.uni-heidelberg.de]
Sent: Tuesday, 19 January 2016 9:09 PM
To: dal at ivoa.net
Subject: Re: SODA gripes (1): The Big One

Dear Colleagues,

Given this has been a discussion exclusively between the current
authors so far, I'd propose to delay some definite decision on "The
Big One" until a few more people had a chance to wrap their heads
around how SODA is intended to work.

Let me nevertheless respond to some of the new points that Pat and
François have made -- there's a TL;DR at the foot of this mail.

On Fri, Jan 15, 2016 at 08:48:54AM -0800, Patrick Dowler wrote:
> I would like to add (remind) that the evolution plan includes a {metadata}
> capability that we nominally said would be part of SIA-2.1 but since it is
> another capability it could be defined there or in a new spec or in another
> spec (eg SODA-1.1). The {metadata} capability is intended to allow clients
> to get the necessary metadata for a single dataset (ID=...) so they can
> figure out how to call the SODA service and take advantage of all the
> features offered.

Having this additional endpoint would indeed solve some of the
problems I see with the current draft.  However, I can only see
disadvantages wrt simply giving proper parameter metadata:

* much more complicated (e.g., linking params with pieces of
  metadata, parsing and representing the metadata...)
* requires an extra request per dataset
* doesn't help with parameters not covered by the data model in
  question
* [the big one here]: Only works if there is an appropriate data
  model in the first place.  Experience in the VO tells me that that
  is a very big if.

And I cannot see a single advantage over proper parameter metadata
generation, except perhaps:

> Now, that general usage pattern (make a remote call to get metadata) is
> nice and clean but it isn't necessarily optimal if you want to process many

We-eell, I could claim that proper definition of the parameters in an
RPC is nice and clean, too (and not doing it is mean and dirty), so
I'm not sure I'd count that as an advantage of your scheme.

> things the same way. I can understand Markus' idea to define domain
> metadata inline in a SODA service descriptor but it looks a lot like an
> optimisation to me. I'm not saying it isn't useful/necessary to optimise,
> but I do not think we should try to do that without having tackled the
> general problem.

Again, rev 3192 is not (really) trying to define the dataset itself.  I
am convinced the latter is a very hard problem, and one we won't
solve in full for a long time to come.

It is about defining *parameter metadata*, which has some
relationship to dataset metadata in general, but that relationship is
neither trivial nor easily expressible.

But even if we had some way to define that relationship: relying on
a full description of datasets would mean SODA wouldn't work outside
of a small niche for a long, long time.

So even if you think proper parameter metadata is a (premature?)
optimisation (I don't), I claim it's unavoidable, and it's certainly
"good enough".  Plus, I've still not heard an actual argument against
it that's actually rooted in technology (rather than philosophy):
What becomes more difficult, less robust, less desirable with domain
declarations on the parameters compared to a solution where you get
the metadata from somewhere else and then do some magic inference of
the domain?
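For concreteness, here is roughly what such domain declarations on the parameters could look like, using the standard VOTable VALUES/MIN/MAX machinery on the descriptor's input PARAMs (the layout follows the DataLink-style service descriptor pattern; the concrete parameter, unit, and limits are invented for illustration):

```xml
<RESOURCE type="meta" utype="adhoc:service">
  <PARAM name="accessURL" datatype="char" arraysize="*"
         value="http://example.org/soda/sync"/>
  <GROUP name="inputParams">
    <PARAM name="ID" datatype="char" arraysize="*" value="" ref="pubdid"/>
    <PARAM name="BAND" datatype="double" arraysize="2" unit="m"
           ucd="em.wl" value="">
      <!-- the service declares, per dataset, the interval it will accept -->
      <VALUES>
        <MIN value="3.5e-7"/>
        <MAX value="7.2e-7"/>
      </VALUES>
    </PARAM>
  </GROUP>
</RESOURCE>
```

A 1.0 client can build its cutout UI (sliders, range validation) from this alone -- no data model, no extra round trip, no inference.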


While I'm writing, and to avoid another deluge of mails, I'll briefly
comment on François' mails:

Let me start with the question of using PARAM/@ref to link params and
metadata items:

On Fri, Jan 15, 2016 at 05:42:40PM +0100, François Bonnarel wrote:
>      - However, the feature you point out in DataLink is not yet used by
> current version of protocols except for ID, which is fully consistent with
> the solution I have drawn. So we could imagine modifying the DataLink
> text slightly in the next version, if we admit the "ref" mechanism.

That is not true -- clients are expected to collect the values of all
PARAMs with @ref.  I didn't like that requirement myself, and I'm not
sure client authors have picked it up, but it's there, and changing
it after 1.0 would IMHO need a very strong case.
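To spell out that rule: a client must collect the values of *all* PARAMs carrying @ref from the selected result row, not just ID. A toy implementation could look like this (deliberately simplified: namespace-free pseudo-VOTable, invented field and parameter names; a real client needs proper VOTable parsing and typing):

```python
# Resolve PARAM/@ref against the FIELDs of the results table: PARAMs
# with @ref take their value from the selected row, the rest keep @value.

import xml.etree.ElementTree as ET

DOC = """\
<VOTABLE>
  <RESOURCE type="results">
    <TABLE>
      <FIELD ID="pubdid" name="obs_publisher_did" datatype="char"/>
      <FIELD ID="cal" name="calib_level" datatype="int"/>
    </TABLE>
  </RESOURCE>
  <RESOURCE type="meta" utype="adhoc:service">
    <GROUP name="inputParams">
      <PARAM name="ID" datatype="char" arraysize="*" value="" ref="pubdid"/>
      <PARAM name="LEVEL" datatype="int" value="" ref="cal"/>
      <PARAM name="FORMAT" datatype="char" arraysize="*" value="fits"/>
    </GROUP>
  </RESOURCE>
</VOTABLE>
"""

def collect_params(doc, row):
    """Build the service-call arguments for one selected row.

    row maps FIELD/@ID -> the value of that field in the chosen row.
    """
    args = {}
    for param in ET.fromstring(doc).iter("PARAM"):
        ref = param.get("ref")
        args[param.get("name")] = row[ref] if ref else param.get("value")
    return args

print(collect_params(DOC, {"pubdid": "ivo://example/ds1", "cal": 2}))
```

So ID and LEVEL come from the row, FORMAT from the declared value -- which is why changing the @ref semantics after 1.0 would silently break any client already doing this.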

Which this is not, for instance, because it still doesn't solve
parameter metadata for parameters not in the dataset metadata (e.g.,
picking FITS extensions, rebinning, ...). In general:

> What I really want to avoid is having the dataset limits with  the OBscore
> structure in one context and the same dataset limits with the <MIN><MAX>
> structure and absolutely no linkage between these two ways of providing the
> same concept. And what will happen if people would occur to provide

Again, they *do not* provide the same concept.  One is the dataset's
properties, the other the properties of the pair (service, dataset).
There's any number of things that can happen to the dataset
properties, even plain ones like a wavelength range -- perhaps I
won't let you cut out near the ends of my spectrum?  Perhaps there's
additional pixels for calibration that I can show you in a datalink
parameter?  And again, there's a wealth of parameters not even
represented in dataset metadata.

These are two different things.  Conceptually.  Therefore, there is
no repetition, and trying to make the different things look the same
because there are a few cases in which they *seem* the same is
going to make the protocol cumbersome, complicated and inflexible --
something rooted in a faulty theory will in general be painful.

Then, on whether proper parameter metadata is required in version
1.0:

On Fri, Jan 15, 2016 at 05:07:12PM +0100, François Bonnarel wrote:
> If after discussion and implementation people want the <MIN><MAX> (or the
> alternative solution) it would be possible to add them without discarding
> old services, which will only MISS something useful (and the same for a

EXACTLY my point: This is not about services, this is about clients.
Clients written against the editor draft won't be able to do anything
useful with the services that would let them.  I know I'm
sounding like a broken record, but we simply MUST design our
standards much more from the client perspective; client uptake is
what makes or breaks standards, what makes or breaks the VO.

So, we have to make our design such that 1.0 clients will be able to
usefully work with all 1.x services.  As they should.

Then, when I was talking about retrieving SODA descriptors from
datalink documents:

On Fri, Jan 15, 2016 at 04:56:45PM +0100, François Bonnarel wrote:
> I don't understand this. From the DataLink and SIAV2 specs there are really
> two different ways you can be driven from discovery response to SODA
> interface.
>      One is the one you describe and which CADC is indeed using. The acref
> field in the Obscore table contains the URL of the {link} table. In that case
> the format field is marked as "DataLink". But it's not "typical". It is just
> one of the two ways.

...and both need to work.  Which means you need to be able to derive
parameter domains from the datalink document.  If you grant that, the
question is: do we want to invent a way to embed dataset metadata
into datalink or perhaps wait until {metadata} comes around and then
*still* hope someone finds a reliable way to do that derivation (as I
said, I'm convinced there is none)?  Or do we just do the simple
thing, which is provide useful parameter metadata in the first place?

Incidentally, as the guy that did the design, I still feel entitled to
say the DAL-attached descriptor was designed for a few special
applications, and the general case is a per-dataset descriptor.  But
ok, that's a personal feeling, and has no impact on whether or not
per-dataset needs to work.

Then, on us trying to understand each other's confusion:

On Thu, Jan 14, 2016 at 10:17:10AM +0100, François Bonnarel wrote:
> On the other side I think it would be an error to put this domain metadata
> in the {link} resource response. (what you call the "Datalink document"). It
> will require several "SODA service descriptor" sections if we have several
> datasets and could be much more complex if we add other kinds of services
> (future standards or custom services). It could even become a mess if we have
> several services on several datasets

I've said before that I was skeptical about allowing several datasets
per datalink document from the start, and since my XSLT-over-datalink
experiments I'm now convinced we shouldn't have done it, but be that
as it may, yes, you will have several descriptors per response
document.  I see no problem with this.

Conceptually, the tuple (SODA service, dataset) is quite similar to
the tuple (SSA standard, data collection): Since a SODA service's
parameters can change with the data set (e.g., POL might be supported
only for a few datasets served through a given service) much like an
SSA service will have different parameters depending on what spectra
are in there, these *are* different services, and you're doing the
clients a big favour if you don't try to hide this.

Try drawing up the logic a client would have to go through if you
were to make your worst-case scenario (multiple services per dataset,
multiple datasets per service) implicit (leaving aside the question of
just how that would look).  Uh...


 _____ _       ____  ____
|_   _| |    _|  _ \|  _ \
  | | | |   (_) | | | |_) |
  | | | |___ _| |_| |  _ <
  |_| |_____( )____/|_| \_\
            |/

The way I see things, I claim the editor draft cannot work for many
important use cases because it relies on some implicit relationship
between service parameters and dataset properties, and there's no
realistic hope to make this relationship, or even the dataset
properties themselves, explicit in an interoperable fashion in the
next couple of years.  Hence, we should simply do the straightforward
and easy thing: proper parameter metadata generation.

My colleagues believe some of these unfulfilled use cases are not
important or not within our remit, and anyway the relationship between
dataset and parameter metadata is either trivial or will at least be
interoperably expressible in the near future.

Since I don't see how to reach a compromise here, I propose to
revisit the question later, when there's a wider community
understanding of the issues involved, and perhaps someone else has
started a SODA client, too.  And meanwhile to turn to some other
warts of the current draft, which is what I'll do tomorrow.  Ok?

   -- Markus

PS: Hints on how to engage the wider community, in particular from
that wider community, are welcome.

