SODA gripes (1): The Big One

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Tue Jan 12 15:25:12 CET 2016


Dear DAL,

I'll try to reply to several of last week's contributions at
once; I think the threads are close enough to merit that, although it
means that this mail, again, is a bit on the long side.  Having it
all in one narrative perhaps saves you time, and it's again
essentially all about one question: Domain metadata or no domain metadata?

First,

On Tue, Jan 12, 2016 at 06:57:12AM +0000, James.Dempsey at csiro.au wrote:
> Parameter ranges are really useful, and one of our early testing tools
> was a page which has RA/Dec entry fields that default to the centre of
> the image cube to be processed. However to me aggregate ranges seem a
> lot less useful, e.g. a range covering three cubes with narrow
> spectral ranges that are widely spaced from each other will leave
> plenty of room for empty result sets. The reference values for a data
> product are in the ObsTAP/SIA2 response and I'd not like to
> duplicate them elsewhere. Thus I'm in favour of the current draft
> text over Markus' suggestion.
> 
> Note: This is based on the assumption that a client app would have to
> be ObsTAP/SIA2 aware to use SODA.

Right -- and as I pointed out, even ObsTAP doesn't necessarily help you
because there's no guarantee that the evaluating application has access
to all Obscore columns.

So, I keep maintaining it would be an error to restrict SODA to a "full
metadata known" scenario, in particular because I expect it will not be
unusual that the link between parameters and the relevant pieces of
metadata is not known to the client (and as I said, custom parameters
will be all over the place, as they are for SSA today.  Only more so).

As to duplication -- note that even in the SIAv2 case with full metadata
availability, there are two use cases.

(1) the user selects a single dataset.  In that case, a model-aware
client would need to fill parameters in the DAL-embedded service
descriptor from dataset metadata as well as it can (i.e., for those
that it really knows).

I'd maintain that's not a good practice, as that is error-prone, and the
client should rather retrieve a datalink document.  The datalink
descriptors embedded into DAL responses aren't really suited for
single-dataset access, exactly because the client has a hard time
figuring out what custom parameters correspond to which pieces of
metadata, if there's such a correspondence in the first place.

(2) the user wants to do multiple cutouts.  This is where the aggregate
limits become important.  If you want, you can already try this with
recent versions of splat (even if the UI to SODA on published versions
admittedly is ugly) -- on SODA-enabled services, you can, for all
spectra, say you'd like a certain spectral region and a special format
(due to a bug in the published versions, you'll have to use that latter
feature to request FITS results if you check it out).  With that, you
can retrieve *multiple* spectra processed in the same way.  The ranges
(which published versions of splat show when mousing over the input
fields) in these cases again have to come from the service, as again the
relationship between result columns and parameters is hard to declare.  

Even if some of the results will be empty because of the original
dataset's coverage, this possibility to process multiple datasets in the
same way is eminently useful, e.g., if you only want to retrieve the
immediate vicinity of H alpha (or whatever) -- and that is what the
in-DAL service definitions were really intended for in my early
proposals.  But it's something completely different from exploring,
slicing and dicing an individual dataset.  In particular, it
presupposes a fairly intimate knowledge of the data collection you're
working on.
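
To illustrate the difference between the aggregate range and the
coverage of individual datasets, here is a minimal Python sketch.  The
dataset names and spectral limits are made up for the example; only the
idea (one request applied to many datasets, some of which will come back
empty) is taken from the discussion above.

```python
def aggregate_range(ranges):
    """Return the (min, max) envelope of a list of (lo, hi) intervals."""
    return min(lo for lo, _ in ranges), max(hi for _, hi in ranges)


def overlaps(request, coverage):
    """True if two closed intervals share at least one point."""
    return request[0] <= coverage[1] and coverage[0] <= request[1]


# Hypothetical spectral coverage (wavelengths in metres) of three spectra.
coverages = {
    "spec1": (6.50e-7, 6.60e-7),
    "spec2": (6.55e-7, 6.70e-7),
    "spec3": (4.80e-7, 4.90e-7),
}

# One request, applied to all datasets: the vicinity of H alpha.
h_alpha_window = (6.55e-7, 6.58e-7)

# What a service could advertise for mass processing.
envelope = aggregate_range(list(coverages.values()))

# Datasets for which the cutout would actually contain data.
non_empty = sorted(
    name for name, cov in coverages.items() if overlaps(h_alpha_window, cov)
)
```

Here the request lies within the aggregate envelope, yet spec3 would
yield an empty result; that is expected in mass processing and is a
different situation from exploring a single dataset.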

So, I think we should keep domain definitions even in in-DAL service
descriptors (but it might be wise to add prose explaining what they're
intended for: they're shortcuts to mass processing).

In the datalink-embedded service descriptors, I still think there's no
actual alternative.

> Perhaps table 2 could be expanded to list the ObsCore fields that
> define the range for the parameter, or those could be included in the
> parameter's subsection?

Again, that only helps if we restrict SODA to operating when there's
an ObsCore definition present and only on concepts present in ObsCore.
I'd claim that's unnecessary: if we're explicit about the domains, it's
actually much easier for the client (because otherwise it has to gather
together limits from wherever some metadata may be located) and not
noticeably harder for the server.
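
As a rough sketch of what "explicit about the domains" could mean for a
client, the following Python reads parameter limits straight from a
service descriptor.  It assumes the domain is declared with VOTable's
VALUES/MIN/MAX children of PARAM; the descriptor text, the dataset
identifier, and the limits are invented for the example, and XML
namespaces are left out to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptor; a real one would carry the VOTable namespace
# and more parameters.
DESCRIPTOR = """
<RESOURCE type="meta" utype="adhoc:service">
  <GROUP name="inputParams">
    <PARAM name="ID" datatype="char" arraysize="*"
           value="ivo://example/ds1"/>
    <PARAM name="BAND" datatype="double" arraysize="2" unit="m"
           ucd="em.wl" value="">
      <VALUES>
        <MIN value="4.0e-7"/>
        <MAX value="7.0e-7"/>
      </VALUES>
    </PARAM>
  </GROUP>
</RESOURCE>
"""


def param_domains(xml_text):
    """Map parameter names to (min, max) for PARAMs declaring both limits."""
    root = ET.fromstring(xml_text)
    domains = {}
    for param in root.iter("PARAM"):
        values = param.find("VALUES")
        if values is None:
            continue  # no domain declared for this parameter
        lo, hi = values.find("MIN"), values.find("MAX")
        if lo is not None and hi is not None:
            domains[param.get("name")] = (
                float(lo.get("value")),
                float(hi.get("value")),
            )
    return domains


domains = param_domains(DESCRIPTOR)
```

The client needs no knowledge of ObsCore or of any mapping between
result columns and parameters here; whatever parameter carries a domain,
standard or custom, can be presented to the user with its limits.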

Because it fits here, let me drop in my PARTISAN CONCLUSION here
already: I see a choice between a very specialised protocol that's hard
to use and a general protocol that's easy to use, all hinging on the
proper declaration of metadata, in particular the domains.  

Of course, I may be missing some reason not to like proper domain
metadata -- if so, someone get a cluestick.

> One related observation - in sections 2.6.1 and 3.2.2, BAND has a
> UCD of 'em'. Should this instead be 'em.wl' to provide an

That's already fixed in SVN (rev. 3203) -- I just didn't get around to
repairing it before the Dec 24 release.

> exact match with the ObsCore em_min and em_max fields and be clear
> that it is a wavelength? This will help client apps to make the link
> and will guide users such as radio astronomers who work more often in
> frequency terms.

...where of course clients should allow users to use their
domain-specific units, so hopefully this won't be that much of an issue.
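
For instance, a client could accept a frequency interval from a radio
astronomer and convert it to the wavelength interval that BAND expects.
A minimal Python sketch follows; the function name is made up for the
example, and it assumes BAND is given as vacuum wavelength in metres.

```python
C = 299_792_458.0  # speed of light in m/s (exact by definition)


def freq_range_to_wavelength(f_lo_hz, f_hi_hz):
    """Convert a frequency interval in Hz to a wavelength interval in m.

    The bounds swap: the highest frequency is the shortest wavelength.
    """
    if not 0 < f_lo_hz <= f_hi_hz:
        raise ValueError("need 0 < f_lo_hz <= f_hi_hz")
    return C / f_hi_hz, C / f_lo_hz


# A window around the 21 cm line, entered in the user's preferred unit.
band_lo, band_hi = freq_range_to_wavelength(1.40e9, 1.44e9)
```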

Then, on to Mark's mail:

On Fri, Jan 08, 2016 at 10:39:58PM +0000, Mark Taylor wrote:
> Sec 1:
>    Most of the use cases in sec 1 are labelled "will be developed
>    and supported in [a later SODA version]".  Does this mean that
>    this version of SODA is only targetted at simple (POS/BAND/TIME/POL)
>    cutouts?  That's fine if so, but it would be helpful to note that

Hm, ah well, I'd claim it's not fine if so, because that'd lead client
development into a harmful direction where they ignore the service
descriptors and just run based on Obscore results.  Which would put SODA
to where SSA is today: barely working for the simple cases, a matter of
finger-crossing everywhere else.

It's (almost; I'm not a big fan of announcements in standards) fine to
say "standard parameters to do these other things will be defined
later", but I'm sure we can write the standard now in a way that clients
written to 1.0 will work fine with more capable services, possibly
adhering to later standards -- essentially by three-factor-semantics and
proper metadata generation and usage.

> Finally (at least for now), it's not obvious to me from this document
> how to actually use a SODA service.  Possibly that's because I'm
> not familiar enough with Datalink or other associated standards,
> but I may not be the only one...  Presumably (in view of the

I agree this needs better explanation -- have you had a look at my rev.
3192 build at http://docs.g-vo.org/SODA-r3192.pdf, section 2.6?  I make
an effort to explain the information flow there, as that is really
important to understand why the protocol really hemorrhages usefulness
when we don't mandate parameter domain definitions.

On to François' mails.

On Tue, Jan 05, 2016 at 06:47:33PM +0100, François Bonnarel wrote:
>       a ) It is true that the main point of discussion is about the
> descriptions of the PARAMETER domains mainly when it is not directly
> available in the client (for example via the metadata provided by the
> discovery phase). And also that in the case of custom parameters (as well as
> it would be for custom services parameters) there is nothing that could be
> discoverable.
>       b )  My point is that it is possible to postpone the solution of that
> use case FOR NOW for three reasons:
>             1 ) The current draft allows to fulfill  the basic requirements
> of the CSP  in 95% of the cases. We can wait next version of
> ObstAP/DALI/SIAV2 and SODA to solve the remaining 5%. This is the point I

The 95% are conjecture, and I dispute them.  On my end, 100% of the cases
require full domain definition (spectral cutouts from splat, and
XSLT-processed datalink in the browser).  Which future SODA clients will
have what discovery metadata available nobody can start to predict.  But
I think it's easy to agree that either way they'll have a much easier
life if we're explicit from the start, as they won't have to have
complicated metadata mapping schemes just to discover what the service
can simply tell them from the start.

> features. This includes proposing a concurrent technology for describing the
> domains  as we have allready the description in the Obscore table. This also

But it doesn't, and there may be no relation of parameters to the
obscore items in the first place, in particular not for custom or future
parameters.  Even if there were, there is no way to declare that right
now, and inventing one is much more complicated than doing the right
thing (proper metadata declaration) from the start.

So, there is no duplication of information in reality.

And again (just to be on the safe side, although François stressed so
himself): This doesn't help *at all* in the, IMHO typical, use case
where a client looks at a datalink document.  The current WD simply
completely breaks that use-case, and I'd argue needlessly.

>             3 ) the current draft is totally open on future evolution on
> this point. It may be consistent with the solution proposed by Markus and

Unfortunately, it's not.  Once the first clients are out and it becomes
clear that they're not useful for what people want to do with their
services, they'll keep developing web interfaces, and SODA will go the
way of <insert your favourite non-taken-up IVOA standard>, and people
will screen scrape and type into web forms for the next five years at
least.

> B ) This is now a reminder of the CSP priorities. Remember Data discovery is
> done via ObsTAP 1.0 (1.1 soon) or SIAV2.0. Both are IVOA recommendations
> now. DataAccess and cutout is done via acref field in query response (full
> download) or SODA service. SODA service is referred from the Discovery

No, as I pointed  out above, the typical way should be to first retrieve
a datalink document for a discovered dataset, and work on this -- and
actually, this is what the CADC does in its obscore service throughout
already, and everything else (i.e., working from a service block) is
shortcuts for special situations (e.g., mass cutout of the vicinity of a
spectral line).

> C ) With the current recommendations and the  SODA WD as it has been
> proposed by the WD editor what can be implemented by data services. How IVOA
> applications ( service clients) can manage with that and serve the end-user
> needs ?
> 
>     a ) You MUST build a SIAV2.0 service or an ObsTaP service dedicated to
> your data cubes. Or both.
>     b ) You MAY build a DataLink service providing resources attached to the
> data cubes
>     c ) you MUST build a SODA service providing cutout facilities for your
> data cubes
>     d ) the SODA service SHOULD be refered from the SIAV2.0 or ObsTAP
> response via a service descriptor (with appropriate reference to the
> publisher DID column) (case d1). Or it SHOULD be refered in the DataLink
> resource response (if it exists) with appropriate reference to the iD column
> in this response (case d2).

Uh -- this looks scary.  Before anyone panics, can't we simply say:

(1) You build an Obscore service; if you want, add SIAv2 glued on top.
There's several usable TAP engines out there that you can use, so that's
relatively easy to do.

(2) For each dataset, you generate (either pre-generate or generate on
the fly, which would be a datalink service) a little VOTable that
describes access options.  This is what you let your Obscore table point
to.

(3) If you run SODA (e.g., for cutouts), this little VOTable also
contains a description of how to operate it for the dataset in question.

Much less scary, straightforward in implementation, regardless if you're
a large or a small provider.
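
From the client side, step (3) then boils down to something like the
following Python sketch: take the access URL and the dataset identifier
found in the little VOTable and add whatever cutout parameters the user
chose.  The URL, the identifier, and the helper name are placeholders
for the example, and the sketch assumes the access URL carries no query
string of its own.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_cutout_url(access_url, dataset_id, band=None):
    """Assemble a request URL from descriptor values and user input."""
    params = [("ID", dataset_id)]
    if band is not None:
        # Interval given as two space-separated numbers (metres).
        params.append(("BAND", f"{band[0]} {band[1]}"))
    return access_url + "?" + urlencode(params)


url = build_cutout_url(
    "http://example.org/soda",   # placeholder access URL
    "ivo://example/ds1",         # placeholder dataset identifier
    band=(6.55e-7, 6.58e-7),
)

# Round-trip the query string to see what the service would receive.
sent = parse_qs(urlsplit(url).query)
```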

>       If a client is not smart enough to manage Discovery service querying,
> SODA service interface, DataLink response display and interpretation and
> eventually data cube visualization, the end-user may use several combined
> applications communicating via SAMP. This point doesn't make any difference
> as long as all applications are run on the same Deskop

It does make a huge difference (works so-so vs. doesn't work at all)
unless we have full parameter metadata, because otherwise you'll have to
transmit the discovered metadata *together with* the datalink document.
We don't have a technique for that, and developing one is a much larger
pain than simply doing the right thing in parameter definitions.

> PARAMETERS (except blindly which may be reasonable as a fisrt step). But I
> think the basic CSP requirement are filled using the current draft.
> Refinment and sophistication will come in next version  and  could adress

Again: We can do something complicated that may just barely fulfil the
basic CSP requirements in some special scenarios, or we can do something
simple that fulfills the CSP requirements even in the presumably common
case of a datalink transmitted over SAMP.  Hence, I don't think the CSP
requirements can serve as a guideline to choose here.

And then on to François' last mail:

On Thu, Jan 07, 2016 at 04:26:18PM +0100, François Bonnarel wrote:
> C ) the {link} resource of the DataLink spec is working like a glue between
> datasets and additional resources such as fixed links or services applied on
> a given dataset. It contains external descriptions of the links and

True.

> resources, and of services input PARAMETERS. It should not contain
> description of the dataset themselves which is the work of discovery
> services or accessData or server side processing WEB services ( as SODA is
> intended to be), in order to avoid confusion between the role of each module
> in the whole DAL scheme.

I don't think I understand this argument.  Whose confusion are you
worried about?  Why should the description of the dataset be the job of
discovery services?   Of course the dataset itself contains its
metadata, and I don't think anyone was ever confused by encountering WCS
information or the image size in a FITS header.

On the contrary: As you say, the datalink document for a dataset
accessible through one or more SODA services will contain their
parameter metadata.  An important part of this is the domains these
parameters admit.  That these may or may not correspond to properties
of the dataset in question goes without saying -- how could that ever be
confusing?

> with Markus approach of the input PARAMETER domain metadata issue (see
> http://docs.g-vo.org/SODA-r3192.pdf  ,section 6 for his views and compare
> with the same section in the editor WD).

Just to be on the safe side: François of course is talking about 2.6
(fortunately, there's no section 6, so hopefully there's been no
additional confusion).

> I propose a mechanism which I think is more consistent with what we allready
> have and the general DAL architecture. However I don't wnat to push it now
> in the WD and in the spec, because I Think we have time to discuss these
> matters until the next version of SODA, SIAV2 and DataLink. In my first

On that I think we should really try hard to avoid putting forward a
standard with the express plan to, in all likelihood incompatibly,
invalidate it right away.  If I were an outside implementor I'd say that
VO standards authors must have lost their minds when confronted with
such a proposition.  Planning for growth and custom features is good,
announcing immediate obsolescence is not.

> email I tried to convince you that we allready have, without that "domain
> metadata" feature a workable spec to fulfill the basic CSP spec.
>    The solution is based on the inclusion of "ref" attributes in the service
> descriptor PARAMETER elements for all the standard input PARAMETERS. ref to
> the appropriate Obscore FIELD/PARAMETER or GROUP of FIELDS/PARAMETERS. This
> can be done in the discovery service response, or in the response given by
> the SODA service queried with the unique ID="dataset_id" constraint. Let's
> see how it can work with examples in E and F.

I'm tempted to remark that this kind of double referencing is pretty
heavy stuff just to avoid what I've argued above isn't actually
repetition in the first place, and that of course it doesn't solve
operating SODA from datalink (or, equivalently, via SAMP).

But the bigger problem with the proposed mechanism is that it breaks (or
incompatibly overrides) what we have in datalink:

  Although this version of DataLink only has one parameter (ID), using a
  GROUP and providing the service parameter name allows this recipe
  [parameters being filled from what its definition @ref's] to be used
  with any service and (with the GROUP) with multi-parameter services.

(http://ivoa.net/documents/DataLink/20150617/REC-DataLink-1.0-20150617.pdf,
PDF page 21).  So, if SODA now proposed using @ref for pointing to
relevant pieces of metadata, we'd have to explain when to immediately
fill the parameter as per Datalink and when to use the ref'ed value as a
hint as per SODA.  I don't like it.  At all.

Well, thanks for making it here.  Sorry for being so verbose, but I'm
really, really worried we're messing up our one chance to have a widely
adopted *and* implemented (on the client side, primarily) protocol for
SODA: server-side operations on data.

Cheers,

               Markus

