SODA gripes (4): Mandatory multiplicities considered harmful

Thu Feb 11 00:13:28 CET 2016

TL;DR - yes, I implemented async SODA explicitly to support multiplicity and
multiple results.

Long version: I am working on a SODA prototype right now and have
implemented the async mode, which computes cutouts for the product of all
input ID, POS, BAND, and TIME params (POL is multi-valued, but it is treated
as a single set of states).

For example, if you pass in one ID and 3 POS values you get 3 cutout
results (well, see below). If you pass in an ID, 3 POS values, and 2 BAND
values you get as many as 6 results. A "result" is simply a URL stored in
the results section of the UWS job. In our case I don't actually move any
bytes or operate on files until the caller uses one of the URLs, but that
is an implementation detail.
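
A minimal sketch of that expansion, in Python rather than whatever the prototype actually uses. The function name expand_cutouts, the example ID, and the parameter values are hypothetical and only illustrate the "product of all inputs" idea; POL is left out because it is handled as a single set of states.

```python
from itertools import product


def expand_cutouts(ids, pos=(), band=(), time=()):
    """Return one parameter dict per combination of the supplied values.

    Hypothetical helper: it only enumerates combinations, it does not
    perform any cutout.
    """
    axes = {"ID": ids, "POS": pos, "BAND": band, "TIME": time}
    # Skip parameters that were not supplied; an empty axis would
    # otherwise reduce the whole product to zero combinations.
    supplied = {name: list(values) for name, values in axes.items() if values}
    names = list(supplied)
    return [dict(zip(names, combo)) for combo in product(*supplied.values())]


combos = expand_cutouts(
    ids=["ivo://example/obs1"],  # made-up identifier
    pos=["CIRCLE 1 3 3", "CIRCLE 4 5 3", "CIRCLE 7 8 3"],
    band=["3e-7 4e-7", "1e-7 2e-7"],
)
print(len(combos))  # 1 ID x 3 POS x 2 BAND = 6 combinations
```

Each combination is only a candidate: a combination that does not intersect the data yields no result, which is why the count above is an upper bound.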

The motivation for the async functionality is that, with the increasingly
common use of "object store" systems, we may be moving to an era where we
can no longer run astronomy software where the data is stored and have
random access to it... we may have to move the file(s) someplace else and
then operate on them, in which case async with multiplicity can be much more
efficient. Right now, we do have CANFAR users who use our data access
services from their processing, and they will do the extra work to use "bulk
operations" such as async SODA because it is easier for them to manage and
more efficient than the 100s or 1000s of requests per job they would
otherwise make.

Note: the multiplicity in DataLink is also there strictly as an efficiency
mechanism. You can enforce MAXREC=1 and indicate with an OVERFLOW if you
don't want to do it. But with SODA, if you don't want to support
multiplicity, just don't provide an async endpoint.

Pat

And now the "see below": well, UWS doesn't provide much in the way of
support for jobs that produce multiple results and only partially succeed in
doing so (i.e., no support at all). If input params are invalid I just make
the whole job fail, but I have no way to report that:

- some combinations of ID and "bounds" do not intersect the target data
- I failed for some (transient) internal reason
- I can't provide some results because of authentication/authorization
requirements

In sync you have suitable HTTP status codes for all of these.
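
To make the gap concrete, here is a rough Python sketch of the per-combination bookkeeping an async job would need in order to report partial success. Everything in it is an assumption for illustration: the Outcome names, the summarize helper, and in particular the mapping to HTTP status codes, which is my guess at "suitable" codes and is not taken from the SODA or UWS documents.

```python
from collections import Counter
from enum import Enum
from http import HTTPStatus


class Outcome(Enum):
    """Hypothetical per-combination outcomes for a multi-result job."""
    OK = "ok"
    NO_OVERLAP = "no-overlap"          # ID and bounds do not intersect the data
    INTERNAL_ERROR = "internal-error"  # transient failure on the service side
    NOT_AUTHORIZED = "not-authorized"  # authentication/authorization required


# Assumed mapping to what a sync request could return for the same case.
SYNC_STATUS = {
    Outcome.OK: HTTPStatus.OK,
    Outcome.NO_OVERLAP: HTTPStatus.NO_CONTENT,
    Outcome.INTERNAL_ERROR: HTTPStatus.INTERNAL_SERVER_ERROR,
    Outcome.NOT_AUTHORIZED: HTTPStatus.FORBIDDEN,
}


def summarize(outcomes):
    """Count how many combinations ended in each outcome."""
    return Counter(outcomes)


job = [Outcome.OK, Outcome.NO_OVERLAP, Outcome.OK, Outcome.NOT_AUTHORIZED]
for outcome, count in summarize(job).items():
    print(outcome.value, count, int(SYNC_STATUS[outcome]))
```

The point is only that sync can say all of this with one status code per request, while a UWS results list has nowhere to put it.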

On Tue, Feb 9, 2016 at 5:33 AM, Markus Demleitner <
msdemlei at ari.uni-heidelberg.de> wrote:

> Dear DAL folks,
>
> While I'd still appreciate comments regarding the proposed
> explanatory chapter -- see Gripe (3),
> http://mail.ivoa.net/pipermail/dal/2016-January/007281.html --
> (and I still suspect it's useful to skim over that stuff to
> understand what's being discussed here all the time), here's my next
> gripe (it's not so much time until Cape Town any more).
>
> There's a TL;DR below.
>
> This is about mandating parameter multiplicities.  In case you were
> wondering, this means text like this:
>
>   In general, filtering parameters are single-valued in \{sync\}
>   requests and multi-valued in \{async\} requests (exceptions noted
>   below). When multiple values of filtering parameters are used in an
>   \{async\} job, each combination of values produces zero or one
>   result.
>
> and then, nevertheless, for every parameter, stuff like:
>
>   The POS parameter is single-valued for \{sync\} requests and
>   multi-valued for \{async\} jobs.
>
> I propose to strike all such language.  In a section on general rules
> for parameters, there could be text like:
>
>   This specification does not constrain the behaviour of services in
>   the presence of repeated parameters.  For enumerated parameters
>   (i.e., those with \xmlel{OPTIONS} in \xmlel{VALUES}), clients should
>   display widgets allowing the selection of zero or more of the
>   options available.  Services must therefore not fail when receiving
>   multiple values even for single-valued enumerated parameters and
>   discard all but one of the parameters passed.
>
> Yes, it's suboptimal (but see below), but I think we can't really do
> better at this point.
>
>
> Rationale:
>
> Whether or not it makes sense for a service to accept repeated
> parameters (i.e., stuff like OBJECT=alp%20Cyg&OBJECT=bet%20Cyg) is
> highly dependent on the service and on the nature of the parameter.
> If we try to mandate behaviour in the standards text, we'll only
> generate lots of non-compliant services.
>
> Also, the implementation effort typically increases fairly
> significantly when handling sequences (for my datalink
> implementation, it was about 1.5 when allowing multiple values of ID
> in; in the datalink XSLT client, dealing with the results of multi-ID
> queries is still unsolved; the multiple-ID rule in Datalink precludes
> using pre-generated files to serve responses[1]).
>
> So, we should have a very good idea why we want this, and I don't
> think we have that.  Indeed, given the wide range of SODA
> applications (whether already operational, or specified, or
> envisaged), I think we cannot.
>
> As the existing language (see above) on what to do in the presence of
> multiple multiple values -- e.g.,
>
>   POS=CIRCLE 1 3 3&BAND=3e-7 4e-7&POS=CIRCLE 4 5 3&BAND=1e-7 2e-7
>
> -- shows, not even the semantics are straightforward (guess what this
> does, then try to figure out what really should happen according to
> the standard (I don't believe there's a service doing this right now,
> though)).  In that respect, I think *if* we really want "batch
> processing" in SODA, we should go for a much more straightforward
> way: upload a VOTable with one set of parameters per line.  No
> combinatorial explosion, minimal specification effort.
>
> But I doubt all this is even very useful as specified now -- the
> plan, if I understand correctly, is that the results of such batch
> operations would appear as separate results in a UWS document (this
> would need much more explanation if we really go there).  That,
> however, means that there's still one request per processed document,
> so the actual savings in overhead or whatever are probably fairly
> small.
>
> So: It's not evidently useful, certainly not necessary to cover the
> CSP requirements, I don't think anyone has implemented it, it's hard
> for the clients.  Let's simply not say anything (except the very
> general language proposed above) without serious prototyping.[2]
>
>
>
> However (additional proposal):
>
> While I think mandatory multiplicities are a pain that will lead to
> massive non-interoperability if they were ever taken up, I think it'd
> be really useful if services announced which of their parameters can
> actually be repeated.  That's important for clients to really produce
> widgets properly guiding the user (e.g., only allowing one selection
> for FORMAT but allowing multiple selections for OBJECT).  This could
> also be a basis to allow multi-cutouts where they can usefully be
> implemented (perhaps turning a long spectrum into a short SED, or
> something producing an archive of little things).
>
> In an ideal world, we'd have PDL with sufficient capabilities
> formulated in VO-DML ready now.  That would be enough to have an
> expressive and (for machines) easily interpretable annotation and
> would solve several other problems with annotating parameter sets
> (e.g., "if you give a range for PIXEL_3, you cannot give a range for
> LAMBDA").
>
> With a bleeding heart I'll concede that's something we'll have to
> postpone to version 1.1.
>
> While I'm sure a proper parameter DM is where we need to go, even now
> we could, as a stopgap measure for this relatively important use
> case, prescribe some ad-hoc annotation for repeatable params.
> Looking at the VOTable spec, I'd say there are four relatively
> non-destructive ways we could do this:
>
> (1) hog the utype attribute of the param
>   <PARAM name="OBJECT" ... utype="temporary:repeatable"/>
>   (this would be my favourite; I don't think PARAM/@utype will be
>   used for anything else in future versions of SODA; even when VO-DML
>   still used @utype, "legacy" utype attributes were left alone)
>
> (2) use an immediate group:
>
>   <PARAM...>
>   <PARAM...>
>   <PARAM...>
>   <GROUP utype="temporary:repeatable">
>     <PARAM...>
>     <PARAM...>
>   </GROUP>
>
>  (that's a bit of a pain for the service)
>
> (3) use group referencing:
>
>   <GROUP utype="temporary:repeatable">
>     <PARAMref ref="a"/>
>     <PARAMref ref="b"/>
>   </GROUP>
>   <PARAM...>
>   <PARAM id="a"...>
>   <PARAM...>
>   <PARAM id="b"...>
>   <PARAM...>
>
>  (that's a bit of a pain for the service)
>
> (4) use LINK
>
> <PARAM ...>
>   <LINK content-role="adhoc-annotation"
>     >ivo://ivoa.net/std/SODA#repeatable-param</LINK>
> </PARAM>
>
> I'm not terribly smitten with any of this.
>
> So, my preference remains for someone to fix up VO-DML and PDL for
> version 1.1.  When there's no solution for the multiplicities problem
> in 1.0, perhaps there's more pressure to actually make PDL-in-VO-DML
> happen.
>
>
>
> TL;DR: Services should have the right to decide on multiplicities
> themselves.  It'd be nice if we gave clients some way to figure out a
> given service's decision reliably, but I suspect we've been too lazy
> these recent years in VO-DML and PDL to make it happen properly for
> 1.0.
>
>
> Cheers,
>
>          Markus
>
> [1] You may guess that I'd much rather get rid of the multiple-ID
> thing in datalink services.  That's true.  I'll shout as 1.1 comes
> around.
>
> [2] As far as I am concerned, we could simply strike async completely
> and be done with it.  I don't think anyone could implement async
> based on what's in the spec.  But that's another thing, and I don't
> have it on my agenda right now.  Has anyone really tried async
> SODA?  I'd be curious to compare if we came out with the same
> choices...
>
>
> PS: Preview on future gripes (sequence TBD):
>
> () Spatial coverage discovery and the RA and DEC parameters
> () Pixel cutouts: PIXEL_n
> () Behaviour for no-ID queries?  For queries with only ID?
> () POS doesn't have an xtype
> () Examples stuff: example example, and perhaps a dl-id term?
>
>

-- 
Patrick Dowler
Canadian Astronomy Data Centre
Victoria, BC, Canada