[SIAv2] capabilities

Thu Nov 7 02:24:41 PST 2013

Dear List,

On Wed, Nov 06, 2013 at 01:03:51PM -0700, Douglas Tody wrote:
> On Tue, 5 Nov 2013, Patrick Dowler wrote:
> 
> >I have been working on editing the SIAv2 draft, mainly splitting it
> >up into WD-SIA-2.0 which contains the query and metadata
> >capabilities and the WD-AccessData-1.0 which contains sync and
[...]
> We agreed to add a separate AccessData capability (as a separate service
> I think), however it is not clear if this means that there is no longer
> any accessData capability integrated into SIAV2.  If that is your
> intention, then it can only work if this revised AccessData is image
> specific and based upon the ImageDM - otherwise we have no place to put
> the advanced image access capabilities required for large image cubes:
> slice/dice, function computation, data format transformation, etc.  If
> then AccessData is image-specific it is not clear where we put the
> analogous functionality required later for e.g., spectra or time series
> data.

I'd like to dispute that.  With datalink, robust ways of metadata
discovery and conventions involving UCDs all these use cases can
easily be covered by just adding a few (~4) pages to the datalink
specification -- I've still not seen a single use case that couldn't
be satisfied by the mechanisms I proposed in Waikoloa
(http://wiki.ivoa.net/internal/IVOA/InterOpSep2013DAL/datalink-gavo.pdf).

True, compact specifications (as Datalink is now) are nice, but fewer
specifications are even nicer if they're cheap to get.  And in the
case of processed data services ("accessData" if you will, but
frankly I'm not a big fan of this term), I still maintain all it
takes are those four pages in datalink.  I'd volunteer to start them,
except I'll be terribly busy until early December.

> directly specified by the client.  If stageData is only available as
> part of AccessData it is not clear how information from an acref or
> pubDiD generated by queryData is communicated.  But that is true for

No.  Datalink, at least, is pretty clear about that in its current
draft.

> >1. I defined the {query} capability as a DALI-sync resource that
> >accepts a certain set of params (REQUEST, RESPONSEFORMAT, MAXREC,
> >plus the actual query params). Does anyone feel that we also need a
> >DALI-async query resource for these simple parameter-based queries?
> >(see *)
> 
> Sync is sufficient for this.

Agreed.

> >2. The {metadata} capability returns the complete metadata (as
> >defined by ImageDM) for a single observation discovered via {query}
> >and could also be used with a TAP/ObsCore query response. I have
> >assumed this resource is also a DALI-sync resource and that async
> >is not needed here.
> 
> Sync is all that is required for metadata retrieval.

Agreed as to the sync.  For whether such an endpoint should exist, I
have more doubts.  I believe datalink services should, where
applicable, be able to return "metadata only" (e.g., FITS headers),
and quite possibly SIAv2 could require subordinate datalink services
to offer that.

The big advantage of this would be that if programs understand the
format of the dataset itself, they will probably easily understand
the FITS headers.

*If* we truly believe we can pull off that many people deliver data in
ImageDM (and bear in mind our SDM experiences), then I believe this
should, again, not be SIAv2 specific, but whatever datamodel metadata
there is should be required to be embedded in a datalink response.
That's straightforward if, as is currently the case, the datalink
response is a VOTable, and ImageDM (or whatever other DM is applicable
to the data) were in VO-DML (as it should be).

> >3. As written, it would be allowed for these two capabilities to be
> >the same resource (different values of REQUEST) or two different
> >resources (each supporting their specific value of REQUEST). That
> >is more flexible than previous SIAv2 drafts, in which only the
> >REQUEST value differed.

Frankly, I'm not a big fan of allowing and applying two mechanisms
for what's basically the same thing; in this case: selecting
different "capabilities" (meaning: things it can do, not the
VOResource capability) of a service.  DAL services have traditionally
used REQUEST to do that.  I'd be all for breaking that tradition and
using URL paths ("endpoints"), because:

* No case-insensitivity nightmares any more!  Yay!
* Different endpoints can be described by different VOResource
  capabilities, leading to more natural endpoint metadata
* URL construction is simpler (no more figuring out whether to 
  append URL parameters with ? or & when you don't actually have a
  real URL parser)
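
To illustrate the last point, here is a minimal sketch of what a
client has to do under either scheme, using only Python's standard
library.  The example URLs, the endpoint name and both helper names
are made up for illustration; nothing here is taken from the draft.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_request_param(access_url, request, params):
    """REQUEST style: merge parameters into an access URL that may
    already carry a query string, which takes a real URL parser."""
    parts = urlsplit(access_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("REQUEST", request))
    query.extend(params.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


def with_endpoint(base_url, endpoint, params):
    """Endpoint style: the operation is a path segment.  This assumes
    the base URL itself has no query string, so the query part always
    starts fresh with a single '?'."""
    return base_url.rstrip("/") + "/" + endpoint + "?" + urlencode(params)
```

With endpoints, plain string concatenation is enough; with REQUEST,
the client has to inspect the access URL before it can append
anything.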

However, if we don't want to break with the REQUEST tradition, then
we shouldn't complicate the situation by using one mechanism for some
"capabilities" and another for some other "capabilities".  So: If
REQUEST once (for the record: DALI doesn't require that), then it
should be all REQUEST.

In the run-up to the Waikoloa interop, I went through the 2013-08-12
WD of SIAv2; let me take the liberty to raise the points I had back
then here.  Sorry it's fairly long again.

(1) "automated virtual data generation" -- this is where you put a
"generate links to cutouts"-like parameter into your request, much in
the way most current cutout SIAP services work.

I don't believe we should be doing this.  It complicates the service
interfaces and the spec without much benefit to the client.  Those
services that offer cutouts can easily declare a datalink service
including cutouts and deliver that together with the SIAP response.
Even if you believe you need a special SIAv2 "accessData" service,
this "deliver cutout URLs" thing is more a complication than a
simplification to the client (let alone the service).

In that way, clients know exactly where they can get cutouts and
where they have to retrieve the whole dataset, and they know exactly
when what they get is a cutout and when it's the whole thing. 

Plus: we save two (ok, 1 1/2) pages of spec language in SIAv2,
actually *improving* capabilities.  Isn't *that* a deal right there?

(2) "PQL".  Here's my main point.  I confess this has traits of a
crusade, but I honestly believe "PQL" is bad for service
interoperability[*].

One consequence of doing ad-hoc grammars is that in the table in
section 3.1.3, the datatype of most parameters is STRING.  You might
say that for service-defined parameters (as opposed to extension
parameters) that's not much of a problem since clients know the
semantics.  But even there things become ugly, e.g., if you wanted to
communicate the sensible range of the values in POS, BAND, TIME, or
POL.  And would you really want to write something like

<PARAM name="input:POS" datatype="char" arraysize="*" unit="deg"...

-- a string with unit degrees?

But of course, for POS, there's additional syntax in connection with
the reference system (or maybe even "coordinate system").  Here,
GALACTIC and ICRS are "required" (except when they're not, as in
several cases that are enumerated; please, folks, either something is
required or it's not).  How would a client discover what works there?

My opinion on this: If a service supports POS (I'd hope this would
be RA and DEC, really), then it must be ICRS.  If it wants to
support other coordinate systems, it can do so using other parameters
and declare their STC metadata using the (quasi-)standard
STC-in-VOTable mechanisms in the metadata response.  You get full STC
and clear semantics *and* syntax on RA and DEC, all at the same time.
Isn't that nice?
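
As a rough sketch of what such a declaration could look like, the
following builds a numeric input PARAM with a unit, a UCD and a
legal range, using VOTable's VALUES/MIN/MAX elements.  The helper
name input_param and the concrete UCD and range values are
assumptions for the example, not something the draft prescribes.

```python
import xml.etree.ElementTree as ET


def input_param(name, ucd, unit, low, high):
    """Build a VOTable PARAM declaring a numeric input parameter
    together with the range of values the service accepts."""
    param = ET.Element("PARAM", {
        "name": "input:" + name,
        "datatype": "double",   # a number, not a string with a unit
        "unit": unit,
        "ucd": ucd,
        "value": "",
    })
    values = ET.SubElement(param, "VALUES")
    ET.SubElement(values, "MIN", {"value": str(low)})
    ET.SubElement(values, "MAX", {"value": str(high)})
    return param


# Illustrative declarations for ICRS positions in degrees.
ra = input_param("RA", "pos.eq.ra", "deg", 0, 360)
dec = input_param("DEC", "pos.eq.dec", "deg", -90, 90)
print(ET.tostring(ra, encoding="unicode"))
```

A client reading this knows the type, the unit and the sensible
range of each parameter without parsing any ad-hoc string syntax.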

Again, this is my #1 itch with the draft: Please, *please* no more
"PQL".  Or if, then write a strict grammar *and* a way to communicate
legal ranges, syntaxes, options and whatever else you have in PQL in
metadata responses, *and* have clear recommendations on how to do
custom service parameters ("atomic" or "PQL, too").  Otherwise, you
end up where we're now in SSA.

(3) nD POS, SIZE: The spec says it'd be ok to just put as many values
in there as you like to support data with more than two dimensions.
I maintain that simply won't fly, as with more than 2 values it's
utterly unclear what you're constraining: time?  wavelength?
frequency?  In what units?  Maybe spatial coordinates are suddenly at
the end?

No: Again, don't have compound values.  With separate parameters, you
have RA, RA_SIZE, DEC, DEC_SIZE, LAMBDA, LAMBDA_SIZE, and in the
LAMBDA metadata declaration you could even say what unit the service
expects.

Actually, I'd weakly prefer RA_MIN, RA_MAX rather than the value/size
pairs.  But that's really a deliberation between server and client
convenience, so I'll not actually take sides here.
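
Either form is easy to produce on the client side.  The sketch below
turns value/size pairs into MIN/MAX pairs and serialises them as
separate parameters.  It assumes that a size is the full extent along
one axis, and it ignores RA wrap-around at 0/360 degrees and the
cos(DEC) factor; the function names are made up for the example.

```python
from urllib.parse import urlencode


def to_range(center, size):
    """Turn a value/size pair into a (min, max) pair, taking size
    as the full extent along the axis."""
    half = size / 2.0
    return center - half, center + half


def atomic_query(ra, ra_size, dec, dec_size):
    """Serialise a spatial constraint as four atomic parameters."""
    ra_min, ra_max = to_range(ra, ra_size)
    dec_min, dec_max = to_range(dec, dec_size)
    return urlencode({
        "RA_MIN": ra_min, "RA_MAX": ra_max,
        "DEC_MIN": dec_min, "DEC_MAX": dec_max,
    })


print(atomic_query(10.0, 2.0, -5.0, 1.0))
```

Each parameter carries exactly one number, so there is never a
question of which axis a given value constrains.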

(4) BAND names: The spec allows feeding some strings to BAND, the
idea being that you could say BAND=J,H,K (or maybe even BAND=J/K?)
and the service would then return some infrared data; also VOResource
band designations ("x-ray") are allowed.

Unfortunately, it's not as simple as that, since we all know how many
band names there are out there and how name clashes are the rule
rather than the exception.  So, on a single service this makes no
sense without a way to discover what band names are supported, and
across services it makes no sense without a controlled vocabulary of
band names and their intended interpretation.  While communicating
the supported band names wouldn't be hard, I believe the vocabulary
won't happen, so let's just strike BAND and have LAMBDA_MIN and
LAMBDA_MAX.

If you don't believe me, try an all-VO SSA query with one of the
VOResource band designations.

Services still could have custom parameters allowing something like
that, of course; but there's no easy way to make these designations
globally work, and for the hard way, I'd say the advantage of being
able to say "infrared" rather than something like
LAMBDA_MIN=4e-7&LAMBDA_MAX=1e-4 (or whatever the scientist's taste
is) is far too small.
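
For what it's worth, an application that wants to offer such names
to its users can do the translation itself.  A minimal sketch
follows; the table name, the function name and in particular the
wavelength limits (in metres) are rough, illustrative assumptions and
in no way a proposed vocabulary.

```python
# Hypothetical, application-defined limits in metres.  Different
# applications will reasonably draw these boundaries differently.
BAND_LIMITS = {
    "optical": (3e-7, 7e-7),
    "infrared": (7e-7, 1e-3),
}


def band_to_lambda(name):
    """Translate an application-level band name into the numeric
    LAMBDA_MIN/LAMBDA_MAX parameters sent to the service."""
    try:
        low, high = BAND_LIMITS[name.lower()]
    except KeyError:
        raise ValueError("unknown band name: %r" % name)
    return {"LAMBDA_MIN": low, "LAMBDA_MAX": high}


print(band_to_lambda("Infrared"))
```

The service only ever sees numbers, and the meaning of a band name
stays with the application that chose it.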

(5) in 3.1.3.7, two ways are given to retrieve the metadata.  Given
we don't have to support legacy methods as there are no SIAv2
clients:  Please, let's agree on one way.  Frankly, I'd prefer none
of the two but a REQUEST=getMetadata (if we have to have REQUEST).
Or, even better, do away with this kind of metadata discovery
entirely and say that services should use the capabilities VOSI
endpoint.  I don't really care what it is.  I do care that I don't
have to expose the same information in response to four different
requests in three different formats.

(6) REGION: On the use of STC-S as a service parameter we've already
had our quarrels.  I still believe it's a bad idea and unnecessary on
top of that.  If you insist on having it, though: Please define some
way that lets clients discover what kind of strings are accepted
here, on a service-by-service basis.

(7) FLUXLIMIT: is specified in microJy.  Now, frankly, that sounds a
bit like premature optimization for a particular use case to me.
Also note we're talking about a machine-to-machine protocol here, so
I'd make the point the parameter should be in application-neutral
W/m**2.   That leaves the choice of the unit presented to the user
clearly to the application, which is where it ought to be.

(8) COMPRESS: There are standard HTTP mechanisms that cover that use
case.  Don't reimplement them on another level.  Plus of course, as
written in the spec it's completely unclear what either service or
client writers are supposed to do.  Let's just not have that
parameter.
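
A minimal sketch of the HTTP-level alternative, using only Python's
standard library: the client asks for gzip through the
Accept-Encoding request header and undoes it according to the
Content-Encoding response header.  The function names are made up
for the example, and only gzip is handled here.

```python
import gzip
import urllib.request


def make_request(url):
    """Prepare a request that tells the server gzip is acceptable."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body, content_encoding):
    """Undo the compression declared in the Content-Encoding header;
    pass the body through unchanged if none was declared."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    return body
```

A client would call urllib.request.urlopen(make_request(url)), read
the response, and pass the body and the Content-Encoding header value
to decode_body.  No protocol-level parameter is involved.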

(9) 3.2.2.3 Recommended Columns for Data Access: I move to remove
this.  I claim datalink does all that's needed here, and it already
defines how to communicate where the identifier to use in these
services is.

(10) 3.2.3.2 Association Metadata: Well, that's a problem haunting
SSA as well: a given file can be part of multiple associations.  This
means that to store this kind of data, you'd either need a separate
table, or, if you're Evel Knievel, a variable number of additional
columns that are grouped in some way to make clear what type, id, and
key columns belong together.  If that's what we want and we believe
it's worth the additional specification effort, then we should say
what to do (or just write the whole thing in VO-DML and say what
SIAv2 returns is a VO-DML instance document; that might be a relief
anyway).

(11) 3.2.3.3 Multiformat Association: I've never been too fond of
solving the format problem by adding rows to the SSA tables.  Now
that we have datalink -- couldn't we just say "Services should only
return one row per dataset, with an accref to their preferred format,
which should be a FITS or VOTable dataset conforming to the spectral
data model in the latest version recommended by IVOA if at all
possible.  For the benefit of clients preferring or requiring other
formats, an associated datalink service may offer a parameter
FORMAT (UCD: param.format;meta.file)"?

What I'd like most about this: It would let me have PubDID as a
primary key in my SSA tables and yet no client knowing datalink would
be locked out.  And of course clients don't have to second-guess SSA
responses in order to filter out duplicates.
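
To make that second-guessing concrete, here is a sketch of what a
client has to do with a multi-row-per-dataset response today.  The
row keys (pubdid, format, accref), the function name and the default
order of preferred formats are assumptions for the example; real
responses identify these columns by utype or UCD.

```python
def one_row_per_dataset(rows, preferred=("application/fits",
                                         "application/x-votable+xml")):
    """Collapse a response with one row per (dataset, format) into
    one row per dataset, keeping the most preferred format."""

    def rank(row):
        fmt = row["format"]
        return preferred.index(fmt) if fmt in preferred else len(preferred)

    by_id = {}
    for row in rows:
        by_id.setdefault(row["pubdid"], []).append(row)
    # dicts keep insertion order, so datasets stay in response order
    return [min(candidates, key=rank) for candidates in by_id.values()]


rows = [
    {"pubdid": "ivo://example/a", "format": "image/png", "accref": "a.png"},
    {"pubdid": "ivo://example/a", "format": "application/fits", "accref": "a.fits"},
    {"pubdid": "ivo://example/b", "format": "image/png", "accref": "b.png"},
]
print([row["accref"] for row in one_row_per_dataset(rows)])
```

With one row per dataset and formats negotiated through datalink,
none of this would be needed on the client side.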

Well, sorry for another long diatribe.  But as someone who's
implemented SSA on the server side and done quite a bit of work on the
client side I feel we should really avoid mistakes we made back then.

Cheers,

         Markus

[*] "PQL" criticism in short: it's underspecified and allows too many
variants without any way to communicate what the grammar for a
concrete parameter is; there are large holes in the spec (string
values with , or ;, sequences of positions, etc), no rules on what
should happen with custom service parameters and hence confusion for
clients trying to build interfaces.  More on the consequences of all
that in the SSAP universe can be found in
http://docs.g-vo.org/talks/2012-urbana-ssapstate.pdf