STC-S (with a view to DataLink)

Mon Jun 24 02:57:33 PDT 2013

Dear DAL list,

For those just coming in or wondering what the fuss is about, see a
little example close to the bottom of this mail.

The fact that nobody has spoken out in favour of atomic parameters so
far is quite a heavy downpour on my parade, not to speak of my
thunder, the unexplained disappearance of which I regret.

Still, since I believe this is an important choice and I'm really
worried by the SSAP precedent, I'll try once more, again with a
diatribe bespeaking my secret love for the humanities, at least
through its length.  If then, still, nobody shows signs of starting
to agree with me, I'll shut up, ok?

(All quotes from mails that went over the DAL list in the last few
days)

So: 

What's this about?
==================

(those wanting to look at a concrete example, see below)

François said:

> I would say it's not "STC-S in Datalink" but something like "STC-S
> in cutout services and access data methods".  This kind of services
> and methods will be part of the ressources Datalink will attach to
> Dataproducts indeed, but according to the discussion during
> Heidelberg interop last month Datalink protocol in itself is only
> describing the nature , format, type and semantics or descriptions
> of the links and will say nothing about the  ressources parameters
> themselves

Hm -- was that the agreement (I had to be largely in another session,
sorry)?  If so, I'd find that regrettable, since if I don't know the
parameters a service takes, the link to it isn't terribly useful, is
it?

Anyway, I seem to remember some session that did contain talk about
transmitting parameter metadata, and half the point of this whole
thing is pointing out that we don't know how to do that for
STC-S-valued parameters.  And we'll need to do that, whether in the
immediate DataLink response or in a secondary service response; my
worries aren't really affected by that location.

Why people don't like structured parameters
===========================================

Well, from the answers, frankly, I've not been able to pull many
arguments against what I called "structured parameters", i.e., X_MIN
and X_MAX for intervals (and possibly more of this type).

In terms of concrete criticism, Pat offered:

> features)!! I do not agree at all with the idea of trying to do everything
> with primitive datatypes and hordes of parameters.

Well, at some level you'll have to have those parameters anyway --
somewhere in your code there'll be "5th component of center of
sphere".  The question is: Do you, for your protocol, define a
special serialization (on top of what HTTP/VOTable already give you)
for combinations/groups of those parameters or don't you?

Of course, there's some value in abstraction, and being able to say
"This is a 5-Sphere, where this is the center and that is the radius"
*may* make things simpler -- though I have to say I doubt it's a big
advantage on a protocol level, and I'd really like to see convincing
use cases.

By the way, it would of course be a sane way to *implement* a
DataLink endpoint to de/serialize, say, rectangle objects directly
from and to collections of HTTP parameters -- I just maintain that
the serialization doesn't need to know about what a given software
does with the message, and making it aware of it is a complication
rather than a simplification.
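
To make that concrete, here is a minimal Python sketch of such an
implementation-side mapping.  The Rectangle class, the helper functions,
and the parameter names RA_MIN, RA_MAX, DEC_MIN and DEC_MAX are
hypothetical illustrations and not taken from any DataLink draft; the
point is only that the object exists in the software while the wire
format stays a flat set of atomic parameters.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlencode

# Assumed mapping from attribute names to atomic HTTP parameter names.
PARAM_NAMES = {
    "ra_min": "RA_MIN",
    "ra_max": "RA_MAX",
    "dec_min": "DEC_MIN",
    "dec_max": "DEC_MAX",
}


@dataclass(frozen=True)
class Rectangle:
    """Hypothetical server-side object; the protocol never mentions it."""
    ra_min: float
    ra_max: float
    dec_min: float
    dec_max: float


def to_query(rect):
    """Serialize a Rectangle as ordinary form-urlencoded parameters."""
    return urlencode(
        {http: getattr(rect, attr) for attr, http in PARAM_NAMES.items()})


def from_query(query):
    """Build a Rectangle from a form-urlencoded query string."""
    raw = parse_qs(query, strict_parsing=True)
    values = {}
    for attr, http in PARAM_NAMES.items():
        literals = raw.get(http, [])
        if len(literals) != 1:
            raise ValueError(f"{http} must be given exactly once")
        values[attr] = float(literals[0])
    return Rectangle(**values)


rect = Rectangle(2.3, 4.2, -78.0, -77.0)
query = to_query(rect)
print(query)
print(from_query(query) == rect)
```

All parsing of the message itself is done by the standard library; the
only protocol-specific knowledge is the table of parameter names.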

*If* we decide to define such types, let's not take that lightly lest
we end up with ambiguous serializations and no ways to define
capabilities, the domain of the parameter, etc.  

I have to confess I was alarmed when I read Pat saying:

> beyond primitive integers and floats and strings. At CADC we have been
> treating shapes (circles, polygons, etc) and intervals a real datatypes**
> [...]
> ** not advocating that we open VOTable up again and add data types

*That* is exactly the problem.  Take a look at VODataService 1.1.
There are already three type systems defined in there -- actually,
it's even a hierarchy with, in total, six members:

(DataType) -- SimpleDataType
   `-- (TableDataType) --  VOTableType
            `-- (TAPDataType) -- TAPType

-- and that still doesn't reflect the duct-tape that is xtype.

What Pat is suggesting here is, in essence, to add a fourth type,
DataLinkParameterType, say.  After all, you'll have to declare the
type of your parameters somewhere (and preferably somewhere in the
Registry, too), so if we don't extend what we have, we'd have to
invent something else, hence another type system.

Please don't!

My take on this: if there's a strong use case requiring those complex
types, then it should carry for adding them to VOTable, too; if it's too
weak for that, then maybe they shouldn't be in the protocols in the
first place.

Note, however, that all the proposed new types (intervals, geometries)
would also require extensions to the VOTable VALUES element, e.g.,
because, being isomorphic to R^n, none of them is (meaningfully)
orderable, and hence MIN and MAX aren't terribly useful.  Of course,
the original sin has been committed there already since we have
arrays and complex numbers, for which MIN and MAX aren't well-defined
either.

I'm sorry if I missed other counter-arguments -- if so, would you
raise them again?

Why people want STC-S
=====================

I *thought* the reason to want STC-S was to allow non-rectangular
cutouts.  But both François and Doug appeared to imply they didn't
actually want this; Doug wrote:

> I do think STC-S is a viable way to express a multi-dimensional bounding
> box or region, for discovery queries and simple cutouts expressed in
> world coordinates, so long as we limit the complexity.  Just expressing
> a range of values (or possibly simple region) in each coordinate axis is
> simple enough.  This much would not be that hard to parse, and could

Well, for *that* task we don't need to invent serialization/metadata
declaration on top of what we already have -- _MIN/_MAX is enough and
more generic since it easily allows axes that in STC-S would be hard
to describe.

Another argument might have been to allow more or less arbitrary
reference frames (even positions?) in server input; I've always
maintained coordinate transformation is either trivial (in which case
it doesn't merit protocol support) or too hard to perform without
knowing the science use case (in which case the server can't do it
anyway).  And indeed, Arnold suggested:

> If we allow STC-S strings to be used to provide the coordinate
> metadata in a DAL protocol, that protocol's standard can very well
> state that only ICRS and GALACTIC are allowed for the spatial
> reference system and that these are required to be 2-D spherical.

Is *that* worth the complexities of introducing a special
serialization format?  *That* transformation can be written in two
lines of awk.  And restricting to 2D spherical seems to severely
limit what that can be used for anyway -- what would our theory
people have to say about such a limitation?
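
For a sense of scale, here is a sketch of that transformation in Python
rather than awk.  It uses the textbook spherical-trigonometry formulae
with the commonly quoted J2000 pole constants, and it glosses over the
small difference between ICRS and FK5 J2000, so treat it as an
illustration of how little code is involved, not as a reference
implementation.  The function name is made up for this example.

```python
import math

# Commonly quoted J2000 orientation of the Galactic system.
RA_NGP = math.radians(192.85948)   # RA of the north Galactic pole
DEC_NGP = math.radians(27.12825)   # Dec of the north Galactic pole
L_NCP = math.radians(122.93192)    # Galactic longitude of the celestial pole


def icrs_to_galactic(ra_deg, dec_deg):
    """Return (l, b) in degrees for an equatorial position in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    d_ra = ra - RA_NGP
    sin_b = (math.sin(dec) * math.sin(DEC_NGP)
             + math.cos(dec) * math.cos(DEC_NGP) * math.cos(d_ra))
    # Clamp against rounding before taking the arcsine.
    b = math.asin(max(-1.0, min(1.0, sin_b)))
    y = math.cos(dec) * math.sin(d_ra)
    x = (math.sin(dec) * math.cos(DEC_NGP)
         - math.cos(dec) * math.sin(DEC_NGP) * math.cos(d_ra))
    lon = (L_NCP - math.atan2(y, x)) % (2 * math.pi)
    return math.degrees(lon), math.degrees(b)


# The north celestial pole should land at l = 122.93192, b = 27.12825.
print(icrs_to_galactic(0.0, 90.0))
```

A client that can do this locally has no need for the server to accept
several frames.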

As to Pat's (admittedly valid) argument:

> beyond primitive integers and floats and strings. At CADC we have been
> treating shapes (circles, polygons, etc) and intervals a real datatypes**
> for a long time now and once you do that all the confusion goes away -- and

I've already said that there are cases where you want them (ADQL, say)
-- but that I can't see how DataLink is one of them. Given how hard
it is to define type systems (including their valid values) sensibly
and robustly, there should be a really strong reason to expose them
on a protocol level (as opposed to just using them internally or
within custom interfaces).

Doug has, in addition:

> back-end processing.  If has the advantage of allowing simple
> multi-dimensional regions to be specified with a single parameter.

Is the specification with a single parameter actually a measurable
advantage?  In what use case does it make a difference if the
parameter set has a custom serialization (i.e., STC-S) or just the
normal HTTP www-form-urlencoded serialization?

Again, if I've neglected some argument, please do tell me off and
maybe try making your point again.

On Declaring Protocol Parameters
================================

Knowing full well I'm sounding like a broken record: This is what all
this is really about.  We *must* define our services such that the
knowledge of the protocol together with whatever service metadata we
specify lets a (machine) client discover how valid requests to the
service are constructed (i.e., in particular what parameters are
supported and what literals are expected in each parameter).  Bonus
points if the client can suggest parameter values that actually
return results, to alleviate the user's horror vacui in front of an
interface like this:

    Enter parameters:

    _________________________________

                    [Cancel]   [Send]

Since that point is so dear to my heart after the SSAP experience,
let me briefly reply to Doug:

> >This requires a short excursion: I strongly believe we should stop
> >lying.  We're currently lying when we, as in current SSAP, say something
> >like <PARAM name="INPUT:BAND" datatype="double" unit="m"...> in the
> >service metadata.
> >What clients are expected to pass in is (for most services) something
> >like "1e-7/", which clearly is *not* a double literal.  The SSAP spec
> 
> BAND is an example of a custom datatype much as Pat suggested.  The
> actual datatype is not double, but ordered rangelist.  List is obviously

Well, that's the issue.  If you look at the sample metadata response
from the SSAP standard, it says:

  <PARAM name="INPUT:BAND" value="ALL" datatype="char" arraysize="*"> 
      <DESCRIPTION> 
          Spectral coverage: Several values can be combined in a 
          comma separated list. Below values are treated case insensitive. 
          All spectra returned by this service belong mainly to the optical 
          reaching to the infrared regime. Therefore, the other values 
          won't yield any matching records in the query response. 
          Alternatively the wavenlength can be given in meters or as a 
          range thereof. 
      </DESCRIPTION> 
      <VALUES> 
          <OPTION value="ALL"/> 
          <OPTION value="radio"/> 
          <OPTION value="millimeter"/> 
          <OPTION value="infrared"/> 
          <OPTION value="optical"/> 
          <OPTION value="ultraviolet"/> 
          <OPTION value="x-ray"/> 
          <OPTION value="gamma-ray"/> 
      </VALUES> 
  </PARAM> 

(p. 61).  That, fortunately, is not quite a lie (we're not saying:
this is a float), but it's not the whole truth either.  As you can
see, a client would assume it can use "infrared" and a few others to
fill that *string* that BAND is and that's it.  There's no way it
could figure out this is an ordered rangelist.  And is it?  The
comment appears to suggest otherwise even to humans.  Incidentally,
the attempt to save parameters in this case opens up new questions --
should

radio,5e-7/7e-7

be an allowed literal here?  Let's not do things like that again.
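
To show how much a client has to guess here, the following Python
sketch parses BAND literals under one possible reading of the
description quoted above: comma-separated items, each either a keyword,
a single wavelength in metres, or a lo/hi range with either end
optional.  That grammar is an assumption made for this example; it is
not the normative SSAP syntax, and none of it can be derived from the
declared datatype or the OPTION list.

```python
def parse_band(literal):
    """Split a BAND literal into keywords and (lo, hi) ranges in metres.

    Open range ends are returned as None.  The grammar is assumed, see
    the accompanying text.
    """
    items = []
    for part in literal.split(","):
        part = part.strip()
        if not part:
            raise ValueError("empty list item")
        if "/" in part:
            lo, _, hi = part.partition("/")
            if "/" in hi:
                raise ValueError(f"more than one '/' in {part!r}")
            items.append((float(lo) if lo else None,
                          float(hi) if hi else None))
            continue
        try:
            value = float(part)
        except ValueError:
            # Not a number, so treat it as a keyword (case-insensitive).
            items.append(part.lower())
        else:
            items.append((value, value))
    return items


print(parse_band("radio,5e-7/7e-7"))
print(parse_band("1e-7/"))
```

A client driven purely by the metadata would only ever offer the
OPTION keywords and would never construct the range forms at all.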

And I cannot resist commenting on Doug's observations that lists are

> a very common and indispensible datatype in most high level languages; a
> range or rangelist is also quite common, and indispensible for many use
> cases.  In the case of parameters like BAND and TIME, rangelists are
> required for many use cases as we need to include or exclude selected

Well, if the part about "indispensable" is true (and I doubt it given
how few services actually understand and correctly implement the
syntax, and the fact that of ~4000 "user" SSA queries with BAND I've
seen here only one contained a comma and none a semicolon), then we
need to figure out how to declare the syntax and semantics supported
by a parameter in the metadata response.  That, or we keep clients in
the dark about what they can and cannot pass to a given service.

But there's a deeper issue here that goes to the fundamentals of
protocol design: Programming languages are (usually) equivalent to
Turing machines, and there's a good reason for that (most interesting
problems need a Turing machine to solve).  Protocols usually are not,
and there's a host of even better reasons for that (e.g., even
deciding whether such protocol messages are valid might take an
arbitrary amount of time and space).

Now, admittedly lists don't shove us across any line here (typically,
they'd still be in the regular domain), but in principle the argument
"this is practice in programming language X" is not a good one when
we're talking about protocols.  To conclude this digression, I
recommend the (for me) eye-opening talk "The Science of Insecurity":

http://mirror.fem-net.de/CCC/28C3/mp4-h264-LQ/28c3-4763-en-the_science_of_insecurity_h264-iprod.mp4

> spectral regions (or time regions) when filtering data.  Range certainly
> is a mandatory basic construct, and a rangelist is a trivial extension
> of the concept.

Range is easily covered by _MIN and _MAX (as even the SSAP spec
itself showcases, p. 53).  Rangelist is *not* a trivial extension of the
concept, though, as evidenced by the fact that while ranges
themselves work marvellously with VOTable data types and
www-form-urlencoded serialization, rangelists do not (without ugly
hacks).

What's this about, part II
==========================

In conclusion, to maybe pull in some passive listeners -- this
discussion is about declaring protocol parameters.  I'm using the SSA
way of declaring those; the problem is the same for DataLink
services, so at least in principle the arguments apply.

What I propose is that if you offer cutouts within a spectral data
cube, you'd say (roughly)

<PARAM name="INPUT:RA_MIN" datatype="double" ucd="pos.eq.ra"
  unit="deg">
  <VALUES><MIN>2.3</MIN><MAX>4.2</MAX></VALUES>
</PARAM>
<PARAM name="INPUT:RA_MAX" datatype="double" ucd="pos.eq.ra"
  unit="deg">
  <VALUES><MIN>2.3</MIN><MAX>4.2</MAX></VALUES>
</PARAM>

<PARAM name="INPUT:DEC_MIN" datatype="double" ucd="pos.eq.dec"
  unit="deg">
  <VALUES><MIN>-78</MIN><MAX>-77</MAX></VALUES>
</PARAM>
<PARAM name="INPUT:DEC_MAX" datatype="double" ucd="pos.eq.dec"
  unit="deg">
  <VALUES><MIN>-78</MIN><MAX>-77</MAX></VALUES>
</PARAM>

<PARAM name="INPUT:SPEC_MIN" datatype="double" ucd="em.wl"
  unit="m">
  <VALUES><MIN>4e-7</MIN><MAX>7e-7</MAX></VALUES>
</PARAM>
<PARAM name="INPUT:SPEC_MAX" datatype="double" ucd="em.wl"
  unit="m">
  <VALUES><MIN>4e-7</MIN><MAX>7e-7</MAX></VALUES>
</PARAM>

-- no magic, all of the stuff exists and can be readily used in
VOTable, the clients can figure out the complete physics, and the
standard could still say "If you have declinations, your parameter
must be called DEC" if we want.  This XML is simple to generate, the
messages described are parsed by your HTTP library.
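
As a rough illustration of what a machine client (or a server-side
validator) can do with declarations like these, here is a Python
sketch using only the standard library.  It embeds two of the PARAMs in
the abbreviated notation used above, wrapped in a RESOURCE element
purely so the snippet is a single XML document.  The function names
are hypothetical, and the limit reader accepts both the abbreviated
element-text form and a value attribute on MIN and MAX.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs

# Two of the declarations, copied in the abbreviated notation.
DECLARATION = """\
<RESOURCE>
  <PARAM name="INPUT:RA_MIN" datatype="double" ucd="pos.eq.ra" unit="deg">
    <VALUES><MIN>2.3</MIN><MAX>4.2</MAX></VALUES>
  </PARAM>
  <PARAM name="INPUT:RA_MAX" datatype="double" ucd="pos.eq.ra" unit="deg">
    <VALUES><MIN>2.3</MIN><MAX>4.2</MAX></VALUES>
  </PARAM>
</RESOURCE>
"""


def _limit(values, tag, default):
    """Read a limit from a value attribute or from element text."""
    elem = values.find(tag)
    if elem is None:
        return default
    return float(elem.get("value") or elem.text)


def declared_domains(xml_text):
    """Map HTTP parameter name to (min, max, unit) for INPUT: PARAMs."""
    domains = {}
    for param in ET.fromstring(xml_text).iter("PARAM"):
        name = param.get("name", "")
        values = param.find("VALUES")
        if not name.startswith("INPUT:") or values is None:
            continue
        domains[name[len("INPUT:"):]] = (
            _limit(values, "MIN", float("-inf")),
            _limit(values, "MAX", float("inf")),
            param.get("unit"),
        )
    return domains


def check_request(query, domains):
    """Return a list of problems found in a form-urlencoded query."""
    problems = []
    for name, literals in parse_qs(query).items():
        if name not in domains:
            problems.append(f"{name}: not declared")
            continue
        lo, hi, unit = domains[name]
        for literal in literals:
            try:
                value = float(literal)
            except ValueError:
                problems.append(f"{name}: {literal!r} is not a double")
                continue
            if not lo <= value <= hi:
                problems.append(f"{name}: {value} outside [{lo}, {hi}] {unit}")
    return problems


domains = declared_domains(DECLARATION)
print(check_request("RA_MIN=2.5&RA_MAX=3.0", domains))
print(check_request("RA_MIN=1.0", domains))
```

The same table of domains is what a client would use to pre-fill or
constrain an input form instead of presenting a blank text field.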

If you actually want to tell min and max apart without looking at the
names (which I think would be perfectly all right), it would be
trivial to add, say, meta.max and meta.min as UCD words (we shouldn't
use stat.min and stat.max here).

If you want full STC metadata on this, there's the STC-in-VOTable
note immediately applicable here.

If you want more structure, you could still have (say, once the
PARAMs above carry ID attributes such as ID="RA_MIN")

<GROUP name="INPUT:RA_INTERVAL">
  <PARAMref ref="RA_MIN"/>
  <PARAMref ref="RA_MAX"/>
</GROUP>

The STC-S solution would, as far as my imagination goes, look
something like this:

<some magic> would then tell a client that, in this case, a string
like

Union ICRS Circle 2.2 3.4 4.5 PositionInterval 1.2 2.2 2.3 4.5 
  Position 3.4 4.5 unit deg size 0.001 0.001
SpectralInterval TOPOCENTER 4000 6000 unit Angstrom PixSize 2

would probably be meaningful to the server, whereas

TimeInterval MJD 4567 6384
Box CART3D ICRS 0 1 2 3 4 5 unit km

probably would not.

And yes, we might cut and crop STC-S to make this a feasible problem.
But then STC-S would no longer work as a fairly human-graspable way
to input STC specifications, which would be grave collateral damage,
at least for me (see also the original mail of this thread).

Conclusion: Please, everyone involved in the DataLink effort, think
hard if you need STC-S or any complex geometry at all.  And if you
find you have to, think hard on how to declare valid literals,
ranges, and all that.

Cheers,

        Markus