Datalink data access services

Thu Dec 12 05:04:56 PST 2013

Dear DAL folks,

It's Christmas time!  As a special present, I'm bringing you a longer
piece on... well, read on.  It's something like 400 lines, and
contains quite a bit of standardese, so you may want to reserve a quiet
moment for this.  I ask for your patience, though, as I believe this
will let us save a complete standards text, which would be far more work
overall.

As I've already tried to convey in Waikoloa, I'm convinced datalink
doesn't need terribly much to just work(TM) as a generic server-side
processing service -- this has been discussed as "accessData" in
Waikoloa --, pretty much in the way the proposed (and implemented) SSA
getData extension already works (e.g.,
http://docs.g-vo.org/talks/2012-urbana-ssaevo.pdf; note that we're
withdrawing SSA getData for the more general datalink).

In this mail I'm proposing changes to WD-DataLink-1.0-20131022 that,
based on our experiences with SSA getData, should allow this.  I've not
tried formatting the changes, but I believe the document shouldn't grow
by more than two pages (excluding example data).

(Almost) everything described here is already implemented in DaCHS;
we're going to provide proof-of-concept client-side support for spectral
manipulations in SPLAT RSN.

I'll present these changes in chapters, enumerated with uppercase
letters for easier references in praise or flame.

(A) Promise a bit more in the introduction
------------------------------------------

I propose to replace section "1.2.5 Free or Custom Services" with:

  1.2.5 Data Access Services

  In many data access scenarios, server-side processing of data is
  highly desirable, typically to reduce the amount of data to be
  transferred.  Examples for such operations are cutouts, slicing of
  cubes, and rebinning to a coarser grid.  Other examples for server-side
  operations include on-the-fly format conversion or recalibration.  For
  the purpose of this specification, we call such services data access
  services.  Datalink lets servers declare such data access services in a
  way that a generic client can discover what operations are supported,
  their semantics, and the domains of the operations' parameters.  This
  lets clients operate multiple independent data access services behind
  a common user interface, allowing scenarios like "give me all voxels
  around positions X in wavelength range Y of all spectral cubes from
  services Z_1, Z_2, and Z_9".

(B) Forward Reference in Data Discovery Section 
------------------------------------------------

Section 3.2 can become much shorter when there's a chapter on how to
describe services (see (E)).  I'd therefore propose the following text:

  3.2 Data Access Services in Discovery Responses

  \label{sect:dl-in-discovery}

  To communicate the capabilities of a data access service, a
  DALI-compliant discovery service embeds one or more datalink data
  access service resources (see section \ref{sect:dasr}) after the
  VOTable RESOURCE of type "results".  This data access service MUST
  support a parameter with the name ID and the UCD meta.id;meta.main.
  In it, the client MUST pass a discovered identifier.  

  To enable the client to decide which column of the discovery result
  table contains the appropriate identifier, the PARAM element describing
  the ID parameter MUST contain one LINK element with 
  content-role="ddl:id-source", the value of which is a relative URI to
  the FIELD element in the discovery result table (i.e., its id preceded
  by a hash).

  For instance, a result of an ObsCore query could contain

    <FIELD name="obs_publisher_did" ID="datalinkID"
      utype="obscore:Curation.PublisherDID"
      ucd="meta.ref.url;meta.curation"
      xtype="adql:VARCHAR" datatype="char" arraysize="256*" />

  A data access service accepting identifiers from the corresponding
  column would declare its ID parameter in the inputParams GROUP like
  this:

    <PARAM name="ID" arraysize="*" datatype="char" ucd="meta.id;meta.main" 
      value="">
       <DESCRIPTION>The publisher DID of the dataset of interest</DESCRIPTION>
       <LINK content-role="ddl:id-source" value="#datalinkID"/>
    </PARAM>

  The ID value datalinkID is of course arbitrary.  

  As in datalink documents themselves, multiple service resources can be
  present in a single discovery response.  See section \ref{sect:dasr}
  for more details.
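
For illustration only (this is not part of the proposed text): a minimal
Python sketch of how a client might follow the id-source LINK to the
column holding the identifiers.  It assumes a namespace-free VOTable
fragment modelled on the example above; the function name
id_source_field and the sample document are made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free discovery response modelled on the example
# in 3.2; real VOTables carry an XML namespace that would need handling.
SAMPLE = """
<VOTABLE>
  <RESOURCE type="results">
    <TABLE>
      <FIELD name="obs_publisher_did" ID="datalinkID" datatype="char"
        arraysize="256*"/>
    </TABLE>
  </RESOURCE>
  <RESOURCE type="service">
    <GROUP name="inputParams">
      <PARAM name="ID" arraysize="*" datatype="char"
        ucd="meta.id;meta.main" value="">
        <LINK content-role="ddl:id-source" value="#datalinkID"/>
      </PARAM>
    </GROUP>
  </RESOURCE>
</VOTABLE>
"""


def id_source_field(votable):
    """Return the name of the FIELD the ID parameter takes its values from.

    Returns None if no ID parameter with an id-source LINK is found.
    """
    fields_by_id = {f.get("ID"): f for f in votable.iter("FIELD")}
    for param in votable.iter("PARAM"):
        if param.get("name") != "ID":
            continue
        for link in param.findall("LINK"):
            if link.get("content-role") != "ddl:id-source":
                continue
            # The LINK value is a relative URI: '#' followed by the FIELD id.
            target = link.get("value", "")
            field = fields_by_id.get(target.lstrip("#"))
            if field is not None:
                return field.get("name")
    return None


print(id_source_field(ET.fromstring(SAMPLE)))
```

With the column name in hand, a client can pull the identifiers out of
the result table and feed them to the service's ID parameter.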

Incidentally, I believe the document would work better if Section 3 were
shifted towards the back of the document so people already know what
we're talking about when reading this.

I've suggested the LINK-based technique to link (pun intended) the PARAM
and the FIELD with the identifier earlier this week and Pat appeared
underwhelmed.  I'm still convinced the LINK technique is preferable to
the current WD technique (that, for the example, would look somewhat
like this:

  <GROUP>
    <PARAM name="ID" arraysize="*" datatype="char" ucd="meta.id;meta.main" 
      value="">
       <DESCRIPTION>The publisher DID of the dataset of interest</DESCRIPTION>
    </PARAM>
    <FIELDref ref="datalinkID"/>
  </GROUP>

on grounds that):

(a) it's more explicit; sibling elements within an (unadorned) GROUP
    have no explicit semantics, a LINK child of a PARAM does
(b) it's easier to relate parents and children in DOMs than siblings
    in my experience (try writing xpath expressions for picking out
    the FIELD id for both solutions)
(c) GREAT TAGS SAVINGS -- SAME CONTENT FOR NOT 10, NOT 20, NO,
    A SENSATIONAL 30% LESS!
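
To make point (b) concrete, here is a rough sketch of the two lookups
using the limited XPath subset of Python's xml.etree.ElementTree.  The
two fragments are namespace-free simplifications of the snippets above,
and the variable names are made up; other XPath engines may let you
phrase the sibling case differently.

```python
import xml.etree.ElementTree as ET

# LINK technique: the pointer to the FIELD is a child of the ID PARAM.
LINK_VARIANT = """
<RESOURCE type="service">
  <GROUP name="inputParams">
    <PARAM name="ID" arraysize="*" datatype="char"
      ucd="meta.id;meta.main" value="">
      <LINK content-role="ddl:id-source" value="#datalinkID"/>
    </PARAM>
  </GROUP>
</RESOURCE>
"""

# Current WD technique: the pointer is a sibling of the ID PARAM.
FIELDREF_VARIANT = """
<RESOURCE type="service">
  <GROUP>
    <PARAM name="ID" arraysize="*" datatype="char"
      ucd="meta.id;meta.main" value=""/>
    <FIELDref ref="datalinkID"/>
  </GROUP>
</RESOURCE>
"""

# Parent to child: step from the ID PARAM straight down to its LINK.
link = ET.fromstring(LINK_VARIANT).find(
    ".//PARAM[@name='ID']/LINK[@content-role='ddl:id-source']")
print(link.get("value"))

# Sibling: step down to the ID PARAM, back up to its GROUP, down again.
ref = ET.fromstring(FIELDREF_VARIANT).find(
    ".//PARAM[@name='ID']/../FIELDref")
print(ref.get("ref"))
```

Note that the sibling expression picks up any FIELDref in the same
GROUP; nothing in the markup itself says it belongs to the ID PARAM.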

I'd like the FIELDref if we had a proper data model that had a
reference-valued field "source of the id parameter".  Then, a <FIELDref
utype="dl:id-param-source" ref="datalinkID"/> would do the trick.  Alas,
enthusiasm for proper data modelling in Datalink is lacking, and I've
given in to trying for that only in Datalink Deluxe, too -- the case in
Datalink is less compelling anyway than in some other standards I could
*cough* SCS *cough* mention.

(C) Allow Links to Metadata Pages?
----------------------------------

You may not have noticed it yet, but in (B) I've sneaked in that the
discovery protocols have descriptors not for datalink services (as in
the WD) but for data access services (i.e., containing parameter
definitions that let clients directly retrieve cutouts and such rather
than having to roundtrip to the datalink metadata services in between).
The rationale is that in multi-service queries it gets fairly tedious
to have to query the servers again to retrieve the datalink metadata.

The price to pay here is a fairly close coupling between the discovery
service and the datalink service.  I am fairly convinced from my
implementation practice that you'll have this coupling anyway, as the
discovery service at least needs to know what identifiers the datalink
service knows about; conversely, when determining ranges of
parameters, it's very convenient to be able to use the discovery result
to compute those (e.g., the range of LAMBDA_X is immediately obtainable
from an SSA result set), so stuffing them into the discovery response
immediately (potentially) saves you more database queries.

After this consideration, I'm proposing to link to the access services
(meaning: with the full argument set) rather than datalink services
(i.e., the metadata producing ones).  I grant you the coupling thing is
serious though, so I'm open to argument whether these should be datalink
services or whether both should be allowed (the worst solution IMHO).

(D) Links to Services
---------------------

As services are now described in a RESOURCE rather than by IVORN, the
serviceType column in the link list has to change.

I would propose, in the table:

serviceDef  reference to the description of a service at accessURL     no

And then

  4.4 serviceDef

  If serviceDef is non-NULL, accessURL points to a service that, in
  general, requires additional parameters to yield a useful result.
  Note that serviceDef can and should be NULL even if accessURL points
  to some resource generated on the fly if accessURL is intended to be
  used with no additional parameters (e.g. direct download links or
  links to dynamic content where all parameters are already included in
  the link, such as on-the-fly preview generation).

  Typically, serviceDef contains a relative URI to a data access service
  descriptor (see \ref{sect:dasr}), i.e., the RESOURCE's id prepended
  with a hash sign.

  [I'd like to strike the following paragraph; even for standard services,
  I'd rather just use a descriptor with its standardId PARAM set]
  The serviceDef column can also contain IVO standardIDs for
  standard IVOA capabilities to indicate that the accessURL points to a
  service complying with that standard (e.g., SSAP), and a standard client
  can be used to operate the service.  The details of how to operate
  such a service with a datalink identifier are defined in the service
  standard [Should we specify it for SIAP and SSAP here?].

(E) Data Access Service Descriptors
-----------------------------------

This would be a new section, probably a subsection of the current
section 5.  Here's some prose I'd suggest:

  5.3 Data Access Service Descriptors

  Data access services are described in VOTable RESOURCE elements with a
  type="service" attribute.  In datalink responses, they must also have an
  ID, as they are referenced from the links table.  In data discovery
  responses, their ID parameter must reference the FIELD in the
  type="results" RESOURCE describing the column containing the dataset
  identifiers (see sect.~\ref{sect:dl-in-discovery}).

  The resource content consists of PARAM elements containing general
  service metadata, a GROUP describing the input parameters of the
  service, and optionally further metadata.

  The following PARAMs, identified by their names, are defined by this
  standard:

  name            description                    required
  accessURL       URL to invoke the capability   yes
  standardID      URI for the capability         no
  IVORN           IVOA registry identifier       no

  The access URL may contain GET-type arguments; clients must parse the
  URL to decide how to add arguments.  Absent a standardID, data access
  services use the GET HTTP method.

  The standardID, if present, contains an IVORN referring to a service
  standard.  This allows supporting more complex service contracts if
  necessary. It also can be used to refer to S*AP services that allow
  retrieval of the data sets.  

  IVORN, if present, is the identifier of the service in the registry.
  [Is there a scenario where that could become useful?  Much as I am a
  Registry aficionado, I can't really see what to do with this IVORN
  here, and I still don't really believe in registering datalink
  services.]

  The GROUP describing the input parameters is identified by having the
  name inputParams.  For each input parameter, there is a VOTable PARAM
  element, the name of which gives the name of the HTTP parameter.
  Implementors SHOULD make sure to give as much metadata as possible here,
  in particular as regards UCDs, units, descriptions useful also to users
  not familiar with the underlying data, as well as ranges of valid values
  or enumerations of the values accepted by each parameter.  A non-empty
  value attribute on a VOTable PARAM should be used as a default for the
  parameter by a client, in particular in user interfaces to the
  service.

  Parameters that have roles in known data models SHOULD be marked up as
  recommended by either [VOTable], the data model documents, or any
  forthcoming IVOA recommendation.  This is particularly important for
  parameters that are part of Space-Time-Coordinates.
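
As a rough illustration of how a client might use such a descriptor
(again, not part of the proposed text): the sketch below reads accessURL
and the inputParams GROUP and builds a GET URL.  The descriptor, the
example.org URL and the function name build_request_url are invented
for this example, the fragment is namespace-free, and sending declared
defaults along with the user's input is just one possible design choice.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical descriptor following the structure described in 5.3.
DESCRIPTOR = """
<RESOURCE type="service" ID="cutout">
  <PARAM name="accessURL" datatype="char" arraysize="*"
    value="http://example.org/dl/get?mode=cutout"/>
  <GROUP name="inputParams">
    <PARAM name="ID" datatype="char" arraysize="*"
      ucd="meta.id;meta.main" value=""/>
    <PARAM name="LAMBDA_MIN" datatype="double" unit="m"
      ucd="param.min;em.wl" value=""/>
    <PARAM name="FORMAT" datatype="char" arraysize="*"
      ucd="meta.code.mime" value="application/fits"/>
  </GROUP>
</RESOURCE>
"""


def build_request_url(resource, user_args):
    """Return a GET URL for the service described by resource."""
    access_url = None
    for param in resource.findall("PARAM"):
        if param.get("name") == "accessURL":
            access_url = param.get("value")
    if access_url is None:
        raise ValueError("descriptor lacks the required accessURL PARAM")

    # Collect declared input parameters and their (possibly empty) defaults.
    declared = {}
    for group in resource.findall("GROUP"):
        if group.get("name") == "inputParams":
            for param in group.findall("PARAM"):
                declared[param.get("name")] = param.get("value", "")

    # Never send a parameter the service has not declared.
    unknown = sorted(set(user_args) - set(declared))
    if unknown:
        raise ValueError("undeclared parameters: " + ", ".join(unknown))

    # Non-empty value attributes act as defaults; user input overrides them.
    args = {name: value for name, value in declared.items() if value}
    args.update(user_args)

    # accessURL may already carry GET arguments, so merge with its query
    # string instead of blindly appending '?...'.
    parts = urlsplit(access_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += sorted(args.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


print(build_request_url(
    ET.fromstring(DESCRIPTOR),
    {"ID": "ivo://example/ds1", "LAMBDA_MIN": "4e-7"}))
```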

(F) Guidelines for Data Access Services
---------------------------------------

Here's where the meat is, finally.  This would become another toplevel
section.  My hope is that this is enough to fulfill our cube access use
case together with all other "simple" cut-out and recalibration use
cases.

  6 Guidelines for Data Access Services and their Use

  Data Access Services will in general be called by a discovery client
  with minimal human intervention.  It is therefore important to enable
  clients to infer the semantics of the parameters offered and to enable
  robust operation.

  On errors, data access services MUST NOT return 200 status codes, as
  clients have in general no way to tell error messages from useful
  content in a data access context.  Instead, they SHOULD use one of 400
  (bad syntax), 404 (not found), 422 (semantics problems), or 500 (server
  problems).  Clients SHOULD be prepared to handle other HTTP status codes
  like redirects (301 or 303) and authentication requests (401).
  Error messages MUST be text/plain and should be rendered without
  reflowing.

  Data access services SHOULD raise an error when they receive a parameter
  unknown to them, and clients SHOULD NOT pass a parameter to a service
  unless it has declared support for it in its data access service
  resource.  The purpose of this recommendation is that ignored parameters
  could mean downloads of gigabytes rather than kilobytes in a data access
  context; this should not happen due to a client error.

  Data access services SHOULD return the unmodified dataset when
  passed the ID only.

  If a specific set of arguments yields the empty data set, Data Access
  Services SHOULD return an empty table, image, or the like, rather than
  an error message.  Datalink clients SHOULD NOT render such data sets or
  give modal feedback for them but give their users some way to diagnose
  that empty datasets have been returned in a nonmodal way (e.g., in a
  separate log window).
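
A server-side sketch of these rules, for illustration only.  Which
status code goes with which failure is an assumption made here (the text
above only lists the codes); the function name check_request and its
arguments are made up, and LAMBDA_MIN/LAMBDA_MAX merely stand in for any
pair of range parameters.

```python
def check_request(declared, received, known_ids):
    """Return (status, message) for a data access request.

    declared  -- set of parameter names from the service descriptor
    received  -- dict mapping parameter names to the strings sent
    known_ids -- set of dataset identifiers the service can serve

    The message would be sent as text/plain; it is empty on success.
    """
    # Unknown parameters are an error rather than being silently ignored.
    unknown = sorted(set(received) - set(declared))
    if unknown:
        return 400, "Unknown parameter(s): " + ", ".join(unknown)
    if "ID" not in received:
        return 400, "Missing required parameter: ID"
    if received["ID"] not in known_ids:
        return 404, "No such dataset: " + received["ID"]

    # Syntactically broken numbers are 400, inconsistent ranges are 422
    # (an assumed mapping, see the lead-in).
    try:
        lo = float(received.get("LAMBDA_MIN", "-inf"))
        hi = float(received.get("LAMBDA_MAX", "inf"))
    except ValueError:
        return 400, "LAMBDA_MIN and LAMBDA_MAX must be numbers"
    if lo > hi:
        return 422, "LAMBDA_MIN must not exceed LAMBDA_MAX"

    # Only the ID given: the unmodified dataset would be returned.
    return 200, ""


DECLARED = {"ID", "LAMBDA_MIN", "LAMBDA_MAX"}
KNOWN = {"ivo://example/ds1"}
print(check_request(DECLARED, {"ID": "ivo://example/ds1"}, KNOWN))
print(check_request(DECLARED, {"ID": "ivo://example/ds1", "BAND": "V"}, KNOWN))
```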

  6.1 Common Parameters

  Many data access services share the axes along which cutouts are to be
  performed or ways to manipulate the data.  It is highly desirable that
  clients can drive them in the same way to achieve multi-service
  operation.  They SHOULD use the following parameter characteristics if
  appropriate, and they MUST NOT use them if the corresponding physics is
  different.

  In general, data access services will accept a dataset identifier under
  the ID parameter used by the datalink service.  Clients MUST pass in
  the dataset identifier(s) given in the links table if the service
  supports ID.  Implementors may choose to encode the dataset identifier
  in the service's accessURL in some other way, but they are mildly
  discouraged from doing so.

  The parameter characteristics consist of parameter name, ucd, and unit.
  Clients SHOULD only assume the semantics given here if all three
  characteristics match; while distinguished in the table, neither
  clients nor servers should distinguish between an empty and a missing
  unit attribute on the PARAM elements.

  parameter name  UCD                                unit     remarks
  ID              meta.id;meta.main                  (none)
  DEC_MIN         param.min;pos.eq.dec               deg      (1)
  DEC_MAX         param.max;pos.eq.dec               deg      (1)
  RA_MIN          param.min;pos.eq.ra                deg      (1,2)
  RA_MAX          param.max;pos.eq.ra                deg      (1,2)
  LAT_MIN         param.min;(see remark)             deg      (3)
  LAT_MAX         param.max;(see remark)             deg      (3)
  LON_MIN         param.min;(see remark)             deg      (2,3)
  LON_MAX         param.max;(see remark)             deg      (2,3)
  LAMBDA_MIN      param.min;em.wl                    m        (4)
  LAMBDA_MAX      param.max;em.wl                    m        (4)
  FORMAT          meta.code.mime                     (none)   (5)
  SPECRP          spect.resolution                   (empty)  (6)
  FLUXCALIB       phot.calib                         (none)   (7)
  PIX(n)_MIN      param.min;pos.pixel.ax(n)          pixel    (8)
  PIX(n)_MAX      param.max;pos.pixel.ax(n)          pixel    (8)
  KIND            meta.code                          (none)   (9)

  Remarks
  (1) ICRS coordinates; use LAT, LON for other systems
  (2) Stitching line is assumed at 360 degrees; both _MIN and _MAX are
      always positive, _MAX may assume values > 360
  (3) LAT and LON can refer to all kinds of spherical coordinate systems,
      which will determine the UCD; do declare VOTable STC metadata for these
      parameters.  UCD fragments legal here include pos.eq.ra,
      pos.galactic.lon, pos.supergalactic.lon, pos.ecliptic.lon,
      pos.spher.lon for LON, and their lat counterparts for LAT.
  (4) Even if a specific community uses frequencies or energy, spectral
      cutouts should be possible by wavelength. Additional specification
      according to community practices is, of course, allowed, but the
      preferred solution is conversion on the presentation layer (i.e.,
      community-targeted UIs presenting input options according to
      community practices).
  (5) This is for format conversion; the permitted values must be
      enumerated in the PARAM's VALUE child
  (6) This is for on-the-fly rebinning along spectral coordinates
  (7) This is for on-the-fly recalibration; the calibrations understood
      by the service must be enumerated in the PARAM's VALUE child.
  (8) Here, n is an axis index, 1-based in accordance with FITS usage.
      These are intended for pixel-wise cutout, in particular after the 
      client has obtained dataset metadata.
  (9) KIND admits an open enumeration of "things" to ask for.  Values
      supported by a service must be enumerated in the PARAM's VALUE
      child.  Predefined values include HEADER (metadata; for FITS images,
      the primary header), HEADERn (the n-th header, 1-based, for
      compound data), DATAn (the n-th data-metadata combination -- e.g.,
      HDU in FITS files -- in compound data).  Please use the predefined
      values if appropriate for your data; do not use them if these concepts
      do not match your data.
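
One possible reading of remark (2), sketched in Python for illustration:
a range crossing RA = 0 is written with RA_MAX > 360 (e.g., 350 to 370),
and a service tests positions against it as below.  The function name
ra_in_range is made up, and the handling of ranges spanning 360 degrees
or more is an assumption the remark itself does not spell out.

```python
def ra_in_range(ra, ra_min, ra_max):
    """True if right ascension ra (deg) lies within [ra_min, ra_max].

    ra_min and ra_max follow remark (2): both are non-negative and
    ra_max may exceed 360 for ranges crossing the stitching line.
    """
    if not 0 <= ra_min <= ra_max:
        raise ValueError("need 0 <= RA_MIN <= RA_MAX")
    if ra_max - ra_min >= 360:
        # Assumed: a range this wide simply covers the full circle.
        return True
    ra = ra % 360
    # Test the position itself and its copy beyond the stitching line.
    return ra_min <= ra <= ra_max or ra_min <= ra + 360 <= ra_max


# RA_MIN=350, RA_MAX=370 selects the 20 degrees centred on RA = 0.
for ra in (355, 5, 180):
    print(ra, ra_in_range(ra, 350, 370))
```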

  [Instead of _MAX and _MIN, there's something to be said for _VAL and
  _SIZE, in particular less hassle with spherical coordinates.  Choose one
  and stick with it, I'd say; PLEASE help contribute the parameters you
  want here.]

  Additional parameter characteristics can be added to this table in minor
  revisions of this document.
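
The matching rule for parameter characteristics (name, UCD and unit all
have to agree, with empty and missing units treated alike) could be
sketched on the client side as follows.  Only a few rows of the table
are copied here, and the names COMMON_PARAMS and has_common_semantics
are made up for this illustration.

```python
# A subset of the table in 6.1: name -> (UCD, unit).  An empty string
# stands for both "(none)" and "(empty)" units.
COMMON_PARAMS = {
    "ID": ("meta.id;meta.main", ""),
    "DEC_MIN": ("param.min;pos.eq.dec", "deg"),
    "DEC_MAX": ("param.max;pos.eq.dec", "deg"),
    "RA_MIN": ("param.min;pos.eq.ra", "deg"),
    "RA_MAX": ("param.max;pos.eq.ra", "deg"),
    "LAMBDA_MIN": ("param.min;em.wl", "m"),
    "FORMAT": ("meta.code.mime", ""),
    "SPECRP": ("spect.resolution", ""),
}


def has_common_semantics(name, ucd, unit=None):
    """True if name, UCD and unit all match a row of the table.

    unit may be None (attribute missing) or '' (attribute empty); the
    two are deliberately not distinguished.
    """
    expected = COMMON_PARAMS.get(name)
    if expected is None:
        return False
    return expected == (ucd, unit or "")


print(has_common_semantics("LAMBDA_MIN", "param.min;em.wl", "m"))
print(has_common_semantics("LAMBDA_MIN", "param.min;em.wl", "Angstrom"))
print(has_common_semantics("FORMAT", "meta.code.mime"))
```

A client would fall back to a generic input widget for any parameter
where this returns False, rather than guessing at its meaning.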

(G) Example documents
---------------------

Everyone loves examples.  Validators are even better, but they're more
work, too.  So, I propose two non-normative appendices with one example
document each for the datalink and data-discovery case.  I'm giving
links to live documents here.  I suspect it'd be a good idea to
hand-edit to cut some crap and maybe add some contrived and
"interesting" features once we know what these could be.

This stuff hasn't been validated, and no client has been implemented
against it yet. If there's disagreement between the document content and
the proposed text, the document content probably is wrong.

Maybe these examples should not be part of the document text but rather
stored somewhere else and just linked?  They'll badly blow up the
document otherwise, and the forests of the world will hate us.

  Appendix A: Datalink Example Document (non-normative)

could contain, e.g., the output of:

curl -FID="ivo://org.gavo.dc/~?califa/data/V500/reduced_v1.3c/NGC6310.V500.rscube.fits" "http://dc.zah.uni-heidelberg.de/califa/q/dl/dlmeta" | xmlstarlet fo

I'm happy to clean that up as required.

   Appendix B: Data Discovery Example Document (non-normative)

could contain, e.g., the output of

curl -FTARGETNAME="15 Mon" -FREQUEST=queryData http://dc.zah.uni-heidelberg.de/feros/q/ssa/ssap.xml | less

-- here, all the SSAP verbosity should pretty certainly be cut away.  The
output right now contains one datalink service descriptor, pointing to
the metadata service as envisioned by the current WD, and one data
access service descriptor, pointing to the actual service as proposed
here.  As mentioned above, I'm very sure one of them should disappear;
I'm keeping both at the moment in case people want to try both
alternatives.

Finally, here's a list of new UCDs we should try and get into the
"big list": param.min, param.max, pos.spher.(lon|lat), pos.pixel.ax1..ax7

(I have my doubts about the ax1..ax7, too, but it would be symmetric
with what's there for other pos.* UCDs).

Cheers,

             Markus