Datalink data access services
Markus Demleitner
msdemlei at ari.uni-heidelberg.de
Thu Dec 12 05:04:56 PST 2013
Dear DAL folks,
It's Christmas time! As a special present, I'm bringing you a longer
piece on... well, read on. It's something like 400 lines, and
contains quite a bit of standardese, so you may want to reserve a quiet
moment for this. I ask for your patience, though, as I believe this
will let us save a complete standards text, which would be far more work
overall.
As I've already tried to convey in Waikoloa, I'm convinced datalink
doesn't need terribly much to just work(TM) as a generic server-side
processing service -- this has been discussed as "accessData" in
Waikoloa --, pretty much in the way the proposed (and implemented) SSA
getData extension already works (e.g.,
http://docs.g-vo.org/talks/2012-urbana-ssaevo.pdf; note that we're
withdrawing SSA getData for the more general datalink).
In this mail I'm proposing changes to WD-DataLink-1.0-20131022 that,
based on our experiences with SSA getData, should allow this. I've not
tried formatting the changes, but I believe the document shouldn't grow
by more than two pages (excluding example data).
(Almost) everything described here is already implemented in DaCHS;
we're going to provide proof-of-concept client-side support for spectral
manipulations in SPLAT RSN.
I'll present these changes in chapters, enumerated with uppercase
letters for easier references in praise or flame.
(A) Promise a bit more in the introduction
------------------------------------------
I propose to replace section "1.2.5 Free or Custom Services" with:
1.2.5 Data Access Services
In many data access scenarios, server-side processing of data is
highly desirable, typically to reduce the amount of data to be
transferred. Examples for such operations are cutouts, slicing of
cubes, and rebinning to a coarser grid. Other examples for server-side
operations include on-the-fly format conversion or recalibration. For
the purpose of this specification, we call such services data access
services. Datalink lets servers declare such data access services in a
way that a generic client can discover what operations are supported,
their semantics, and the domains of the operations' parameters. This
lets clients operate multiple independent data access services behind
a common user interface, allowing scenarios like "give me all voxels
around positions X in wavelength range Y of all spectral cubes from
services Z_1, Z_2, and Z_9".
(B) Forward Reference in Data Discovery Section
------------------------------------------------
Section 3.2 can become much shorter when there's a chapter on how to
describe services (see (E)). I'd therefore propose the following text:
3.2 Data Access Services in Discovery Responses
\label{sect:dl-in-discovery}
To communicate the capabilities of a data access service, a
DALI-compliant discovery service embeds one or more datalink data
access service resources (see section \ref{sect:dasr}) after the
VOTable RESOURCE of type "results". This data access service MUST
support a parameter with the name ID and the UCD meta.id;meta.main.
In it, the client MUST pass a discovered identifier.
To enable the client to decide which column of the discovery result
table contains the appropriate identifier, the PARAM element describing
the ID parameter MUST contain one LINK element with
content-role="ddl:id-source", the value of which is a relative URI to
the FIELD element in the discovery result table (i.e., its id preceded
by a hash).
For instance, a result of an ObsCore query could contain
<FIELD name="obs_publisher_did" ID="datalinkID"
utype="obscore:Curation.PublisherDID"
ucd="meta.ref.url;meta.curation"
xtype="adql:VARCHAR" datatype="char" arraysize="256*" />
A data access service accepting identifiers from the corresponding
column would declare its ID parameter in the inputParams GROUP like
this:
<PARAM name="ID" arraysize="*" datatype="char" ucd="meta.id;meta.main"
value="">
<DESCRIPTION>The publisher DID of the dataset of interest</DESCRIPTION>
<LINK content-role="ddl:id-source" value="#datalinkID"/>
</PARAM>
The ID value datalinkID is of course arbitrary.
As in datalink documents themselves, multiple service resources can be
present in a single discovery response. See section \ref{sect:dasr}
for more details.
Incidentally, I believe the document would work better if Section 3 were
shifted towards the back of the document so people already know what
we're talking about when reading this.
I've suggested the LINK-based technique to link (pun intended) the PARAM
and the FIELD with the identifier earlier this week and Pat appeared
underwhelmed. I'm still convinced the LINK technique is preferable to
the current WD technique (that, for the example, would look somewhat
like this:
<GROUP>
<PARAM name="ID" arraysize="*" datatype="char" ucd="meta.id;meta.main"
value="">
<DESCRIPTION>The publisher DID of the dataset of interest</DESCRIPTION>
</PARAM>
<FIELDref ref="datalinkID"/>
</GROUP>
on grounds that):
(a) it's more explicit; sibling elements within an (unadorned) GROUP
have no explicit semantics, a LINK child of a PARAM does
(b) it's easier to relate parents and children in DOMs than siblings
in my experience (try writing xpath expressions for picking out
the FIELD id for both solutions)
(c) GREAT TAGS SAVINGS -- SAME CONTENT FOR NOT 10, NOT 20, NO,
A SENSATIONAL 30% LESS!
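To make point (b) a bit more tangible, here is a minimal Python sketch that picks out the FIELD id for both markup variants with the standard library's ElementTree. The two snippets are cut-down, namespace-free versions of the examples above (real VOTables carry a namespace, so the paths would need prefixes), and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down, namespace-free versions of the two variants (an assumption
# made to keep the paths short).
LINK_VARIANT = """
<GROUP name="inputParams">
  <PARAM name="ID" arraysize="*" datatype="char"
         ucd="meta.id;meta.main" value="">
    <LINK content-role="ddl:id-source" value="#datalinkID"/>
  </PARAM>
</GROUP>
"""

FIELDREF_VARIANT = """
<GROUP>
  <PARAM name="ID" arraysize="*" datatype="char"
         ucd="meta.id;meta.main" value=""/>
  <FIELDref ref="datalinkID"/>
</GROUP>
"""


def id_source_from_link(group):
    """Parent/child case: one path from the ID PARAM down to its LINK."""
    link = group.find(
        "PARAM[@name='ID']/LINK[@content-role='ddl:id-source']")
    return link.get("value").lstrip("#")


def id_source_from_fieldref(group):
    """Sibling case: check for the ID PARAM, then look next to it."""
    if group.find("PARAM[@name='ID']") is None:
        raise ValueError("no ID PARAM in this GROUP")
    return group.find("FIELDref").get("ref")


if __name__ == "__main__":
    print(id_source_from_link(ET.fromstring(LINK_VARIANT)))
    print(id_source_from_fieldref(ET.fromstring(FIELDREF_VARIANT)))
```

In the sibling case the client additionally has to make sure it is looking at the right GROUP, since nothing on the FIELDref itself says what it is for.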
I'd like the FIELDref if we had a proper data model that had a
reference-valued field "source of the id parameter". Then, a <FIELDref
utype="dl:id-param-source" ref="datalinkID"/> would do the trick. Alas,
enthusiasm for proper data modelling in Datalink is lacking, and I've
given in to trying for that only in Datalink Deluxe, too -- the case in
Datalink is less compelling anyway than in some other standards I could
*cough* SCS *cough* mention.
(C) Allow Links to Metadata Pages?
----------------------------------
You may not have noticed it yet, but in (B) I've sneaked in that the
discovery protocols have descriptors not for datalink services (as in
the WD) but for data access services (i.e., containing parameter
definitions that let clients directly retrieve cutouts and such rather
than have to roundtrip to the datalink metadata services in between).
The rationale is that in multi-service queries it gets fairly tedious
to have to query the servers again to retrieve the datalink metadata.
The price to pay here is a fairly close coupling between the discovery
service and the datalink service. I am fairly convinced from my
implementation practice that you'll have this coupling anyway, as the
discovery service at least needs to know what identifiers the datalink
service knows about; conversely, when determining ranges of
parameters, it's very convenient to be able to use the discovery result
to compute those (e.g., the range of LAMBDA_X is immediately obtainable
from an SSA result set), so stuffing them into the discovery response
immediately (potentially) saves you more database queries.
After this consideration, I'm proposing to link to the access services
(meaning: with the full argument set) rather than datalink services
(i.e., the metadata producing ones). I grant you the coupling thing is
serious though, so I'm open to argument whether these should be datalink
services or whether both should be allowed (the worst solution IMHO).
(D) Links to Services
---------------------
As services are now described in a RESOURCE rather than by IVORN, the
serviceType column in the link list has to change.
I would propose, in the table:
serviceDef    reference to the description of a service at accessURL    no
And then
4.4 serviceDef
If serviceDef is non-NULL, accessURL points to a service that, in
general, requires additional parameters to yield a useful result.
Note that serviceDef can and should be NULL even if accessURL points
to some resource generated on the fly if accessURL is intended to be
used with no additional parameters (e.g. direct download links or
links to dynamic content where all parameters are already included in
the link, such as on-the-fly preview generation).
Typically, serviceDef contains a relative URI to a data access service
descriptor (see \ref{sect:dasr}), i.e., the RESOURCE's id prepended
with a hash sign.
[I'd like to strike the following paragraph; even for standard services,
I'd rather just use a descriptor with its standardId PARAM set]
The serviceDef column can also contain IVO standardIDs for
standard IVOA capabilities to indicate that the accessURL points to a
service complying with that standard (e.g., SSAP), and a standard client
can be used to operate the service. The details of how to operate
such a service with a datalink identifier are defined in the service
standard [Should we specify it for SIAP and SSAP here?].
(E) Data Access Service Descriptors
-----------------------------------
This would be a new section, probably a subsection of the current
section 5. Here's some prose I'd suggest:
5.3 Data Access Service Descriptors
Data access services are described in VOTable RESOURCE elements with a
type="service" attribute. In datalink responses, they must also have an
ID, as they are referenced from the links table. In data discovery
responses, their ID parameter must reference the FIELD in the
type="results" RESOURCE describing the column containing the dataset
identifiers (see sect.~\ref{sect:dl-in-discovery}).
The resource content consists of PARAM elements containing general
service metadata, a GROUP describing the input parameters of the
service, and optionally further metadata.
The following PARAMs, identified by their names, are defined by this
standard:
name         description                    required
accessURL    URL to invoke the capability   yes
standardID   URI for the capability         no
IVORN        IVOA registry identifier       no
The access URL may contain GET-type arguments; clients must parse the
URL to decide how to add arguments. Absent a standardID, data access
services use the GET HTTP method.
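As a minimal sketch of what "parse the URL to decide how to add arguments" could mean on the client side, the following Python function appends arguments to an access URL whether or not it already has a query part. The function name and the example URLs are assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_arguments(access_url, arguments):
    """Return access_url with arguments appended to its query string.

    Arguments already present in access_url are kept as they are.
    """
    parts = urlsplit(access_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.extend(arguments.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # Hypothetical access URLs, with and without GET-type arguments.
    print(add_arguments("http://example.org/dl/get", {"ID": "ivo://x/y"}))
    print(add_arguments("http://example.org/dl/get?mode=raw",
                        {"ID": "ivo://x/y"}))
```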
The standardID, if present, contains an IVORN referring to a service
standard. This allows supporting more complex service contracts if
necessary. It also can be used to refer to S*AP services that allow
retrieval of the data sets.
IVORN, if present, is the identifier of the service in the registry.
[Is there a scenario where that could become useful? Much as I am a
Registry aficionado, I can't really see what to do with this IVORN
here, and I still don't really believe in registering datalink
services.]
The GROUP describing the input parameters is identified by having the
name inputParams. For each input parameter, there is a VOTable PARAM
element, the name of which gives the name of the HTTP parameter.
Implementors SHOULD make sure to give as much metadata as possible here,
in particular as regards UCDs, units, descriptions useful also to users
not familiar with the underlying data, as well as ranges of valid values
or enumerations of the values accepted by each parameter. A non-empty
value attribute on a VOTable PARAM should be used as a default for the
parameter by a client, in particular in user interfaces to the
service.
Parameters that have roles in known data models SHOULD be marked up as
recommended by either [VOTable], the data model documents, or any
forthcoming IVOA recommendation. This is particularly important for
parameters that are part of Space-Time-Coordinates.
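For illustration, a client could read such a descriptor roughly as in the following Python sketch. The descriptor is a made-up, namespace-free example following the prose above (accessURL, parameter names, and values are all assumptions), and treating an empty value attribute as "no default" is one reading of the rule on non-empty values.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptor; real VOTables carry a namespace.
DESCRIPTOR = """
<RESOURCE type="service" ID="cutout">
  <PARAM name="accessURL" datatype="char" arraysize="*"
         value="http://example.org/dl/get"/>
  <GROUP name="inputParams">
    <PARAM name="ID" datatype="char" arraysize="*"
           ucd="meta.id;meta.main" value=""/>
    <PARAM name="LAMBDA_MIN" datatype="double" unit="m"
           ucd="param.min;em.wl" value="4.0e-7"/>
  </GROUP>
</RESOURCE>
"""


def read_descriptor(resource):
    """Return (service metadata, input parameter defaults).

    Service metadata comes from the PARAMs directly below RESOURCE;
    defaults come from the PARAMs in the inputParams GROUP, with an
    empty value attribute mapped to None.
    """
    meta = {p.get("name"): p.get("value")
            for p in resource.findall("PARAM")}
    defaults = {}
    for param in resource.findall("GROUP[@name='inputParams']/PARAM"):
        value = param.get("value", "")
        defaults[param.get("name")] = value if value != "" else None
    return meta, defaults


if __name__ == "__main__":
    meta, defaults = read_descriptor(ET.fromstring(DESCRIPTOR))
    print(meta["accessURL"], defaults)
```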
(F) Guidelines for Data Access Services
---------------------------------------
Here's where the meat is, finally. This would become another toplevel
section. My hope is that this is enough to fulfill our cube access use
case together with all other "simple" cut-out and recalibration use
cases.
6 Guidelines for Data Access Services and their Use
Data Access Services will in general be called by a discovery client
with minimal human intervention. It is therefore important to enable
clients to infer the semantics of the parameters offered and to enable
robust operation.
On errors, data access services MUST NOT return 200 status codes, as
clients have in general no way to tell error messages from useful
content in a data access context. Instead, they SHOULD use one of 400
(bad syntax), 404 (not found), 422 (semantics problems), or 500 (server
problems). Clients SHOULD be prepared to handle other HTTP status codes
like redirects (301 or 303) and authentication requests (401).
Error messages MUST be text/plain and should be rendered without
reflowing.
Data access services SHOULD raise an error when they receive a parameter
unknown to them, and clients SHOULD never pass a parameter to a service
unless it has declared support for it in its data access service
resource. The purpose of this recommendation is that ignored parameters
could mean downloads of gigabytes rather than kilobytes in a data access
context; this should not happen due to a client error.
Data access services SHOULD return the unmodified dataset when
passed the ID only.
If a specific set of arguments yields the empty data set, Data Access
Services SHOULD return an empty table, image, or the like, rather than
an error message. Datalink clients SHOULD NOT render such data sets or
give modal feedback for them but give their users some way to diagnose
that empty datasets have been returned in a nonmodal way (e.g., in a
separate log window).
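On the client side, the two rules on undeclared parameters and on status codes might translate into checks like the ones in this Python sketch. The function names and the category labels are made up for illustration; only the status codes themselves are taken from the text above.

```python
def undeclared(declared_names, arguments):
    """Return the argument names the service has not declared.

    A client would refuse to send the request if this is non-empty.
    """
    return sorted(set(arguments) - set(declared_names))


def classify(status):
    """Map an HTTP status code to a coarse category (labels are ad hoc)."""
    if status == 200:
        return "data"
    if status in (301, 303):
        return "redirect"
    if status == 401:
        return "authentication"
    if status in (400, 404, 422, 500):
        return "error"  # body is text/plain, to be shown without reflowing
    return "unexpected"


if __name__ == "__main__":
    print(undeclared(["ID", "LAMBDA_MIN"], {"ID": "x", "BAND": "V"}))
    print(classify(422))
```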
6.1 Common Parameters
Many data access services share the axes along which cutouts are to be
performed or ways to manipulate the data. It is highly desirable that
clients can drive them in the same way to achieve multi-service
operation. They SHOULD use the following parameter characteristics if
appropriate, and they MUST NOT use them if the corresponding physics is
different.
In general, data access services will accept a dataset identifier under
the ID parameter used by the datalink service. Clients MUST pass in
the dataset identifier(s) given in the links table if the service
supports ID. Implementors may choose to encode the dataset identifier
in the service's accessURL in some other way, but they are mildly
discouraged from doing so.
The parameter characteristics consist of parameter name, ucd, and unit.
Clients SHOULD only assume the semantics given here if all three
characteristics match; while distinguished in the table, neither
clients nor servers should distinguish between an empty and a missing
unit attribute on the PARAM elements.
parameter name  UCD                        unit     remarks
ID              meta.id;meta.main          (none)
DEC_MIN         param.min;pos.eq.dec       deg      (1)
DEC_MAX         param.max;pos.eq.dec       deg      (1)
RA_MIN          param.min;pos.eq.ra        deg      (1,2)
RA_MAX          param.max;pos.eq.ra        deg      (1,2)
LAT_MIN         param.min;(see remark)     deg      (1)
LAT_MAX         param.max;(see remark)     deg      (1)
LON_MIN         param.min;(see remark)     deg      (2,3)
LON_MAX         param.max;(see remark)     deg      (2,3)
LAMBDA_MIN      param.min;em.wl            m        (4)
LAMBDA_MAX      param.max;em.wl            m        (4)
FORMAT          meta.code.mime             (none)   (5)
SPECRP          spect.resolution           (empty)  (6)
FLUXCALIB       phot.calib                 (none)   (7)
PIX(n)_MIN      param.min;pos.pixel.ax(n)  pixel    (8)
PIX(n)_MAX      param.max;pos.pixel.ax(n)  pixel    (8)
KIND            meta.code                  (none)   (9)
Remarks
(1) ICRS coordinates; use LAT, LON for other systems
(2) Stitching line is assumed at 360 degrees; both _MIN and _MAX are
always positive, _MAX may assume values > 360
(3) LAT and LON can refer to all kinds of spherical coordinate systems,
which will determine the UCD; do declare VOTable STC metadata for these
parameters. UCD fragments legal here include pos.eq.ra,
pos.galactic.lon, pos.supergalactic.lon, pos.ecliptic.lon,
pos.spher.lon for LON, and their lat counterparts for LAT.
(4) Even if a specific community uses frequencies or energy, spectral
cutouts should be possible by wavelength. Additional specification
according to community practices is, of course, allowed, but the
preferred solution is conversion on the presentation layer (i.e.,
community-targeted UIs presenting input options according to
community practices).
(5) This is for format conversion; the permitted values must be
enumerated in the PARAM's VALUE child
(6) This is for on-the-fly rebinning along spectral coordinates
(7) This is for on-the-fly recalibration; the calibrations understood
by the service must be enumerated in the PARAM's VALUE child.
(8) Here, n is an axis index, 1-based in accordance with FITS usage.
These are intended for pixel-wise cutout, in particular after the
client has obtained dataset metadata.
(9) KIND admits an open enumeration of "things" to ask for. Values
supported by a service must be enumerated in the PARAM's VALUE
child. Predefined values include HEADER (metadata; for FITS images,
the primary header), HEADERn (the n-th header, 1-based, for
compound data), DATAn (the n-th data-metadata combination -- e.g.,
HDU in FITS files -- in compound data). Please use the predefined
values if appropriate for your data, do not use them if these concepts
do not match your data.
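As an illustration of remark (2), here is one possible reading of the stitching-line rule as a Python sketch: an interval crossing RA 0 is expressed with RA_MAX above 360 rather than with a negative RA_MIN. Both function names are made up, and the interpretation itself is an assumption, not something the table spells out.

```python
def ra_interval(center, half_width):
    """Return (RA_MIN, RA_MAX) in degrees around center.

    RA_MIN is folded into [0, 360); RA_MAX may exceed 360 when the
    interval crosses the stitching line.
    """
    ra_min = (center - half_width) % 360
    return ra_min, ra_min + 2 * half_width


def ra_in_interval(ra, ra_min, ra_max):
    """Check whether ra lies in [ra_min, ra_max] under this convention."""
    ra = ra % 360
    return ra_min <= ra <= ra_max or ra_min <= ra + 360 <= ra_max


if __name__ == "__main__":
    limits = ra_interval(5.0, 10.0)
    print(limits, ra_in_interval(2.0, *limits), ra_in_interval(180.0, *limits))
```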
[Instead of _MAX and _MIN, there's something to be said for _VAL and
_SIZE, in particular less hassle with spherical coordinates. Choose one
and stick with it, I'd say; PLEASE help contribute the parameters you
want here.]
Additional parameter characteristics can be added to this table in minor
revisions of this document.
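The matching rule for parameter characteristics (all three of name, UCD, and unit must agree, with an empty and a missing unit treated alike) could be sketched in Python as follows. Only three rows of the table are included, and the short semantic labels as well as the function name are made up for illustration.

```python
# (name, ucd, unit) -> ad hoc label; "" stands for both (none) and (empty).
KNOWN = {
    ("ID", "meta.id;meta.main", ""): "dataset identifier",
    ("LAMBDA_MIN", "param.min;em.wl", "m"): "lower wavelength limit",
    ("LAMBDA_MAX", "param.max;em.wl", "m"): "upper wavelength limit",
}


def known_semantics(param_attrs):
    """Return the label for a PARAM's attributes, or None.

    param_attrs is a mapping of the PARAM element's attributes; a
    missing unit and an empty unit are not distinguished.
    """
    key = (param_attrs.get("name"),
           param_attrs.get("ucd"),
           param_attrs.get("unit") or "")
    return KNOWN.get(key)


if __name__ == "__main__":
    print(known_semantics({"name": "ID", "ucd": "meta.id;meta.main"}))
    print(known_semantics(
        {"name": "LAMBDA_MIN", "ucd": "param.min;em.wl", "unit": "nm"}))
```

A parameter called LAMBDA_MIN but declared in nm would thus not be driven as a standard wavelength limit; the client would have to fall back to presenting it to the user as a service-specific parameter.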
(G) Example documents
---------------------
Everyone loves examples. Validators are even better, but they're more
work, too. So, I propose two non-normative appendices with one example
document each for the datalink and data-discovery case. I'm giving
links to live documents here. I suspect it'd be a good idea to
hand-edit to cut some crap and maybe add some contrived and
"interesting" features once we know what these could be.
This stuff hasn't been validated, and no client has been implemented
against it yet. If there's disagreement between the document content and
the proposed text, the document content probably is wrong.
Maybe these examples should not be part of the document text but rather
stored somewhere else and just linked? They'll badly blow up the
document otherwise, and the forests of the world will hate us.
Appendix A: Datalink Example Document (non-normative)
could contain, e.g., the output of:
curl -FID="ivo://org.gavo.dc/~?califa/data/V500/reduced_v1.3c/NGC6310.V500.rscube.fits" "http://dc.zah.uni-heidelberg.de/califa/q/dl/dlmeta" | xmlstarlet fo
I'm happy to clean that up as required.
Appendix B: Data Discovery Example Document (non-normative)
could contain, e.g., the output of
curl -FTARGETNAME="15 Mon" -FREQUEST=queryData http://dc.zah.uni-heidelberg.de/feros/q/ssa/ssap.xml | less
-- here, all the SSAP verbosity should pretty certainly be cut away. The
output right now contains one datalink service descriptor, pointing to
the metadata service as envisioned by the current WD, and one data
access service descriptor, pointing to the actual service as proposed
here. As mentioned above, I'm very sure one of them should disappear;
I'm keeping both at the moment in case people want to try both
alternatives.
Finally, here's a list of new UCDs we should try and get into the
"big list": param.min, param.max, pos.spher.(lon|lat), pos.pixel.ax1..ax7
(I have my doubts about the ax1..ax7, too, but it would be symmetric
with what's there for other pos.* UCDs).
Cheers,
Markus