WD-DataLink-1.0

Tue Dec 10 05:21:24 PST 2013

Dear DAL list,

On Fri, Oct 25, 2013 at 10:40:52AM -0700, Patrick Dowler wrote:
> 
> The first official and more or less complete WD for DataLink is now
> available in the document repository (in the Documents in Progress
> section).
> 
> Direct link is here:
> 
> http://www.ivoa.net/documents/DataLink/index.html

I've now updated my prototype datalink service to something "pretty
close" to the WD (for some features, I chose to implement the changes
I'm suggesting below).

This is a longish mail, but quite a bit of it is essentially editorial.
I've tried to put what I think might be contentious near the top.

There's also the bigger issue that I'm not entirely happy with the "free
service" part.  This requires a fairly large chunk of text that I'm
still preparing.

Is Datalink a stand-alone service?
----------------------------------

In Sect. 2, datalink is designed as a full, registrable, DALI-compliant
service.  The implementation as an extra service is fairly natural, and
so that's what I've done, too, but I don't think the standard should
mandate this.  In principle, it should be possible to have datalink as a
capability of another service (e.g., the ObsTAP one).

My arguments:

(1) I expect datalink services to be fairly closely bound to concrete
data collections and hence services, as in general you'll need to know
quite a bit about your data.

(2) Having a separate registry entry for a datalink service clutters the
registry with services that need no discovery, at least not as long as
you cannot discover what IDs a service will have data for.

(3) VOSI availability and examples can be re-used from the embedding
service with no loss of functionality.

This would mean striking the entire text between "2 Resources" and the
2.1. headline, and probably renaming the section "2 The Datalink
Endpoint" (or "capability" if you prefer).  This would entail some
editorial changes elsewhere (I'm making suggestions further down).

RESPONSEFORMAT?
---------------

Is there a use case for that?  This appears to me an overgeneralization,
and it's a liability if we more or less require certain metadata to be
transferred; this, in particular, concerns STC metadata, but even
trivialities like the unit of contentLength, for which there's language
in the draft just to support RESPONSEFORMAT.  It also seems highly
doubtful that service metadata could usefully be transferred in formats
other than VOTable.

The use case "support naive javascript clients" is, I would argue,
already satisfied by requiring TABLEDATA serialization for the
VOTable response.

If we're going to go forward with this, we'll have to severely limit
what we can express in datalink responses, or we'll have to accept
dramatically different semantics in differing output formats.

Killing RESPONSEFORMAT would also do away with 5.1.2, which is good, as
optional features are the curse of interoperability...

Case issues
-----------

I'm in favour of explicitly saying that at least the ID parameter is
*not* case-insensitive.

Then, the column names in the table described in "4 List of Links" are
camelCase.  I'm not a big fan of that when we're talking about names
that might end up in an actual SQL-based database; granted, that's not
what we recommend now, but I can totally see exposing a database table
of "pre-rendered" datalinks via TAP.

When that happens, we don't want mixed case in there.  The reason is
that SQL becomes really mystifying when you have delimited identifiers
in mixed case, and in particular the MySQL crowd has a tendency to
over-delimit.  What happens then is that

select accessURL

will fail, as will

select accessUrl,

select accessurl

and everything else except

select "accessURL".

It's easy to mitigate this kind of issue by using all-lowercase
identifiers with words separated by underscores.  So, I'd like to have

id, access_url, error_message, service_type, semantics, content_type,
content_length

as column names (where, of course, I'd on principle prefer if concepts
that exist in obscore had the same name in both obscore and datalink).
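
For implementations that already carry the camelCase names around, the
renaming could be mechanical.  Here is a minimal Python sketch, assuming a
simple rule: put an underscore wherever a lower-case letter or digit is
followed by a capital, then lower-case everything.  The helper name
to_snake_case is made up for this mail, and the column list is simply the
one from the UCD section further down.

```python
import re

# Column names as they appear in this mail's UCD list.
WD_COLUMNS = [
    "ID", "accessURL", "serviceType", "errorMessage",
    "description", "semantics", "contentType", "contentLength",
]


def to_snake_case(name):
    """Return name in lower case with underscores at camelCase word breaks.

    Hypothetical helper, not part of any standard or library.
    """
    # Zero-width match between a lower-case letter or digit and a capital.
    with_breaks = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return with_breaks.lower()


if __name__ == "__main__":
    for column in WD_COLUMNS:
        print(column, "->", to_snake_case(column))
```

Runs of capitals such as "URL" or "ID" are kept together by this rule, so
accessURL maps to access_url rather than access_u_r_l.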

2.4 Capabilities
----------------

If we agree on datalink being an auxiliary endpoint rather than a
full-fledged DALI service, then this section would become:

  A service with one or more Datalink endpoint(s) SHOULD declare them
  in its VOSI capabilities resource as well as its registry record.  The
  capability is a standard VOResource capability (i.e., there is no
  dedicated Datalink registry extension) with a standard id of

    ivo://ivoa.net/std/DataLink/v1.0

  The capability MUST have at least one interface of the type
  vs:ParamHTTP, where vs corresponds to the namespace
  http://www.ivoa.net/xml/VODataService/v1.1 or any earlier or later
  namespace URI of VODataService version 1.x.  As usual in Registry
  documents, the recommended namespace prefixes (in this case, vs) SHOULD 
  be used if at all possible.

  Here is an example for such a capability [or rather have that in
  an appendix with a clear indication that this is not normative?]:

  <capability 
    xmlns:vs="http://www.ivoa.net/xml/VODataService/v1.1" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    standardID="ivo://ivoa.net/std/DataLink/v1.0">
    <interface xsi:type="vs:ParamHTTP">
      <accessURL use="base"
        >http://example.com/datalink</accessURL>
      <queryType>GET</queryType>
      <resultType>application/x-votable+xml;content=datalink</resultType>
      <param std="true">
        <name>ID</name>
        <description>The publisher DID of the dataset of interest</description>
        <ucd>meta.id;meta.main</ucd>
        <dataType>string</dataType>
      </param>
    </interface>
  </capability>

  Multiple capability elements with the Datalink standard identifier may
  be included in a capabilities element; this is typically used if they
  differ in protocol (http vs. https) and/or authentication requirements.

I've taken the liberty of changing the standardID; IMHO these should
resolve to actual standard resource records, and thus any URI referring
to a fragment is out with current VOResource.

Note that that would still allow people to register standalone datalink
services if that's what they want.

Now, if we want to allow discovery queries on these (i.e., query all
known datalink services to see which has a dataset), we should
explicitly say so and urge people to register (DaCHS, for example, does
not by default create a public capability for a datalink endpoint unless
the user orders it; this is for consistency, code simplicity, and to
reduce registry clutter.  If global discovery is what we want, I'd
change this policy).

Oh, and the use="base" vs. use="full" -- I've always understood this as:
on GET-based services with parameters, we have use="base".  I'm open
for enlightenment, though.
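
To make my reading of use="base" concrete: a client would take the
accessURL from the capability and append the parameters itself.  Below is
a minimal Python sketch using only the standard library.  The function
name, the sample ID, and the shortened copy of the example capability are
illustrative assumptions, not anything the WD prescribes.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"
DATALINK_STD_ID = "ivo://ivoa.net/std/DataLink/v1.0"

# Shortened copy of the example capability from this mail.
CAPABILITY = """
<capability
  xmlns:vs="http://www.ivoa.net/xml/VODataService/v1.1"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  standardID="ivo://ivoa.net/std/DataLink/v1.0">
  <interface xsi:type="vs:ParamHTTP">
    <accessURL use="base">http://example.com/datalink</accessURL>
    <queryType>GET</queryType>
  </interface>
</capability>
"""


def build_links_url(capability_xml, dataset_id):
    """Return a GET URL for dataset_id, or None if no usable interface."""
    cap = ET.fromstring(capability_xml)
    if cap.get("standardID") != DATALINK_STD_ID:
        return None
    for interface in cap.findall("interface"):
        # Naive check: assumes the recommended "vs" prefix is in use.
        if interface.get(XSI_TYPE) != "vs:ParamHTTP":
            continue
        access_url = interface.find("accessURL")
        if access_url is None or access_url.get("use") != "base":
            continue
        base = (access_url.text or "").strip()
        # Append to an existing query string if there is one.
        separator = "&" if "?" in base else "?"
        return base + separator + urlencode({"ID": dataset_id})
    return None
```

The string comparison on xsi:type is exactly the kind of shortcut that
only works when people stick to the recommended prefix; a careful client
would resolve the prefix against the namespace declarations instead.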

3.2 Service Resources
---------------------

I'll say a bit more about those in the upcoming data service proposal,
but even if that is rejected, the proposed method of communicating which
column contains the IDs... ahem... has potential for beautification.
The ideal solution here would be a FIELDref with a utype that tells a
client it's the ID source.  

I don't want to clobber that, as I still have hopes we'll have proper
VO-DML accompanying a future version of Datalink, which will offer a
clean way of expressing this without having to do a lot of specification
work.

Meanwhile, we have to bring together a PARAM (presumably) with a field
reference.  I claim that, rather than using a naked GROUP, it's much more
straightforward to (ab)use the LINK child of PARAM.  This would mean
striking the text between "To call the service, the inner..." and "...in
the result table" and replacing it with something like:

  To determine which column in the result table the values for the ID
  parameter come from, clients evaluate the XPath
  GROUP[@name="inputParams"]/PARAM[@name="ID"]/LINK[@content-role="ddl:id-source"]/@value.
  This contains a fragment identifier (including the hash, which means
  it is a valid relative URI) for the FIELD element describing
  the corresponding column in the primary result table.

Note that, again, once we have a proper modelling language in place,
accepted, and supported by libraries, this kind of ad-hoc hack won't be
necessary any more, so I'm not claiming that this is some sort of
precedent.

The example resource above could then be:

  <RESOURCE type="datalinkService">
    <GROUP name="inputParams">
      <PARAM arraysize="*" datatype="char" 
        name="ID" ucd="meta.id;meta.main" value="">
        <LINK content-role="ddl:id-source" value="#ssa_pubDID"/>
      </PARAM>
    </GROUP>
    <PARAM arraysize="*" datatype="char" 
      name="standardId" 
      value="ivo://ivoa.net/std/DataLink#links"/>
    <PARAM arraysize="*" datatype="char" 
      name="accessURL" 
      value="http://localhost:8080/data/ssatest/c/dlmeta"/>
  </RESOURCE>

[Incidentally: If anyone feels these things should be GROUPs rather than
RESOURCEs, you'd have my vote, but I don't think it matters much at this
point]
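
A client-side lookup along these lines could be sketched in Python as
follows.  This assumes the PARAM/LINK convention proposed above, a
namespace-free VOTable (real ones carry the VOTable namespace), and a
made-up result table with two FIELDs.  ElementTree's XPath subset cannot
select attributes, so the last steps are done in plain Python.

```python
import xml.etree.ElementTree as ET

# Made-up document: a result table plus the example service resource.
VOTABLE = """
<VOTABLE>
  <RESOURCE type="results">
    <TABLE>
      <FIELD ID="ssa_pubDID" name="ssa_pubDID" datatype="char" arraysize="*"/>
      <FIELD ID="ssa_title" name="title" datatype="char" arraysize="*"/>
    </TABLE>
  </RESOURCE>
  <RESOURCE type="datalinkService">
    <GROUP name="inputParams">
      <PARAM arraysize="*" datatype="char"
        name="ID" ucd="meta.id;meta.main" value="">
        <LINK content-role="ddl:id-source" value="#ssa_pubDID"/>
      </PARAM>
    </GROUP>
  </RESOURCE>
</VOTABLE>
"""


def find_id_source_field(votable_xml):
    """Return the FIELD element feeding the ID parameter, or None."""
    root = ET.fromstring(votable_xml)
    for service in root.findall("RESOURCE[@type='datalinkService']"):
        param = service.find("GROUP[@name='inputParams']/PARAM[@name='ID']")
        if param is None:
            continue
        for link in param.findall("LINK"):
            if link.get("content-role") != "ddl:id-source":
                continue
            ref = link.get("value", "")
            if not ref.startswith("#"):
                continue
            # Resolve the fragment against the FIELDs of the document.
            for field in root.iter("FIELD"):
                if field.get("ID") == ref[1:]:
                    return field
    return None
```

With the document above, find_id_source_field(VOTABLE) yields the FIELD
whose ID is ssa_pubDID; the client would then fill the ID parameter from
that column's values.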

UCDs
----

I'd propose the following UCDs for the columns:

  ID               meta.id;meta.main
  accessURL        meta.ref.url
  serviceType      meta.code
  errorMessage     meta.code.error
  description      meta.note
  semantics        meta.code
  contentType      meta.code.mime
  contentLength    phys.size;meta.file

-- where I'd say we should really register new UCDs for accessURL ("the
URL a dataset can be retrieved at", meta.ref.accessURL, say), semantics
("a relationship between a dataset and a web resource",
meta.ref.relationType), and description ("a human-readable elaboration
on the nature of something", meta.description).

I'll suggest the need for several more UCDs in the data service
proposal, so there'd be no need to open a new UCD process just for
those.

I believe the UCDs should go into section 4, not section 5.1.1.

contentLength
-------------

I think the Description in 4.8 should be something more like

  The contentLength column contains an estimate of the amount of data
  that will be returned on retrieval of accessURL.  An order-of-magnitude
  figure here is better than nothing, as it probably will not matter to a
  user very much whether they will be retrieving 40000 or 50000 Bytes.
  It probably will matter whether they will be retrieving 40 kB or 40 GB.

  contentLength is given in Bytes.  This must be reflected in the
  column metadata of the metadata response.

Abstract needs a bit more meat
------------------------------

Here's a suggestion for a somewhat enhanced abstract:

  Datalink is an IVOA-defined protocol intended to allow access to
  artifacts connected to a dataset -- e.g., pieces of complex datasets,
  cutouts, processed and ancillary data, pieces of a dataset's
  provenance, renderings and previews -- behind just a single URL.  It
  thus works as an intermediate data access service that connects
  discovered datasets on the one hand and downloadable resources,
  services that can act upon the data files, and links to related
  resources on the other.  It is intended to be used in connection with
  IVOA data discovery services like Obscore/TAP, SIAP, or SSAP.

Suggestions for clarification
-----------------------------

I'd appreciate some language on what a service should do without
REQUEST.  Since the parameter is kinda superfluous in datalink, it's
tempting to just work without it, but of course that's a liability as it
may hide client bugs.

Then again, if we agree this is not a full DALI service, maybe we can do
away with REQUEST altogether?  IMHO that'd be a step forward (not only
in Datalink:-).

Typos
-----

Sect 1.2.3, "may be of the some" -> "...same"

Sect 1.2.5, "custom Uri" -> "...URI"
No FIELDRef in a convenient location, hence PARAM/LINK for pointer to
pubDID field.

Sect 1.2.6, "response (e.g., recursive" -> "... (i.e., ..."

Sect 4, "size of download" -- I'd rather have "size of resource" here.

Cheers,

          Markus