Datalink document questions

Tue Apr 29 03:17:02 PDT 2014

Dear DAL group,

On Mon, Apr 28, 2014 at 12:32:59PM +0100, Norman Gray wrote:
> leaves me, at least, with an incomplete picture.  Say this
> identifier is ivo://foo/bar or http://example.org/bar (I presume
> the standard is agnostic about which type of URIs it's servicing), 
> 
>   1. Is ivo://foo/bar the identifier for the dataset, or ...
>   2. ...the identifier for a bag of metadata about the dataset?

I'd strongly suggest it's the dataset.  I expect the standard
identifier coming in will be the PubDID ("Dataset Identifier"), whose
name already suggests that it's the dataset that's referenced.

> in papers?  Or, in other terms, if one were to give the 'author' of
> ivo://foo/bar would it be referring to the scientist who generated
> the data, or the datacentre that assembled the {link} information?

The author of the dataset is the scientist.  The publisher metadata
occurs in the registry record of the datalink service, if people care
to create one.

> The same point goes for the (Sect. 3.2) 'description' -- is this
> describing the dataset or the metadata?

If you refer to 3.2.5, that one I'd consider pretty clear -- like all
other items in each row, the description pertains to the "thing"
referenced by the row (except that for ID, you'd have to introduce
the special interpretation "dataset that this is related to").

> Markus mentioned (in passing) that the 'semantics' bit should
> mention a 'self' link.  What would that point to? (...is yet
> another version of this question).

The dataset.  Without this, a client has no way to find out where to
retrieve it.
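
To illustrate, here is a minimal sketch of how a client might pick the
dataset URL out of a {links} response.  It assumes the rows have
already been parsed into dictionaries keyed by the column names, and
that the semantics value for the dataset itself is literally "self",
as suggested above; the URLs and descriptions are made up.

```python
# Hypothetical {links} rows for one dataset.  Column names follow the
# draft; all values are invented for illustration.
links = [
    {"ID": "ivo://foo/bar",
     "access_url": "http://example.org/data/bar.fits",
     "semantics": "self",
     "description": "The dataset itself"},
    {"ID": "ivo://foo/bar",
     "access_url": "http://example.org/preview/bar.jpg",
     "semantics": "preview",
     "description": "Scaled JPEG preview"},
]


def find_self_url(rows, pub_did):
    """Return the access_url of the row marked 'self' for pub_did.

    Returns None when no such row exists, which is exactly the case in
    which a client could not find out where to retrieve the dataset.
    """
    for row in rows:
        if row["ID"] == pub_did and row["semantics"] == "self":
            return row["access_url"]
    return None


dataset_url = find_self_url(links, "ivo://foo/bar")
```

Note that in every row the description and access_url talk about the
thing the row links to, while ID always names the dataset.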

> * Sect 1.2.2: this appears to be talking about provenance -- should
> it say so explicitly?  If the Datalink and Provenance efforts are
> smart, Datalink will be able to use the Provenance work with no or
> minimal extra work.

I'd prefer if we could leave the P-word out of the standard until we
have 1.0 since P* appears to have a strong time-distortion field
around it.

But then I agree there's a relationship between P* and Datalink,
although I'd rather see Datalink as an enabler for P* -- after all,
in P* we need to refer to predecessors, and having some way to say

Thing with <label> belonging to <PubDID>

would at least look more persistent than a simple URL.

But I'd still suggest we shelve all this until a P* DM actually says
how they'd like it.  I *suspect* an additional label column would fit
the bill, but in P* things are never as simple as they seem, so let's
steer datalink clear of the time-distortion field around
Prov...e.....n..............

> * Sect 1.2.6: para 1 here seems to be describing something very
> like PDL.  I know that that's intended for simulations, and that
> one of MarkT's responses to the PDL TCG review was to hope that PDL
> would be confined to theoretical services.  That said, this
> paragraph appears to be saying so very clearly that there's an
> analogous need for data services, that it starts to seem perverse
> not to mention PDL.

My 2 cents on this is that PDL would ideally be developed into a
VO-DML described data model for service parameters.  If this can, as
I truly hope it will, be made to use VOTable PARAMs as atomic
descriptors, then the existing datalink parameters can be annotated
with PDL later on (just as they should be annotated with STC now).

There are a few too many ifs in there, though, for this to make it
into a standard IMHO.

> Sect 3.2.2: Is the access_url cacheable?  At one extreme this could
> be just a URL for an FTP service, or something like that; at
> another, this could be a staged file with an unpredictable URL that
> will disappear in some short period.  I think it makes good sense
> both ways, but it might be worth a sentence discussing this.

I'd not want to write that.  HTTP has lots and lots of language on
this, and there's so much subtlety to this that anything not
basically just pointing there would probably break more than it would
fix.

As to non-http URLs, I'd be unconcerned.  For most of these the issue of
cacheability is more or less implied.

So: I'd keep my mouth shut on all of this.

> Sect 4.1: this section is about linking between this Datalink
> element and other resources (yes?). If this is about linking to
> other resources, why isn't it being done in the "{links}" table?
> That sort of linking appears to be exactly what Datalink is about,
> so I'm a bit confused about why this extra section is here, talking
> about an apparently completely different linking mechanism.  (not
> to mention that it's a seriously complicated/confusing mechanism)

Hm -- if this confuses even you, we need to do a much better job
explaining what this is about.

Let me try: The problem domain to solve includes stuff like adding
cut-out, format conversion, normalization, or other server-side
manipulation capabilities to DAL (S*AP, possibly even TAP) services.
For instance, you (as a client) do a SIA query, see "ah, this service
can return scaled jpegs of cutouts", and then retrieve those as
previews until your user has found what she's looking for.  Or you
see that the products are really cubes, ask the user which of the
wavelengths they contain she wants, and only retrieve data for those.

Handing out datalinks rather than links to the actual datasets from
the DAL protocol could solve this as well, but it would make life much
harder for trivial clients.  It would also require two accesses per
retrieval, since in such an architecture a client cannot know about
any regularities across the datalink descriptions ("are the parameters
all the same for all the datasets?").

What exactly do you find complicated and confusing here?  The only
thing that's added wrt what's in {links} is that here you need to say
where the ID should be coming from as we don't want to make too many
assumptions about the structure of the primary response (which, after
all, could be the result of an obscore query).
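
As a rough sketch of the idea (not the serialization in the draft):
the service description says once, for the whole primary response,
where the service lives, which column the ID comes from, and which
parameters it accepts.  The dictionary layout, the "id_column" key, the
cutout URL and the BAND_MIN/BAND_MAX parameter names below are all
hypothetical; obs_publisher_did is the obscore column one would
typically expect the ID to come from.

```python
from urllib.parse import urlencode

# Hypothetical description of a cutout service attached to a response.
cutout_service = {
    "access_url": "http://example.org/cutout",
    "id_column": "obs_publisher_did",    # where the ID comes from
    "params": ["BAND_MIN", "BAND_MAX"],  # extra parameters accepted
}

# Hypothetical rows of the primary response, e.g. an obscore query.
primary_rows = [
    {"obs_publisher_did": "ivo://foo/bar", "dataproduct_type": "cube"},
    {"obs_publisher_did": "ivo://foo/baz", "dataproduct_type": "cube"},
]


def build_request_url(service, row, **user_params):
    """Build a request URL for one row of the primary response."""
    unknown = set(user_params) - set(service["params"])
    if unknown:
        raise ValueError("unsupported parameters: %s" % sorted(unknown))
    # The ID is taken from the column the service description names.
    query = {"ID": row[service["id_column"]]}
    query.update(user_params)
    return service["access_url"] + "?" + urlencode(query)


# One description serves every row, so no extra access per dataset.
urls = [build_request_url(cutout_service, row, BAND_MIN="5e-7")
        for row in primary_rows]
```

The point of the sketch is that the client learns the parameters once
and can then build requests for all rows without fetching a separate
datalink document per dataset.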

Cheers,

          Markus