Datalink document questions

Sun May 18 06:09:08 PDT 2014

Markus, François and all, hello.

This discussion reached a natural pause for breath last month.  I now realise that the datalink session is happening simultaneously with the New Technologies session which I'm chairing, so I won't be able to join in at that point.  Therefore I'd like to add a couple of remarks here which I'd expected to add during the session.

On 2014 Apr 29, at 12:17, Markus Demleitner <msdemlei at ari.uni-heidelberg.de> wrote:

> On Mon, Apr 28, 2014 at 12:32:59PM +0100, Norman Gray wrote:
>> leaves me, at least, with an incomplete picture.  Say this
>> identifier is ivo://foo/bar or http://example.org/bar (I presume
>> the standard is agnostic about which type of URIs it's servicing), 
>> 
>>  1. Is ivo://foo/bar the identifier for the dataset, or ...
>>  2. ...the identifier for a bag of metadata about the dataset?
> 
> I'd strongly suggest it's the dataset.  I expect the standard
> identifier coming in will be the PubDID ("Dataset Identifier"), where
> already the name suggests that it's the dataset that's referenced.

That seems very sensible.  'PubDID' doesn't currently appear in the document, so that's a pointer that might be added in a future revision.

This might be the place to resolve (for example) just what PubDID is.

François said:

>>> in papers?  Or, in other terms, if one were to give the 'author' of
>>> ivo://foo/bar would it be referring to the scientist who generated
>>> the data, or the datacentre that assembled the {link} information?
>> The author of the dataset is the scientist.
> Yes, but the name is created by the curator, or the datacenter. PubDID is created by the curator. Two copies of the same dataset at two different data centers share the same author but have different PubDIDs.

That's not what I would have expected, nor what I'd hope the dataset identifier would identify.

It seems reasonable for those two copies to have different PubDIDs (I suppose), but that obscures the fact that they are the 'same' dataset.  Perhaps they could be called two 'instances' of the same dataset, for example.

That seems to fit naturally in with...

>> in papers?  Or, in other terms, if one were to give the 'author' of
>> ivo://foo/bar would it be referring to the scientist who generated
>> the data, or the datacentre that assembled the {link} information?
> 
> The author of the dataset is the scientist.  The publisher metadata
> occurs in the registry record of the datalink service, if people care
> to create one.

The scientist doesn't care which datacentre is distributing the data.  Also, since we find ourselves automatically talking about _the_ dataset, singular, the intuition is surely that this thing should have a single name.
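
To make the 'instances' idea concrete, here is a minimal sketch in Python.  It is only an illustration under assumptions: the `dataset_id` field, the `Instance` class and the example identifiers are hypothetical, and nothing like them appears in the current document.  It shows two copies with different PubDIDs being recognised as the 'same' dataset through one shared name.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One copy of a dataset held by one datacentre (all fields hypothetical)."""
    pub_did: str     # curator-assigned identifier, differs per datacentre
    dataset_id: str  # assumed shared name for 'the' dataset
    author: str      # the scientist who generated the data


def group_instances(instances):
    """Map each shared dataset name to the PubDIDs of its instances."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst.dataset_id].append(inst.pub_did)
    return dict(groups)


# Two copies of the same dataset at two (made-up) datacentres.
copies = [
    Instance("ivo://centre-a/bar", "ivo://foo/bar", "A. Scientist"),
    Instance("ivo://centre-b/bar", "ivo://foo/bar", "A. Scientist"),
]
print(group_instances(copies))
```

With a shared name like this, a citation in a paper could refer to the dataset itself, while each PubDID still says which datacentre's copy was retrieved.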

>> Markus mentioned (in passing) that the 'semantics' bit should
>> mention a 'self' link.  What would that point to? (...is yet
>> another version of this question).
> 
> The dataset.  Without this, a client has no way to find out where to
> retrieve it.

That makes sense.  That way, if you have this datalink object in isolation, you can find out which dataset it's describing.

>> * Sect 1.2.2: this appears to be talking about provenance -- should
>> it say so explicitly?  If the Datalink and Provenance efforts are
>> smart, Datalink will be able to use the Provenance work with no or
>> minimal extra work.
> 
> I'd prefer if we could leave the P-word out of the standard until we
> have 1.0 since P* appears to have a strong time-distortion field
> around it.

My intention in mentioning this was not to suggest that the datalink information should necessarily include Provenance information, but that it's a natural sort of thing to find here, and it would be good to make it easy to include that.

I think a particularly attractive feature of the datalink standard, as it now stands, is that it _is_ easy to see where this could be added.

>> * Sect 1.2.6: para 1 here seems to be describing something very
>> like PDL.  I know that that's intended for simulations, and that
>> one of MarkT's responses to the PDL TCG review was to hope that PDL
>> would be confined to theoretical services.  That said, this
>> paragraph appears to be saying so very clearly that there's an
>> analogous need for data services, that it starts to seem perverse
>> not to mention PDL.
> 
> My 2 cents on this is that PDL would ideally be developed into a
> VO-DML described data model for service parameters.  If this can, as
> I truly hope it will, be made to use VOTable PARAMs as atomic
> descriptors, then the existing datalink parameters can be annotated
> with PDL later on (just as they should be annotated with STC now).
> 
> There's a few too many ifs in there, though, for it to make it into a
> standard IMHO.

That's true, but I think my motivation for this remark was also, in part, as above.  This is a natural place for extra information to be attached, and making sure the door is open to a PDL-like thing would help the design stay flexible.

>> Sect 4.1: this section is about linking between this Datalink
>> element and other resources (yes?). If this is about linking to
>> other resources, why isn't it being done in the "{links}" table?
>> That sort of linking appears to be exactly what Datalink is about,
>> so I'm a bit confused about why this extra section is here, talking
>> about an apparently completely different linking mechanism.  (not
>> to mention that it's a seriously complicated/confusing mechanism)
> 
> Hm -- if this confuses even you, we need to do a much better job
> explaining what this is about.

I'll revisit that part of the document and perhaps come up with some more specific criticisms.

-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK