Datalink Feedback VI: Semantics

Thu Apr 17 09:11:33 PDT 2014

Markus and all, hello.

I'm a little late joining in this discussion -- I partly wanted to avoid butting in too early.

The short/blunt version of my remarks below is:

  1. I think the 'semantics' column should be URIs, not just a simple string from a pre-specified set.

  2. They should be conceived of as predicates, not types or SKOS concepts.

  3. (I agree that) the standard/preferred/initial/understanding-required set should be maintained separately from the Datalink document.

I'll come back to any remaining specific rationale for these on the other side of some interlineal comments to Markus's and others' remarks.

On 2014 Apr 7, at 18:59, Markus Demleitner <msdemlei at ari.uni-heidelberg.de> wrote:

> So: What are the use cases for semantics?

I think the list here, and the list in section 1.2 of the Datalink WD, are sufficiently long and varied as to dismiss any prospect of the Datalink authors deciding on a useful closed set of 'semantics' terms.  There will _always_ be cases that haven't been thought of, or someone wanting to draw finer distinctions than the Datalink authors' compromises.

That doesn't require a free-for-all.  Leaving the list potentially open-ended still allows the authors to provide a list of terms which have fixed meanings, and which applications are required to understand (for example).

> Based on this and the "preselection on behalf of the user" use case,
> here's a proposal for the start of the vocabulary (with indentation
> meaning a narrower-than relation; I volunteer for turning this into
> well-formed SKOS if people agree this is where we want to go).
> 
> self (the full main dataset)
> science (science data related to or generated from the main dataset)
>  derivation
>    source-list
>    joined-dataset (e.g., stacked images, joined spectrum)
>  source-file (something the current data was made from)
> calibration
> preview
> info
>  log
> auxillary
>  mask

Why SKOS?

I'm not saying it's wrong, but remember that SKOS is for rather vague associations of things with concepts, and broadly for searching or browsing.  The paradigmatic example of a thesaurus (which is what SKOS is) is something like the Dewey Decimal Classification, or Library of Congress classification.  That is: "I want to find a book about cats: which shelf of the library do I go to browse for one?"

More specifically to this case, suppose that you retrieved a {links} resource related to <http://example.ac.uk/myfile> which referred to

    access_url = http://example.ac.uk/mycalibration.fits
    semantics = http://example.ivoa.net/datalink/skos#Calibration

If you're taking SKOS seriously, then you're saying that <http://example.ac.uk/mycalibration.fits> is related to the idea of 'Calibration' (perhaps it's a book about calibration methodologies).  But even if you guess from context that it's actually a calibration file, all this says is that <http://example.ac.uk/mycalibration.fits> is potentially _a_ calibration file, and doesn't say anything (except by vague implication) about which dataset it's the calibration of.

SKOS isn't binary or relational -- it's an annotation of a thing.

Also, SKOS concepts don't have instances -- you can't say 'X isA http://example.ivoa.net/datalink/skos#Calibration'.

OK, you say, but you sort of... kind of... know what I mean, don't you?  Well yes, I probably do, but that vagueness isn't a great place to start out.

> Pat has also raised the question of whether the terms should actually be
> names of relations.  I believe this is inspired by RDF-like triples,
> which look somewhat like
> 
> <entity1> <relation> <entity2>.
> 
> Since entity2 could be "the link in this row" and entity1 "the dataset
> referred to in the id column", I might like this relation thing.  But
> then I didn't quite like that in the end.  Here's how that came
> about:  Suppose entity2 really is a big log of observation entries
> for the Z observatory.

In this view, the 'semantics' relation is indeed a predicate.  Saying

    access_url = http://example.ac.uk/mycalibration.fits
    semantics = http://example.ivoa.net/datalink/info#hasCalibration

has a very natural and fully robust interpretation as 

http://example.ac.uk/myfile hasCalibration http://example.ac.uk/mycalibration.fits

If you want to have a tree of such relations, you can.  It would require a little more thought than the SKOS tree mentioned above, but nothing major.
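
To make that reading concrete, here is a minimal sketch in Python.  It is an illustration only, not anything taken from the Datalink WD: it assumes the dataset is identified by a column I have called 'ID', and the hasCalibration URI is the made-up one from the example above.

```python
# Sketch only: the 'ID' key and the hasCalibration URI are illustrative
# assumptions, not names taken from the Datalink WD.
def row_to_triple(row):
    """Read a {links} row as a (subject, predicate, object) triple."""
    return (row["ID"], row["semantics"], row["access_url"])


def to_ntriples(triple):
    """Serialise a triple of URIs as a single N-Triples line."""
    return " ".join("<%s>" % term for term in triple) + " ."


row = {
    "ID": "http://example.ac.uk/myfile",
    "access_url": "http://example.ac.uk/mycalibration.fits",
    "semantics": "http://example.ivoa.net/datalink/info#hasCalibration",
}
triple = row_to_triple(row)
print(to_ntriples(triple))
```

The point is only that the mapping is mechanical: if the 'semantics' value is a predicate, each row becomes one statement with no guessing about what relates to what.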

And yes, this is obviously inspired by the RDF model; but that's no bad thing.  Even if you are sure you will never re-phrase such a {links} resource as RDF, ensuring that you _could_ do so, in a straightforward and unambiguous way, guarantees that what you have is well thought through, and that you have asked and answered some pertinent questions about 'just what do you mean, here?'

For example, reading the WD-DataLink-1.0-20140228 document and the DALI spec to which it links, I'm a bit uncertain: which URL is the {links} list giving the access_url of?  That is, in the above example, what is the URL that http://example.ac.uk/mycalibration.fits is the calibration of?  Is it an IVORN naming the dataset?  An example in the document would be useful.

> Now, what's <relation> in 
> 
> ivo://x.ogs/data?exposure1 <relation> the Z observatory log?
> 
> 
> is-logged-in?  has-an-entry-in? Don't like it, seems very artificial.

How is this artificial?  If I want to find out more details about ivo://x.ogs/data?exposure1 then I do want to know where this exposure is logged, so 'is-logged-in', or 'has-log-entry-in', or however you want to spell it, is precisely the relationship I'm looking for.

> And indeed, what we're talking about here simply is a file, a
> dataset, "a thing", and semantics IMHO shouldn't do more than say
> what kind of thing.  Hence, semantics should contain a noun,
> specifically, a noun that's narrower than "scientifically relevant
> data" or "service producing scientifically relevant data".

That's a _big_ 'hence'!  You could potentially say that the 'semantics' column indicates a class _and_ that the thing at the end of the access_url link isA member of that class.  In that case, these terms should (probably) be nouns.  Not otherwise.

Remember that a predicate may well rigidly imply the type of the thing it's pointing to, so there's no loss in a 'type' sense, to regarding these as predicates.
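
As a rough sketch of what I mean (in Python, with entirely hypothetical URIs and class names), a vocabulary could record, for each predicate, the class of the thing it points to, which is what RDFS calls the predicate's range.  A client that only cares about 'what kind of thing is this?' can then recover the type from the predicate.

```python
# Hypothetical vocabulary: every URI and class name here is made up for
# illustration.  Each predicate declares the class of the thing it points to.
NS = "http://example.ivoa.net/datalink/info#"

PREDICATE_RANGE = {
    NS + "hasCalibration": NS + "CalibrationFile",
    NS + "hasPreview": NS + "Preview",
}


def implied_type(semantics):
    """Return the class implied for the linked thing, or None if unknown."""
    return PREDICATE_RANGE.get(semantics)


calibration_type = implied_type(NS + "hasCalibration")
unknown_type = implied_type("http://example.edu/other#somethingElse")
```

So the noun-like information is still available to anyone who wants it, and the relation to the dataset comes along for free.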

Something that occurs to me: is there any way of saying, within the Datalink framework, "this data was observed by 'Norman Gray'"?  That is, a relation to a string (or similar) rather than a URL.

> For that,
> we don't need RDF, the notion of triples, or relations.  The
> computationally much simpler plain vocabulary suffices.

Where's the computational complexity?  I worry we're at cross-purposes.

Pat said:

> In the next revision, I will change the language here to refer to an external vocabulary. That requires a URL where people/developers/software can go and find "words"... Someone (Markus? Norman?) should propose a URL and put some minimal content there. Where do we write down the responsibility/rules for maintaining it?

The namespace URL could be something like <http://www.ivoa.net/Documents/20140228/relations>, though it would probably be better to use the <http://www.ivoa.net/rdf/> tree, which was intended for things like this.  The document at the end of that doesn't have to be complicated.  The ones for the Vocabularies spec are basically just text files, each generated by a python script from a separate master file in the Volute repository.  Nothing fancy, and potentially updatable on any schedule you like.

Francois said:

>     By the way, if we want to refine the relationship between the dataset and what is retrieved through the link we could use the concept hierarchical refinment to do it.
>      For example:
>                science
>                        catalog
>                               external_reference_catalog
>      could be different relationship to the daset than
>                science
>                       catalog
>                              extracted_sources

Those look more like predicates to me (ie, binary relations with respect to the thing this is the Datalink description of).  

I don't think there's a _need_ for hierarchy in this case, but if it's felt desirable, on aesthetic grounds or in order to hold the door open to more sophisticated use in the future, then spelling the relationship rdfs:subPropertyOf doesn't seem intrinsically more complicated than spelling it skos:broader.
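
For what it's worth, here is a small Python sketch of how a client might use such a hierarchy.  The predicate names are my own invention, loosely based on Francois's example, and the dictionary merely plays the role that rdfs:subPropertyOf statements would play in a real vocabulary.

```python
# Hypothetical predicate tree, child -> parent, standing in for
# rdfs:subPropertyOf statements.  The names are illustrative only.
SUB_PROPERTY_OF = {
    "hasExtractedSources": "hasCatalog",
    "hasExternalReferenceCatalog": "hasCatalog",
    "hasCatalog": "hasScienceData",
}


def ancestors(predicate):
    """Return the predicate followed by all of its super-properties."""
    chain = [predicate]
    while chain[-1] in SUB_PROPERTY_OF:
        chain.append(SUB_PROPERTY_OF[chain[-1]])
    return chain


def matches(predicate, wanted):
    """True if 'predicate' is 'wanted' or a refinement of it."""
    return wanted in ancestors(predicate)


chain = ancestors("hasExtractedSources")
```

A client that only understands the broad term (here, hasScienceData) can still make use of a row carrying a finer one, which is the same benefit one would get from skos:broader.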

----

So, returning to...

  1. I think the 'semantics' column should be URIs, not just a simple string from a pre-specified set.

URIs give free namespacing, free future expansion, extensibility, etc etc etc.  If the 'semantics' column is unrestricted (ie, not restricted to simply Datalink-approved URIs), then a service would be free to include, for example, http://example.edu/datalink/ourRelations#originatingFundingProposal as a 'semantics' relation which points to the proposal associated with this data.  Most clients wouldn't recognise that, but (a) those that did might find it useful, and (b) those that didn't might decide to start.
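
A sketch of the client side, in Python, might look like the following.  The namespace, the set of 'recognised' terms and the proposal URL are all assumptions made for the sake of illustration; the only point is that unknown URIs can be set aside without breaking anything.

```python
# Assumed namespace and term set; neither is defined by any IVOA document.
NS = "http://example.ivoa.net/datalink/info#"
RECOGNISED = {NS + "hasCalibration", NS + "hasPreview"}


def partition(rows):
    """Split rows into those with recognised and unrecognised relations."""
    recognised, unrecognised = [], []
    for row in rows:
        if row["semantics"] in RECOGNISED:
            recognised.append(row)
        else:
            unrecognised.append(row)
    return recognised, unrecognised


rows = [
    {"access_url": "http://example.ac.uk/mycalibration.fits",
     "semantics": NS + "hasCalibration"},
    # Hypothetical service-specific relation and target.
    {"access_url": "http://example.edu/proposals/1234",
     "semantics": "http://example.edu/datalink/ourRelations#originatingFundingProposal"},
]
recognised, unrecognised = partition(rows)
```

A client is free to ignore the second list, show it to the user as 'other links', or log the URIs it sees most often as candidates for future support.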

  2. They should be conceived of as predicates, not types or SKOS concepts.

I hope I've given some food for thought here.

  3. (I agree that) the standard/preferred/initial/understanding-required set should be maintained separately from the Datalink document.

This would be easy to arrange.

Oh, a final point: reference [1] points to http://www.ivoa.net/std/DALI/ which returns a 404.  Do you mean <http://www.ivoa.net/documents/DALI/>?

I hope this is all useful.  I'd be very happy to expand on this stuff.

All the best,

Norman

-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK