Datalink Feedback VI: Semantics

Mon Apr 7 10:59:42 PDT 2014

Just when you thought it'd be safe to read DAL again... here's
another one.

This time it's about 3.2.6, semantics, on which the WD says that it's
a column containing "a single word (or comma-separated list?) from a
small vocabulary that describes the meaning of this link relative to
the dataset."

First, I'd like to strongly suggest that a single word should do, as I'd
say we've done a bad job in vocabulary construction if overlaps are
common; also, people tend to get enumerations wrong (see, e.g.,
content.level in VOResource).

But my main concern is: what are the valid values here?  If it's a
closed vocabulary (and I think that's almost the right design
decision), we had better get this right.  Which here means: useful.
Which raises the question: Useful for what?

So: What are the use cases for semantics?

I believe semantics should be for machines what description is for
humans: It should let machines select the links of interest to them,
or do a preselection based on what the user presumably is interested
in.

What could the selection tasks be?

One is clear to me: "Retrieve the full, original dataset."  I had
suggested "self" as a term for that, and I stand by it.  "science" is
IMHO too general for that, as it would cover things like "joined Echelle
spectrum", too.
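
To make that selection task concrete, here is a minimal sketch of what
a client might do.  It assumes the links table has already been parsed
into a list of dicts; the column name access_url, the example URLs and
the function name select_links are made up for illustration and are
not taken from the WD.

```python
# Hypothetical sketch: rows of a links table as plain dicts.
# "id" and "semantics" follow the columns discussed here;
# "access_url" is an assumed name for the column holding the link.

def select_links(rows, dataset_id, term):
    """Return the URLs of all links for dataset_id whose semantics is term."""
    return [
        row["access_url"]
        for row in rows
        if row["id"] == dataset_id and row["semantics"] == term
    ]


ROWS = [
    {"id": "ivo://x.ogs/data?exposure1", "semantics": "self",
     "access_url": "http://example.org/get/exposure1.fits"},
    {"id": "ivo://x.ogs/data?exposure1", "semantics": "preview",
     "access_url": "http://example.org/preview/exposure1.png"},
    {"id": "ivo://x.ogs/data?exposure1", "semantics": "mask",
     "access_url": "http://example.org/mask/exposure1.fits"},
]

# "Retrieve the full, original dataset."
full_dataset = select_links(ROWS, "ivo://x.ogs/data?exposure1", "self")
```

With exactly one word per row, this stays a plain equality test; a
comma-separated list would force every client to split and match
instead.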

From there on, it gets murky, and what I think we should do is go out
to service operators and client writers and pipeline builders and
explain things to them until they come up with use cases.  One thing
that came up recently for me is "mask for contaminated areas of an
image".  That's fairly common, and in an extraction or analysis task,
it's useful to have, and a machine presumably could automatically do
something with it.

I suspect there are quite a few cases like this, but we don't have them
yet, and new ones might come up.  I'd therefore suggest keeping the
vocabulary in an external resource maintained by the DAL chairs, as a
SKOS vocabulary, which is easy enough for clients to interpret.  They
might thus still figure out that a mask is an auxiliary file although
they've never heard of mask as a term before (this is, e.g., for
preselection on behalf of the user).

Based on this and the "preselection on behalf of the user" use case,
here's a proposal for the start of the vocabulary (with indentation
meaning a narrower-than relation; I volunteer for turning this into
well-formed SKOS if people agree this is where we want to go).

self (the full main dataset)
science (science data related to or generated from the main dataset)
  derivation
    source-list
    joined-dataset (e.g., stacked images, joined spectrum)
  source-file (something the current data was made from)
calibration
preview
info
  log
auxiliary
  mask
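
As a rough illustration of how little machinery a client needs for
this, the following sketch reads an indented list like the one above
and answers "is this term narrower than that one?".  It is not SKOS
processing, just the same broader/narrower idea on plain text; the
function names parse_broader and is_a are made up, and the parenthesised
explanations are left out of the term list.

```python
# Hypothetical sketch: the term list mirrors the proposal above,
# with two spaces of indentation meaning "narrower than".
VOCABULARY = """\
self
science
  derivation
    source-list
    joined-dataset
  source-file
calibration
preview
info
  log
auxiliary
  mask
"""


def parse_broader(text, indent=2):
    """Map each term to its immediately broader term (None at top level)."""
    broader = {}
    stack = []  # stack[d] is the most recent term seen at depth d
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip(" "))) // indent
        term = line.strip()
        del stack[depth:]  # forget terms at this depth or deeper
        broader[term] = stack[-1] if stack else None
        stack.append(term)
    return broader


def is_a(term, ancestor, broader):
    """True if term is ancestor or (transitively) narrower than it."""
    while term is not None:
        if term == ancestor:
            return True
        term = broader.get(term)
    return False


BROADER = parse_broader(VOCABULARY)
```

A client that has never heard of "mask" would look it up in the
vocabulary, find is_a("mask", "auxiliary", BROADER) to be true, and
could then treat the link like any other auxiliary file.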

Maybe that's enough to get people (i.e., pipeline authors, client
writers, all the later users of datalink) dreaming?

Pat has also raised the question of whether the terms should actually be
names of relations.  I believe this is inspired by RDF-like triples,
which look somewhat like

<entity1> <relation> <entity2>.

Since entity2 could be "the link in this row" and entity1 "the dataset
referred to in the id column", I was at first inclined to like this
relation idea.  In the end, though, I didn't.  Here's how that came
about:  Suppose entity2 really is a big log of observation entries
for the Z observatory.

Now, what's <relation> in

ivo://x.ogs/data?exposure1 <relation> the Z observatory log?

is-logged-in?  has-an-entry-in? Don't like it, seems very artificial.
And indeed, what we're talking about here simply is a file, a
dataset, "a thing", and semantics IMHO shouldn't do more than say
what kind of thing.  Hence, semantics should contain a noun,
specifically, a noun that's narrower than "scientifically relevant
data" or "service producing scientifically relevant data".  For that,
we don't need RDF, the notion of triples, or relations.  The
computationally much simpler plain vocabulary suffices.

And I'd consider that good news.

Cheers,

            Markus