Datalink Feedback VI: Semantics

Tue Apr 22 02:39:23 PDT 2014

Hi Norman, dear DAL folks,

Sorry, it's long again.  This may be too subtle for general tastes,
so the TL;DR is: We need use cases for semantics, and for those I
have, we can keep it simpler than Norman suggests; but none of this
is a showstopper as far as I am concerned.

On Thu, Apr 17, 2014 at 05:11:33PM +0100, Norman Gray wrote:
> The short/blunt version of my remarks below is:
> 
>   1. I think the 'semantics' column should be URIs, not just a
>   simple string from a pre-specified set.
> 
>   2. They should be conceived of as predicates, not types or SKOS concepts.

Hmmmmm.  I'm afraid I don't quite share those two assessments.  I
*believe* that's due to the restricted use cases I have in mind.  To
recap, these are

(1) Pre-filtering or sorting links for the user
(2) Identifying the full data set
(3bis) (half use-case as I don't see an implementation forthcoming): Use a
  mask in an extraction task or something similar.

I'm sure we should be gathering more of those, but while that's all
there is, I think we shouldn't build anything more complicated than
what's required to support that.

So:

> On 2014 Apr 7, at 18:59, Markus Demleitner <msdemlei at ari.uni-heidelberg.de> wrote:
> 
> > meaning a narrower-than relation; I volunteer for turning this into
> > well-formed SKOS if people agree this is where we want to go).
> 
> Why SKOS?

The strongest reason: It's already being used for fairly similar
things in the VO, and I think it's a good idea to keep the number of
semantics technologies in the VO as low as possible.  We already have
at least UCDs (flat list with a bit of grammar to combine terms) and
SKOS (simple terms in something close to a hierarchy).  Unless
there's a compelling case, I'd say we should try to avoid adding to
this list.

> Classification, or Library of Congress classification.  That is: "I
> want to find a book about cats: which shelf of the library do I go
> to browse for one?"

Well, that's use case (1): "In this situation, the user won't be
interested in files used in calibration, so I'm not showing them";
"In this situation, I only want to show science data"; "I always show
the file itself on top, then any science data, then any calibration
data, then...".  If we expect that more terms will be added later on,
then the hierarchy is necessary to give older clients a chance to
figure out terms they have never seen (or to be quickly and
automatically updated).
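
To make use case (1) concrete, here is a minimal Python sketch of how a
client could use a narrower-than hierarchy to cope with terms added
after it was written.  Everything in it is an assumption made for
illustration: the term names (including "flat"), the dict-based
representation of the hierarchy, and the function names are
hypothetical and not taken from the Datalink draft.

```python
def known_ancestor(term, broader, known):
    """Return the nearest term in `known`, walking up the hierarchy.

    `broader` maps a term to its broader term (hypothetical layout);
    returns None when neither the term nor any ancestor is known.
    """
    seen = set()
    while term is not None and term not in seen:
        if term in known:
            return term
        seen.add(term)  # guard against accidental cycles in the vocabulary
        term = broader.get(term)
    return None


def arrange_links(rows, broader, order, hidden=()):
    """Drop rows in hidden categories and sort the rest by `order`.

    Rows whose semantics cannot be resolved to a known term go last.
    """
    rank = {term: i for i, term in enumerate(order)}
    known = set(rank) | set(hidden)
    kept = []
    for row in rows:
        category = known_ancestor(row["semantics"], broader, known)
        if category in hidden:
            continue
        kept.append((rank.get(category, len(order)), row))
    kept.sort(key=lambda pair: pair[0])  # stable: keeps input order within a rank
    return [row for _, row in kept]
```

A client that only knows "science", "calibration" and "documentation"
would then treat a later-added "flat" as calibration data, provided the
vocabulary it fetched says that "flat" is narrower than "calibration".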

>     access_url = http://example.ac.uk/mycalibration.fits
>     semantics = http://example.ivoa.net/datalink/skos#Calibration
> 
> If you're taking SKOS seriously, then you're saying that
> <http://example.ac.uk/mycalibration.fits> is related to the idea of
> 'Calibration' (perhaps it's a book about calibration
> methodologies).  But even if you guess from context that it's
> actually a calibration file, all this says is that

That's not a guess.  This is a datalink table, and so each row
corresponds to a (possibly parameterizable) bytestream representing
something related to the particular dataset, which, in this day and
age, passes for "thing".  Therefore I'd claim:

> SKOS isn't binary or relational -- it's an annotation of a thing.

makes SKOS spot-on IMHO.

> 
> Also, SKOS concepts don't have instances -- you can't say 'X isA
> http://example.ivoa.net/datalink/skos#Calibration'.

That I'd consider a valid objection, and it'd almost suffice to
convince me.  In particular, it'd be a strong point for allowing
multiple annotations, which I really don't like.

So -- I give you this point, and thinking along these lines we might
find out why we should really have something other than a single SKOS
term per semantics field.  Until then, I can live with interpreting

id             semantics     access_url
ivo://foo/bar  calibration   http://foo.com/bar-flat.fits
ivo://foo/bar  calibration   http://foo.com/bar-dark.fits
ivo://foo/bar  documentation http://foo.com/calibration-guide.pdf

as

bar-flat.fits and bar-dark.fits are files used in the calibration of
ivo://foo/bar.  calibration-guide.pdf is a file that's some sort of
documentation.  But you're right, it doesn't feel *quite* right.  Can
we have a use case where this would blow up?

> > Pat has also raised the question of whether the terms should actually be
> > names of relations.  I believe this is inspired by RDF-like triples,
> > which look somewhat like
> > 
> > <entity1> <relation> <entity2>.
> > 
> In this view, the 'semantics' relation is indeed a predicate.  Saying
> 
>     access_url = http://example.ac.uk/mycalibration.fits
>     semantics = http://example.ivoa.net/datalink/info#hasCalibration
> 
> has a very natural and fully robust interpretation as 
> 
> http://example.ac.uk/myfile hasCalibration http://example.ac.uk/mycalibration.fits
> 
> If you want to have a tree of such relations, you can.  It would
> require a little more thought than the SKOS tree mentioned above,
> but nothing major.

Well, lexical semantics of terms working as relations -- two-argument
verbs or verb phrases with an empty argument -- is *much* harder than
with nouns (ask the folks who do Wordnet or Cyc) unless your
relations are essentially just is-<noun> in the first place.  In
which case you could have kept SKOS.

> For example, I'm a bit uncertain, reading the
> WD-DataLink-1.0-20140228 document and the DALI spec to which it
> links: what is the URL, that the {links} list gives the access_url
> of?  What (in the above example) is the URL that
> http://example.ac.uk/mycalibration.fits is the calibration of?  Is
> it an IVORN naming the dataset?  An example would be useful, in the
> document.

Well, the WD says:

  Each row in the table represents one link and must have either an
  access_url or an error_message. Normally, if there is an
  error_message, there should be only...

Since, as the Zen of Python states, explicit is better than implicit,
I guess this could be changed to

  Each row in the table represents either a reference to a resource
  related to the dataset referred to by ID or a failure to produce
  such a reference.  In the first case, access_url is non-NULL and
  error_message is NULL, in the second case access_url is NULL and
  error_message contains a string subject to the constraints given
  below.
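
The proposed wording boils down to "exactly one of access_url and
error_message is non-NULL".  A minimal Python sketch of that check
follows; it assumes a row is available as a dict with NULL mapped to
None, and the function name is hypothetical.

```python
def classify_row(row):
    """Return 'link' or 'error' for one datalink row.

    Raises ValueError when the row has both or neither of access_url
    and error_message, which the proposed wording does not allow.
    """
    has_url = row.get("access_url") is not None
    has_error = row.get("error_message") is not None
    if has_url and not has_error:
        return "link"
    if has_error and not has_url:
        return "error"
    raise ValueError(
        "row must have exactly one of access_url and error_message")
```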

> > And indeed, what we're talking about here simply is a file, a
> > dataset, "a thing", and semantics IMHO shouldn't do more than say
> > what kind of thing.  Hence, semantics should contain a noun,
> > specifically, a noun that's narrower than "scientifically relevant
> > data" or "service producing scientifically relevant data".
> 
> That's a _big_ 'hence'!  You could potentially say that the
> 'semantics' column indicates a class _and_ that the thing at the
> end of the access_url link isA member of that class.  In that case,
> these terms should (probably) be nouns.  Not otherwise.

Yes, that's basically what I had in mind, and that's why your
objection above somewhat hits me.

> Something that occurs to me: is there any way of saying, within the
> Datalink framework, "this data was observed by 'Norman Gray'"?
> That is, a relation to a string (or similar) rather than a URL.

No, and I'm very much against teaching it to do anything like it
(unless of course a client might usefully download "Norman Gray":-).
We don't want to solve provenance here, we want to annotate files
having something to do with a given dataset.

> > For that,
> > we don't need RDF, the notion of triples, or relations.  The
> > computationally much simpler plain vocabulary suffices.
> 
> Where's the computational complexity?  I worry we're at cross-purposes.

Ok, I retract the claim about the computational complexity.  As to
conceptual complexity, see above: VO folks might already have learned
SKOS.  Use case (1) already wants hierarchy, so we'll need that.  Do
we really want to teach people that already know how to deal with
SKOS how to now state rdfs:subPropertyOf, and where to find them and
how to parse them?

> Pat said:
> 
> > In the next revision, I will change the language here to refer to
> > an external vocabulary. That requires a URL where
> > people/developers/software can go and find "words"... Someone
> > (Markus? Norman?) should propose a URL and put some minimal
> > content there. Where do we write down the responsibility/rules
> > for maintaining it?
> 
> The namespace URL could be something like
> <http://www.ivoa.net/Documents/20140228/relations>, though it would
> probably be better to use the <http://www.ivoa.net/rdf/> tree,
> which was intended for things like this.  The document at the end
> of that doesn't have to be complicated.  The ones for the
> Vocabularies spec are basically just text files, each generated by
> a python script from a separate master file in the Volute
> repository.  Nothing fancy, and potentially updatable on any
> schedule you like.

Uh, I'm totally against namespacing these things.  A good part of the
gripes people have with our VO registry has the subtleties of XML
namespaces and their management between VO players at its heart.  I
often wish we had back then said: "Ok, here's a master XSD of VO
registry types.  If you need more, here's a lightweight process you
go through to enter that file."

I also claim we don't need versioning of the terms in semantics (yes,
they are somewhat vague, but I claim that's still enough to cover the
use cases).  Versioning, on the other hand, will be a huge pain for
all involved: what will actually work is discarding the version
information, which makes things worse than if there'd been no
expectation of versioning in the first place.

In particular with the hierarchies it's just *so* much easier if
clients can go to *one* place and get the *entire* list in one,
easy-to-parse format, if they can even embed everything they have to
reasonably expect at any given moment.

So, unless someone comes up with a very strong use case where we
need URLs rather than simple terms, I'd urge us to keep simple strings.

So......

Where do we go from here?  I'll listen to a "stop it, you're all
wrong" or "here's a use case that your naive SKOS thing can't cover."
I don't consider any of my objections terribly strong, and I'm sure
Norman's plan would work out, too.

However, as we need a standards text fairly soon, and as I'm sure going
with SKOS will do all I can see us wanting: if nobody protests or
writes some standardese themselves, I'll go ahead late this week and
write two paragraphs (presumably a gross simplification of
http://www.ivoa.net/documents/REC/UCD/UCDlistMaintenance-20060528.html)
and the SKOS vocabulary with plain terms, and post it here.

Cheers,

            Markus