Datalink Feedback VI: Semantics

Tue Apr 8 10:09:35 PDT 2014

I completely agree with this:

- semantics has to be useful for software making decisions; the 
description is for people
- vocab should be maintained outside the DataLink specification

with the caveat that I don't have much of an opinion on how such a 
vocabulary should be maintained or what form it should take. It does seem 
like SKOS should be looked at first.

In the next revision, I will change the language here to refer to an 
external vocabulary. That requires a URL where 
people/developers/software can go and find "words"... Someone (Markus? 
Norman?) should propose a URL and put some minimal content there. Where 
do we write down the responsibility/rules for maintaining it?

Pat

On 07/04/14 10:59 AM, Markus Demleitner wrote:
> Just when you thought it'd be safe to read DAL again... here's
> another one.
>
> This time it's about 3.2.6, semantics, on which the WD says that it's
> a column containing "a single word (or comma-separated list?) from a
> small vocabulary that describes the meaning of this link relative to
> the dataset."
>
> First, I'd like to strongly suggest that a single word should do, as I'd
> say we've done a bad job in vocabulary construction if overlaps are
> common; also, people tend to get enumerations wrong (see, e.g.,
> content.level in VOResource).
>
> But my main concern is: what are the valid values here?  If it's a
> closed vocabulary (and I think that's almost the right design
> decision), we had better get this right.  Which here means: useful.
> Which begs the question: Useful for what?
>
> So: What are the use cases for semantics?
>
> I believe semantics should be for machines what description is for
> humans: It should allow machines a selection of links of interest to
> them, or to do a preselection based on what the user supposedly is
> interested in.
>
> What could the selection tasks be?
>
> One is clear to me: "Retrieve the full, original dataset."  I had
> suggested "self" as a term for that, and I stand by it.  "science" is
> IMHO too general for that, as it would cover things like "joined Echelle
> spectrum", too.
>
> From there on, it gets murky, and what I think we should do is go out
> to service operators and client writers and pipeline builders and
> explain things to them until they come up with use cases.  One thing
> that came up recently for me is "mask for contaminated areas of an
> image".  That's fairly common, and in an extraction or analysis task,
> it's useful to have, and a machine presumably could automatically do
> something with it.
>
> I suspect there are quite a few cases like this, but we don't have them,
> and new ones might come up.  I'd therefore suggest to have the
> vocabulary in an external resource maintained by the DAL chairs, a
> SKOS vocabulary, which is easy enough for clients to interpret.  They
> might thus still figure out that a mask is an auxiliary file although
> they've never heard of mask as a term before (this is, e.g., for
> preselection on behalf of the user).
>
> Based on this and the "preselection on behalf of the user" use case,
> here's a proposal for the start of the vocabulary (with indentation
> meaning a narrower-than relation; I volunteer for turning this into
> well-formed SKOS if people agree this is where we want to go).
>
> self (the full main dataset)
> science (science data related to or generated from the main dataset)
>    derivation
>      source-list
>      joined-dataset (e.g., stacked images, joined spectrum)
>    source-file (something the current data was made from)
> calibration
> preview
> info
>    log
> auxiliary
>    mask
>
> Maybe that's enough to get people (i.e., pipeline authors, client
> writers, all the later users of datalink) dreaming?
>
> Pat has also raised the question of whether the terms should actually be
> names of relations.  I believe this is inspired by RDF-like triples,
> which look somewhat like
>
> <entity1> <relation> <entity2>.
>
> Since entity2 could be "the link in this row" and entity1 "the dataset
> referred to in the id column", I might like this relation thing.  But
> then I didn't quite like that in the end.  Here's how that came
> about:  Suppose entity2 really is a big log of observation entries
> for the Z observatory.
>
> Now, what's <relation> in
>
> ivo://x.ogs/data?exposure1 <relation> the Z observatory log?
>
>
> is-logged-in?  has-an-entry-in? Don't like it, seems very artificial.
> And indeed, what we're talking about here simply is a file, a
> dataset, "a thing", and semantics IMHO shouldn't do more than say
> what kind of thing.  Hence, semantics should contain a noun,
> specifically, a noun that's narrower than "scientifically relevant
> data" or "service producing scientifically relevant data".  For that,
> we don't need RDF, the notion of triples, or relations.  The
> computationally much simpler plain vocabulary suffices.
>
> And I'd consider that good news.
>
> Cheers,
>
>              Markus
>
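
As a rough illustration of the "preselection on behalf of the user" use 
case in the quoted message: the proposed term tree can be written as a 
child-to-parent map, so that a client that has never seen "mask" can 
still work out that it is a kind of auxiliary file. This is only a 
sketch. The names BROADER, is_a and select_links and the row layout are 
made up here, and a real client would presumably read the relations from 
the published SKOS vocabulary rather than hard-coding them.

```python
# Child -> parent ("broader") links, transcribed from the indented
# term list in the quoted proposal. Top-level terms have no entry.
BROADER = {
    "derivation": "science",
    "source-list": "derivation",
    "joined-dataset": "derivation",
    "source-file": "science",
    "log": "info",
    "mask": "auxiliary",
}


def is_a(term, ancestor):
    """Return True if term equals ancestor or is narrower than it."""
    while term is not None:
        if term == ancestor:
            return True
        # Walk one step up the hierarchy; unknown or top-level terms end the walk.
        term = BROADER.get(term)
    return False


def select_links(rows, wanted):
    """Keep the link rows whose semantics value falls under the wanted term."""
    return [row for row in rows if is_a(row["semantics"], wanted)]


# Hypothetical link rows; only the semantics column matters here.
rows = [
    {"semantics": "self"},
    {"semantics": "mask"},
    {"semantics": "source-list"},
    {"semantics": "log"},
]
science_links = select_links(rows, "science")
auxiliary_links = select_links(rows, "auxiliary")
```

With a single word per row, selection is a plain walk up the tree, with 
no triples or relations involved.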

-- 

Patrick Dowler
Canadian Astronomy Data Centre
National Research Council Canada
5071 West Saanich Road
Victoria, BC V9E 2E7

250-363-0044 (office) 250-363-0045 (fax)