Datalink Feedback VI: Semantics

Wed Apr 23 08:29:23 PDT 2014

Markus and all, hello.

The following email is quite 'bitty', consisting of rejoinders to Markus's points rather than assembling a continuous argument.

Short version:

  * I still maintain that SKOS is too vague to be useful for the intended purpose here, unless we add Datalink-standard-only meanings, and I try to elaborate this below.

  * The set of use-cases still seems very open-ended and fuzzy, so I think it would be inappropriate to settle for a solution which can satisfy _only_ the core use-cases.

  * The RDF model perfectly matches what's being aimed at here.  In that model, simple things are simple, and future extensions remain possible.  RDF isn't complicated and, to a first approximation here, consists of respelling skos:broader as rdfs:subPropertyOf.

On 2014 Apr 22, at 10:39, Markus Demleitner <msdemlei at ari.uni-heidelberg.de> wrote:

> On Thu, Apr 17, 2014 at 05:11:33PM +0100, Norman Gray wrote:
>> The short/blunt version of my remarks below is:
>> 
>>  1. I think the 'semantics' column should be URIs, not just a
>>  simple string from a pre-specified set.
>> 
>>  2. They should be conceived of as predicates, not types or SKOS concepts.
> 
> Hmmmmm.  I'm afraid I don't quite share those two assessments.  I
> *believe* that's due to the restricted use cases I have in mind.  To
> recap, these are
> 
> (1) Pre-filtering or sorting links for the user
> (2) Identifying the full data set
> (3bis) (half use-case as I don't see an implementation forthcoming): Use a
>  mask in an extraction task or something similar.
> 
> I'm sure we should be gathering more of those, but while that's all
> there is, I think we shouldn't build anything more complicated than
> what's required to support that.

Are those really all of the use-cases?  Your original message in this thread (7 April) suggested that there was one clear use-case, but a 'murky' prospect thereafter, and the expectation that there will be a longer list of requirements which will be extracted from service operators.  Pat also said (as I interpret his message yesterday) that there are some clear use-cases (which match the ones above), but also an uncertain list of use-cases in the slightly longer term.

Just by the way: I worry that we fetishise use-cases in the IVOA.  They're useful for scoping, for making actions concrete, for identifying _non_-usecases, and so on, and of course they're useful for reining back wild architectural visions and over-engineering (YAGNI, and all that).  But in this case the set of use-cases seems fuzzy round the edges, and is likely to grow, to a large enough extent that a solution that imperfectly matches _only_ the core cases (and I claim that's the case for SKOS) seems to deny any breathing space for later growth.

>> Why SKOS?
> 
> The strongest reason: It's already being used for fairly similar
> things in the VO, and I think it's a good idea to keep the number of
> semantics technologies in the VO as low as possible.  We already have
> at least UCDs (flat list with a bit of grammar to combine terms) and
> SKOS (simple terms in something close to a hierarchy).  Unless
> there's a compelling case, I'd say we should try to avoid adding to
> this list.

The UCDs are possibly a poor example.  They're barely structured, and all they can really do is gesture towards a linked concept in a way that is of (important) heuristic use, but no more.  They are pretty naturally represented as a SKOS thesaurus; but in the SKOSified version of the UCDs, the broader/narrower relations are largely fake.

>>    access_url = http://example.ac.uk/mycalibration.fits
>>    semantics = http://example.ivoa.net/datalink/skos#Calibration
>> 
>> If you're taking SKOS seriously, then you're saying that
>> <http://example.ac.uk/mycalibration.fits> is related to the idea of
>> 'Calibration' (perhaps it's a book about calibration
>> methodologies).  But even if you guess from context that it's
>> actually a calibration file, all this says is that
> 
> That's not a guess.  This is a datalink table, and so each row
> corresponds to a (possibly parameterizable) bytestream representing
> something related to the particular dataset, which, in this day and
> age, passes for "thing".  Therefore I'd claim:

But in that case you're not using SKOS any more, but instead some hybrid system which uses SKOS-flavoured URLs given a custom meaning by the text in the standard document.

What that means in turn is that if someone extracts these fields from the Datalink response:

   access_url = http://example.ac.uk/mycalibration.fits
   semantics = http://example.ivoa.net/datalink/skos#Calibration

...and stores those access and meaning URLs in a database (for whatever reason -- perhaps to pass on, perhaps to use as a cache of some type), then they have to _add_ the indication that this information came from a Datalink response, in order that they be not interpreted as 'normal' SKOS, but with a special SKOS-plus-datalink meaning.

If, instead, the standard were to indicate that 

    access_url = http://example.ac.uk/mycalibration.fits
    semantics = http://example.ivoa.net/datalink/info#hasCalibration

means, in the RDF sense,

    http://example.ac.uk/myfile hasCalibration http://example.ac.uk/mycalibration.fits

then I can use this information anywhere else, or serialise it into, or deserialise it from, something completely different, confident that it'll mean the same thing.  That 'something different' might be some TBD VOTable pattern, or some weirdo pattern of FITS cards.
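
As a rough illustration of that portability, here is a minimal Python sketch; it is not anything taken from the Datalink draft. It assumes that the subject of the triple is the dataset identifier attached to the row (http://example.ac.uk/myfile above), and the dict layout and the function names row_to_triple, to_ntriples and from_ntriples are invented for the example. It reads one row as a triple, writes it as a line of N-Triples, and reads it back unchanged.

```python
# Hypothetical sketch: the names and the dict layout are illustrative only.

def row_to_triple(row):
    """Read one link row as (subject, predicate, object)."""
    return (row["id"], row["semantics"], row["access_url"])

def to_ntriples(triple):
    """Serialise a triple of URLs as one N-Triples line."""
    return " ".join("<%s>" % term for term in triple) + " ."

def from_ntriples(line):
    """Parse a line written by to_ntriples (URL-only triples, no literals)."""
    parts = line.split()
    if len(parts) != 4 or parts[3] != ".":
        raise ValueError("not a simple URL-only triple: %r" % line)
    # Strip the enclosing angle brackets from each term.
    return tuple(term[1:-1] for term in parts[:3])

row = {
    "id": "http://example.ac.uk/myfile",
    "semantics": "http://example.ivoa.net/datalink/info#hasCalibration",
    "access_url": "http://example.ac.uk/mycalibration.fits",
}

triple = row_to_triple(row)
line = to_ntriples(triple)
print(line)
```

Nothing in the stored line needs to record that it came from a Datalink response; the predicate URL carries the meaning by itself.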

The point here is not to sneak feature-creep into the Datalink design, but to say that the way to keep the door open for still-unarticulated 'murky' use-cases is to ensure that the design is clean and well-defined for the simple cases.

> So -- I give you this point, and thinking along these lines we might
> find out why we should really have something else than a single SKOS
> term per semantics field.  Until then, I can live with interpreting
> 
> id             semantics     access_url
> ivo://foo/bar  calibration   http://foo.com/bar-flat.fits
> ivo://foo/bar  calibration   http://foo.com/bar-dark.fits
> ivo://foo/bar  documentation http://foo.com/calibration-guide.pdf
> 
> as
> 
> bar-flat.fits and bar-dark.fits are files used in the calibration of
> ivo://foo/bar.  calibration-guide.pdf is a  file that's some sort of
> documentation.  But you're right, it doesn't feel *quite* right.  Can
> we have a use case where this would blow up?

How about: you later realise that you want to distinguish flats from darks in the metadata, so want to say in fact

    id             semantics     access_url
    ivo://foo/bar  flat          http://foo.com/bar-flat.fits
    ivo://foo/bar  dark          http://foo.com/bar-dark.fits
    ivo://foo/bar  documentation http://foo.com/calibration-guide.pdf

...and have this still be interpretable (after re-issuing the vocabulary, but without having to re-issue the standard) as 'flat' and 'dark' both being 'calibration'.

You can write that as

    http://example.org/flat skos:broader http://example.org/calibration
or as
    http://example.org/flat rdfs:subPropertyOf http://example.org/calibration

This would make an important difference to the minority of users who happen to know the difference between SKOS and RDFS, but for everyone else, they're just two trivially different ways of spelling 'up a bit in the tree'.  The important thing is that the latter is much better defined, and actually says what you mean, rather than hinting towards it.  That doesn't seem to be a dangerous proliferation of semantic technologies.
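
To make 'up a bit in the tree' concrete, here is a minimal Python sketch. The vocabulary is just a table of child-to-parent links (read each one as rdfs:subPropertyOf or as skos:broader, as you prefer), and a client climbs it to discover that 'flat' counts as 'calibration'. The example.org URLs are the ones above plus an analogous one for 'dark'; the PARENT table and the function is_kind_of are made up for illustration.

```python
# Hypothetical vocabulary: each term points to its parent
# (rdfs:subPropertyOf, or skos:broader if you prefer that spelling).
PARENT = {
    "http://example.org/flat": "http://example.org/calibration",
    "http://example.org/dark": "http://example.org/calibration",
}

def is_kind_of(term, ancestor, parent=PARENT):
    """True if term is ancestor, or reaches it by climbing the parent links."""
    seen = set()
    while term is not None and term not in seen:
        if term == ancestor:
            return True
        seen.add(term)           # guard against accidental cycles
        term = parent.get(term)  # None once we run off the top of the tree
    return False

print(is_kind_of("http://example.org/flat", "http://example.org/calibration"))
print(is_kind_of("http://example.org/calibration", "http://example.org/flat"))
```

Re-issuing the vocabulary then amounts to adding rows to that table; the client code doesn't change.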

>> If you want to have a tree of such relations, you can.  It would
>> require a little more thought than the SKOS tree mentioned above,
>> but nothing major.
> 
> Well, Lexical semantics of terms working as relations -- two-argument
> verbs or verb phrases with an empty argument -- is *much* harder than
> with nouns (ask the folks who do Wordnet or Cyc) unless your
> relations are essentially just is-<noun> in the first place.  In
> which case you could have kept SKOS.

What you say is true _if_ you're dealing with natural languages, or the sort of natural-language vagueness that the Wordnet and Cyc people have to deal with.  But we're not.  We're dealing with relatively simple statements of the relationships between data objects.  You _might_ want to say something indirect like "this data was written by the person whose email address is foo at example.org", but probably you'll just want to say dataset:id foo:author http://example.org/data-creation-project.  A triples-based approach can say the simple statement easily, and can say the indirect one if necessary (and there's a decade of work to prove that it can do this adequately); a SKOS-based approach can barely say the simple one.

That neatly brings us to...

>> Something that occurs to me: is there any way of saying, within the
>> Datalink framework, "this data was observed by 'Norman Gray'"?
>> That is, a relation to a string (or similar) rather than a URL.
> 
> No, and I'm very much against teaching it to do anything like it
> (unless of course a client might usefully download "Norman Gray":-).
> We don't want to solve provenance here, we want to annotate files
> having something to do with a given dataset.

Will you _never_ want to even touch on provenance here?  If the Datalink effort and the Provenance effort play their cards right (by which I mean that they always ask 'how might I turn this into triples?' even if there's never an RDF angle bracket in sight), then it will be very easy to mix and match the two efforts.  You'd be able to use Provenance terms in a Datalink response (it won't hurt, and might be useful downstream), and vice versa.  All for free.

>> Pat said:
>> 
>>> In the next revision, I will change the language here to refer to
>>> an external vocabulary. That requires a URL where
>>> people/developers/software can go and find "words"... Someone
>>> (Markus? Norman?) should propose a URL and put some minimal
>>> content there. Where do we write down the responsibility/rules
>>> for maintaining it?
>> 
>> The namespace URL could be something like
>> <http://www.ivoa.net/Documents/20140228/relations>, though it would
>> probably be better to use the <http://www.ivoa.net/rdf/> tree,
>> which was intended for things like this.  The document at the end
>> of that doesn't have to be complicated.  The ones for the
>> Vocabularies spec are basically just text files, each generated by
>> a python script from a separate master file in the Volute
>> repository.  Nothing fancy, and potentially updatable on any
>> schedule you like.
> 
> Uh, I'm totally against namespacing these things.  A good part of the
> gripes people have our VO registry has the subtleties of XML
> namespaces and their management between VO players at their heart.  I
> often wish we had back then said: "Ok, here's a master XSD of VO
> registry types.  If you need more, here's a lightweight process you
> go through to enter that file."

Then don't call them namespaces -- just call them URLs.

'Namespaces' have a very bad reputation because of the way they're handled in XML, where they're ugly and confusing and don't seem to give a huge amount of long-term benefit.

It's surely useful to be able to say that http://example.org/calibration and http://example.edu/calibration are different things, or to allow the example.edu people to start using _their_ sense of calibration where it's useful for their processes, without them having to go back to the IVOA to get an extension to the Datalink Vocabulary.

It doesn't mean that everyone has to understand all of the terms.  If you give out data where you _only_ use your own private terms, then you're being silly, but if you use your private terms alongside generally understood ones (such as the Datalink Standard List Of Terms), then you're adding value.

> I also claim we don't need versioning of the terms in semantics (yes,
> they are somewhat vague, but I claim that's still enough to cover the
> use cases), and that on the other hand  versioning will be a huge
> pain for all involved, with what will actually work discarding the
> version information and thus making this worse than if there's be no
> expectation of versioning in the first place.

The best answer I have here is that http://iau.example.org/terms/Planet isn't versioned _as a term_, and nor should it be.  Its _meaning_ may be different at different times, so the _explanation_ may be versioned, but the term itself isn't.

This is also a good point to note again that if the terms are URLs, then the obvious place to put documentation for them is at the end of that URL.  Don't know what a term means?  Paste it into a browser and find out.

> In particular with the hierarchies it's just *so* much easier if
> clients can go to *one* place and get the *entire* list in one,
> easy-to-parse format, if they can even embed everything they have to
> reasonably expect at any given moment.
> 
> So, unless someone comes up with a very strong use case where we
> need URLs rather than simple terms, I'd urge to keep simple strings.

It would certainly be easier if we could solve the entire problem in one go; if we could determine all of the metadata that anyone will ever need, or at least do so well enough that revisions could be managed through the usually glacial IVOA process.  But I'm not confident that's possible.

Soapbox bit: What I feel is the biggest, clearest, most shining Win for the RDF/triples/open-world approach (as an approach, rather than a species of angle-brackets) is that it allows you to have very cheap extensibility -- future-proofing -- without making simple things anything other than simple.  If a client understands a term, good; if it doesn't, it ignores it; if the result is unintelligible, it complains.  If you want to get fancy, you can get the client to mechanically do some digging on what the term means ("oh, you just mean 'calibration'? fine..."), but that sort of thing isn't necessary.
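
That client behaviour can be sketched in a few lines of Python. This is only an illustration under assumptions: the set of terms the client understands, the broader-term table it might have fetched from the vocabulary, and the example.edu private term and its file URL are all invented here.

```python
# Hypothetical client policy for the 'semantics' column.
UNDERSTOOD = {
    "http://example.org/calibration",
    "http://example.org/documentation",
}

# Optional extra knowledge, e.g. fetched from the vocabulary document.
BROADER = {"http://example.org/flat": "http://example.org/calibration"}

def interpret(term, understood=UNDERSTOOD, broader=BROADER):
    """Return the understood term this one maps to, or None to ignore it."""
    seen = set()
    while term is not None and term not in seen:
        if term in understood:
            return term
        seen.add(term)
        term = broader.get(term)  # "oh, you just mean 'calibration'?"
    return None

def usable_links(rows):
    """Keep rows whose term is understood; complain if nothing is left."""
    kept = []
    for term, url in rows:
        known = interpret(term)
        if known is not None:
            kept.append((known, url))
    if rows and not kept:
        raise ValueError("no link in this response is intelligible")
    return kept

rows = [
    ("http://example.org/flat", "http://foo.com/bar-flat.fits"),
    ("http://example.edu/calibration", "http://foo.com/bar-private.dat"),
]
print(usable_links(rows))
```

The private example.edu term costs this client nothing: it is simply skipped, while a client that does know it can use it.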

> Where do we go from here?  I'll listen to a "stop it, you're all
> wrong" or "here's a use case that your naive SKOS thing can't cover."
> I don't consider any of my objections terribly strong, and I'm sure
> Norman's plan would work out, too.

To the extent that I'm proposing a Plan, I suppose it'd be this:

  1. the items in the Datalink 'semantics' column are URLs which are taken to be RDF Properties

  2. the language in Sect. 3.2.6 of the Datalink spec indicates in a sentence that each row of this list of links (which has a 'semantics' entry) is interpretable as an RDF triple

  3. any URL is permitted in the 'semantics' column, and clients are not required to understand any of them, but...

  4. there is a parallel Datalink Vocabulary document, in preparation, which defines some 'blessed' ones, which it would be silly for a client not to understand.

Re 1: I think that 'relation' or 'predicate' would be a better name than 'semantics'.  Also, the idea of an RDF 'Property' (aka 'Predicate') is simply that it's the link between a 'subject' and an 'object':

    ivo://foo/bar                     <---->  subject
    datalink:hasCalibration           <---->  property
    http://example.org/bar-flat.fits  <---->  object

Re 4: Pat:

> If we can work with Semantics-WG to come up with a mechanism to support experimentation and development of the vocab, then we might be able to stick that in. (Norman: Can you set aside time to discuss this in Madrid?)

I agree with Pat that it would be useful to factorise out the Datalink standard and the preferred vocabulary.

This would be very easy, and yes, I can discuss this in any detail in Madrid.

All the best,

Norman

-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK