External semantics

Fri Jun 19 16:02:18 CEST 2015

Greetings, all.

[Longish: (i) summary of my interpretation of the discussion yesterday; (ii) strawman VO-DML text replacing the current text of Sect.4.15.]

There was a very useful discussion in the Thursday DM session, about enumerations vs external semantics; it followed up Markus's message of <http://mail.ivoa.net/pipermail/dm/2015-May/005180.html>.

There seemed to be some consensus that there was a role for enumerations baked into the data model (for the case where the 'enumeration' property -- that the terms are known to be exhaustive and exclusive -- is important enough that the hard-to-update cost is worth paying), and a role for separate/external semantics in 'vocabularies' of some type.

In the session, Gerard noted that if the 'external semantics' is flexible, then there's a temptation to do all sorts of modelling work in that framework.  Quite right: we don't want this to be a 'back door' to subvert the modelling agreed in a particular VO-DML model.

Gerard also referred to the earlier SKOSConcept formulation (in SimDB?) of saying 'you can have any term here that is skos:narrower than concept X' (this had a rather fiddly technical description).  I think that articulates a good balance between allowing useful flexibility in these external semantics, and discouraging people from smuggling in extra modelling.

Although it made sense in those conversations around SimDB, I don't (now) think this is quite general enough.  The 'S' in 'SKOS' is for 'simple', and it achieves that simplicity by being specific to the case where you're describing 'concepts' in the context of searching or browsing.  It is deliberately vague, and in particular it does _not_ include the idea of subclass or 'isA' (thus saying 'car' skos:narrower 'steering wheel', or 'mammal' skos:narrower 'cat', does not imply either that a steering wheel is a type of car, or that a cat is a type of mammal).

For the cases where such a relationship is useful (for example 'dark' isA 'calibration-file'), I suggest that RDFS is a suitable lightweight alternative (see Sect.3 of <http://www.w3.org/TR/rdf-schema/> for the very small range of relations defined there).  I think that Markus was persuaded that this structure, however notated, was useful for the Datalink extension vocabulary.

If that is agreed, then the problem becomes how to express this range of possibilities in the VO-DML document in such a way as to license the flexibility of using SKOS or RDFS or ..., but _not_ to license using this as a back door for other modelling.  I don't think there's a technical way of doing that which would be the analogue of the 'must be skos:narrower than X' formulation above.

Turning to the VO-DML document, and a replacement for Sect.4.15, how about the following strawman text:

---vvv---

Section 4.14.1 skosconcept -> topconcept

[...]

Section 4.15 TopConcept (???not the best name)

It is a common pattern in data modeling that one wishes to constrain the set of values on an attribute to some predefined list. One way to do so is using an Enumeration as the attribute's datatype. A user of a data model knows immediately that the elements of the enumeration are exhaustive and exclusive, and also that they are reasonably slow to change. These features can sometimes, however, be disadvantages, for example when a list of terms might be very large and should be allowed to evolve over time, or is predefined and possibly maintained by another party.  In such cases, the values should be constrained by some external semantic structure, references to which are supported by the TopConcept type.

This mechanism should not be taken as an invitation to subvert the main VO-DML model by introducing arbitrary external modelling frameworks.  The two mechanisms described below, using SKOS vocabularies and RDFS subclassing, are intended to be illustrative rather than exhaustive; if they are felt to be insufficient for some reason, any alternative should be compatible in spirit with them.

SKOS vocabularies: The IVOA Recommendation Vocabularies in the Virtual Observatory specifies that the format for such vocabularies should be "based on the W3C's Resource Description Framework (RDF) and the Simple Knowledge Organization System (SKOS)" [17].  When using a SKOS vocabulary as the external semantic structure, the topconcept attribute names a SKOS Concept (that is, an instance of skos:Concept): all of the actual instances of the associated attribute (XXX my VO-DML vocabulary is confused at this point!) must be narrower than this Concept.  To be precise, for a top concept T, any concept c is a valid value for this property, if either:

   c skos:broaderTransitive T .

or if there exists a concept X such that

   c skos:broaderTransitive X . X skos:broadMatch T .

(this just means that, if c is in the same vocabulary as T, then it's connected to T by a chain of one or more skos:broader links, and if it's in a different vocabulary, then there is some X which is in the same vocabulary as c, with a cross-vocabulary link between X and T).
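
As a rough illustration of the rule above, here is a minimal Python sketch.  It assumes the vocabulary is available as a set of (subject, predicate, object) string triples; the function names and the 'ex:' and 'other:' terms are invented for the example and come from no real vocabulary.  It reads the rule strictly: at least one skos:broader step is required, and the skos:broadMatch link must land directly on T.

```python
# Triples are plain (subject, predicate, object) tuples of strings.
# All names and terms here are hypothetical, for illustration only.
BROADER = "skos:broader"
BROAD_MATCH = "skos:broadMatch"


def broader_transitive(triples, concept):
    """Return every concept reachable from `concept` by one or more skos:broader links."""
    seen = set()
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == BROADER and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen


def is_valid_value(triples, c, top):
    """True if c broaderTransitive top, or c broaderTransitive X and X broadMatch top."""
    ancestors = broader_transitive(triples, c)
    if top in ancestors:
        return True
    return any((x, BROAD_MATCH, top) in triples for x in ancestors)


# A made-up vocabulary 'ex:' plus a second vocabulary 'other:' linked to it.
EXAMPLE = {
    ("ex:dark-image", BROADER, "ex:calibration-image"),
    ("ex:calibration-image", BROADER, "ex:image"),
    ("other:bias-frame", BROADER, "other:calib"),
    ("other:calib", BROAD_MATCH, "ex:calibration-image"),
    ("ex:spectrum", BROADER, "ex:dataset"),
}
```

With T = 'ex:calibration-image', both 'ex:dark-image' (same vocabulary) and 'other:bias-frame' (via the cross-vocabulary link) are accepted, while 'ex:spectrum' is not.  Under this strict reading 'other:bias-frame' is not accepted for T = 'ex:image', since the broadMatch link does not point at T itself.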

The SKOS thesaurus-based approach is most useful in the context of searching and browsing of resources.  It is not intended to be useful for any sort of inferencing, and in particular does not support a subclassing or 'Is-A' relationship.  Although it might be tempting to say, for example, something like 'calibration-image' skos:narrower 'dark-image', one is not formally permitted to conclude from this that a dark is a type of calibration image (even though that is true).

RDFS ontologies: The RDF Schema standard <http://www.w3.org/TR/rdf-schema/> provides the minimal structures which are necessary for simple ontologies, and the inferencing associated with them.  It includes domain and range constraints, and subtyping of classes and properties, but cannot, for example, express exclusivity of two terms.  If the external semantic structure is of this type, then the topconcept attribute names an rdfs:Class (not an rdf:Property), and the actual instances of the associated attribute must be (transitively) rdfs:subClassOf this class.
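
The subclass constraint can be sketched in the same way.  This is only an illustration under stated assumptions: the ontology is a set of string triples, the 'ex:' classes are made up, and the sketch requires at least one rdfs:subClassOf step, so it does not accept the top class as a value of itself (RDFS itself treats rdfs:subClassOf as reflexive, so that choice would need to be settled).

```python
# Triples are plain (subject, predicate, object) tuples of strings.
# All names and classes here are hypothetical, for illustration only.
SUBCLASS_OF = "rdfs:subClassOf"


def superclasses(triples, cls):
    """Return every class reachable from `cls` by one or more rdfs:subClassOf links."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def is_valid_class(triples, cls, top):
    """True if `cls` is a (transitive) subclass of the class named by topconcept."""
    return top in superclasses(triples, cls)


# A made-up ontology: darks and flats are calibration files, which are files.
ONTOLOGY = {
    ("ex:dark", SUBCLASS_OF, "ex:calibration-file"),
    ("ex:flat", SUBCLASS_OF, "ex:calibration-file"),
    ("ex:calibration-file", SUBCLASS_OF, "ex:file"),
    ("ex:science-frame", SUBCLASS_OF, "ex:file"),
}
```

Here 'ex:dark' is valid for a top class of either 'ex:calibration-file' or 'ex:file', whereas 'ex:science-frame' is valid only for 'ex:file'.  Unlike the SKOS case, the chain does license the conclusion that anything typed as 'ex:dark' is also an 'ex:calibration-file'.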

It is not necessary to indicate in the VO-DML model which of these options has been chosen, since the URI which is the value of the attribute will carry its own typing information [XXX I'm implicitly assuming RDF here, and that retrieving the URI will return something, in some syntax or other, which contains this information -- ie, the Linked Data model].
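
A sketch of how a client might pick the right check, assuming (as in the note above) that dereferencing the URI has already yielded RDF which has been parsed into string triples.  The function name and the 'ex:' URIs are hypothetical, and the retrieval and parsing steps are deliberately left out.

```python
# Triples are plain (subject, predicate, object) tuples of strings.
# All names here are hypothetical, for illustration only.
RDF_TYPE = "rdf:type"


def semantics_kind(triples, top):
    """Return 'skos' or 'rdfs' according to the declared rdf:type of `top`, else None."""
    types = {o for s, p, o in triples if s == top and p == RDF_TYPE}
    if "skos:Concept" in types:
        return "skos"
    if "rdfs:Class" in types:
        return "rdfs"
    return None


# Made-up typing statements of the kind a retrieval might return.
RETRIEVED = {
    ("ex:calibration-image", RDF_TYPE, "skos:Concept"),
    ("ex:calibration-file", RDF_TYPE, "rdfs:Class"),
}
```

A None result here corresponds to the case where the retrieved description carries no usable typing, which is exactly the situation the XXX note worries about.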

---^^^---

I hope everyone enjoyed Sesto, and ate well.  Have a good trip home.

All the best,

Norman

-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK