PR#2 for Provenance DM

Tue Sep 3 12:09:24 CEST 2019

Hi Laurent,

On Tue, Sep 03, 2019 at 10:36:18AM +0200, Laurent MICHEL wrote:
> Connection with Semantic
> ========================
> No doubt that connecting models and semantic is a good thing, but the way to
> do it is not clear to me.
> An attribute with a limited set of possible values must be typed as an Enum.
> The Enum is a UML (and VO-DML) dataType which has to be defined in the
> model.

VO-DML as it stands already has a link to vocabularies through
SemanticConcept, which, however, requires SKOS vocabularies.  These
are not really what I'd consider ideal for these little word lists
for various reasons (cf. the current internal draft for the
future Vocabularies spec, http://docs.g-vo.org/vocinvo2.pdf).  

So, while we could use SKOS and it would probably be the quickest
"formal" solution, let's look at alternatives:

> Problems come when Enum items must refer to vocabulary entries. In this case
> we have to setup a bridge between the UML and the vocabulary.
> 
> I can see 3 options:
> 
> 1) Setting the attribute as a free string, putting an UML constraint on it
> and saying in the spec that the value must belong to that vocabulary.
>  - this hide a model constraint in the UML
>  + The vocabulary can be extended without changing the model

That's what I'd do (and what happens in VOResource).  The constraint
to use terms from the vocabulary would be in spec language only, and
validators would have to do the checks outside of VO-DML proper.
Given that we will have to support preliminary and deprecated terms
in the vocabulary (see the draft linked above) with usage patterns
not easily expressible in VO-DML, there's not much of a way around
this.
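
To illustrate what such a check outside of VO-DML could look like,
here is a minimal Python sketch.  Everything in it is an assumption
made for illustration: the terms, the status label "accepted", and the
idea of holding the vocabulary as a plain dictionary.  A real validator
would read the published vocabulary instead of hard-coding it.

```python
# Hypothetical vocabulary snapshot mapping each term to its status.
# The terms and the "accepted" label are invented for this example.
VOCABULARY = {
    "author": "accepted",
    "contributor": "preliminary",
    "creator": "deprecated",
}


def check_term(term, vocabulary=VOCABULARY):
    """Return (ok, message) for a free-text attribute value.

    Unknown terms fail; preliminary and deprecated terms pass, but
    come with a message the validator can show as a warning.
    """
    status = vocabulary.get(term)
    if status is None:
        return False, f"'{term}' is not in the vocabulary"
    if status == "deprecated":
        return True, f"'{term}' is deprecated"
    if status == "preliminary":
        return True, f"'{term}' is preliminary and may still change"
    return True, ""
```

The point of the sketch is only that the model keeps a plain string
while the vocabulary, including term status, lives in data the
validator consults.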

See below on the question of using URLs here.

> 2) Letting the Enum in place, making sure it is consistent with the
> vocabulary and saying in the spec that the Enum is a view on the vocabulary.
>  - Vocabulary extension requires a model update
>  + The UML is complete

That I'd consider a bad idea, as the same information is kept in two
places, which is almost always a recipe for confusion.

Back in the VO-DML review days, we tried to delineate what enums
are for (cf. http://mail.ivoa.net/pipermail/dm/2015-May/005180.html,
http://mail.ivoa.net/pipermail/dm/2015-June/005208.html and then
http://mail.ivoa.net/pipermail/dm/2016-January/005297.html); my take
would essentially be: "Use enums if you need a switch statement in
code to deal with them"; it's reasonable to have to update code when
the model changes.  But then there's no point to also represent that
list in a vocabulary, which is designed for simple evolution; updates
to it and to a model enum (will) follow a different process.

However, if terms can be added without having to change client
code, a vocabulary is the more flexible solution and is therefore
preferable.  But the terms then can't be in an enum, which is far
harder to change.

[This discussion conveniently ignores the fact that the question of
what's in code and what's in data of course depends on programming
style and use case; I guess as a guideline it's still good enough.]
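
The guideline can be sketched in Python as follows.  All names here
(Unit, to_seconds, describe_role, the labels mapping) are hypothetical
and only serve to contrast the two situations.

```python
from enum import Enum


class Unit(Enum):
    """Hypothetical model enum: code branches on every member."""
    SECONDS = "s"
    DAYS = "d"


def to_seconds(value, unit):
    # The "switch statement" case: adding a member to Unit means this
    # function has to be updated as well, so an enum is appropriate.
    if unit is Unit.SECONDS:
        return value
    if unit is Unit.DAYS:
        return value * 86400
    raise ValueError(f"unhandled unit: {unit}")


def describe_role(term, labels):
    # The vocabulary case: the term is only looked up, never branched
    # on, so new terms can appear in the data without code changes.
    return labels.get(term, term)
```

In the first function a new enum member forces a code change anyway;
in the second, the set of terms can grow without anyone touching the
client.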

> 3) Using semantic URIs (e.g. http://my-vocab#author) as Enum Items.
>  - Vocabulary extension requires a model update
>  + No item duplication

Again, it shouldn't be enum items.

But the question whether to put full URIs into the free text or just
the terms is an interesting and fundamental one.  In VOResource, I
went for just having the terms.  The consequences are:

+ "normal" users don't even notice there's RDF anywhere near them
  (which they would probably consider an advantage)
+ The strings look nice and compact.
+ An application-specific validator that knows that the terms should
  come from a specific vocabulary can still catch typos and such.
-/+ People cannot define their own terms in their own vocabularies.

Whether you consider this latter thing an advantage or a disadvantage
probably depends on your general outlook on the world as well as your
community and use case.

I'm quite sure clients will appreciate the simplicity of not having
to work with multiple different vocabularies and to work out
relationships people may or may not have declared between them; so
far, we've not seen much of this in the VO (which doesn't have to
mean much, admittedly).

You could of course go the datalink way, which is, essentially:
"Terms are URIs relative to the vocabulary URI
http://www.ivoa.net/rdf/datalink/core".  This means that if someone
writes "#this", we can reliably work out they're talking about
http://www.ivoa.net/rdf/datalink/core#this.  If they write
http://foo.bar#baz, that's still legal, and we know it's an extension
term.
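
The resolution rule just described can be sketched with Python's
standard urllib.parse.urljoin.  The function names resolve_term and
is_extension are made up for this example, and treating "does not
resolve into the core vocabulary" as "extension term" is an assumption
about how a client might classify values.

```python
from urllib.parse import urljoin

DATALINK_CORE = "http://www.ivoa.net/rdf/datalink/core"


def resolve_term(term, vocabulary_uri=DATALINK_CORE):
    """Resolve a possibly relative term URI against the vocabulary URI."""
    return urljoin(vocabulary_uri, term)


def is_extension(term, vocabulary_uri=DATALINK_CORE):
    """True if the resolved term lies outside the given vocabulary."""
    resolved = resolve_term(term, vocabulary_uri)
    return not resolved.startswith(vocabulary_uri + "#")


print(resolve_term("#this"))
# prints http://www.ivoa.net/rdf/datalink/core#this
print(resolve_term("http://foo.bar#baz"))
# prints http://foo.bar#baz
```

A fragment-only value is resolved against the vocabulary URI, while an
absolute URI passes through unchanged and is recognisable as an
extension term.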

My gut feeling is that that's a cute idea but we should really urge
people to contribute to the IVOA vocabularies.  Which means that my
recommendation is:

* Make the attributes free-text, plain, non-URI terms.
* In the spec, say "Terms here SHOULD be taken from the vocabulary
  at http://www.ivoa.net/rdf/whatever.  The process to add new terms
  is described in Vocabularies in the VO 2 [ok, that spec won't make
  it to REC before ProvDM; we can think about what to actually write
  here to encourage people to contribute to the consensus
  vocabulary]"

        -- Markus