Enumerations in VO-DML

Fri May 15 13:28:48 CEST 2015

Dear DM WG,

While reviewing the current VO-DML draft, I started disliking
enumerations.  Or SKOSconcept.  Or possibly both.  If you think I'm
crazy ("we've had enums in C since the eighties!"), that's fine.
Otherwise, grant me a few minutes before you set off into your
weekend and let me detail where that dislike led me.  Yes, it's a
bit rambling, but I promise there's going to be something like a
conclusion.

I'm coming from the experiences in the Registry WG.  In VOResource, we
have concepts like <content level> (with values like "research",
"community college", "amateur", and so on) or <relationship type>
("servedBy", "serviceFor", "derivedFrom", etc).  For both, we'd have
liked to update the vocabularies (used here in the loose sense of "word
list"): for content level, we'd like to slowly transition to fewer values
(which means adding one or two for a while) to make it work better; for
the relationship type, we'd like to add "Uses", which would let you
locate tutorials and similar material in the Registry for particular
client software or services.

We won't do either for the time being, instead relying on hacks, because
both vocabularies are baked into the schema, and a schema change, at
least as long as we're changing the target name spaces, is extremely
painful.  Note that neither change would have broken anything in the
Registry.

So, my general topic is "vocabularies" and things like them.  We've
certainly had our share of those in the VO.  There's been UCDs,
"classic" utypes were little more than a vocabulary, there's the IVOA
thesaurus (which admittedly does more than a vocabulary, but that's
true of many of the other things I mention here, too), there's at
least the plan to use SKOS, as expressed in the "Vocabularies in the
VO" REC, there's what boils down to term lists embedded in XML schema
files, mostly using XSD enumerations (I just whisper the term
"substitution group" to hint that things at times got worse than
that), there's what we're now introducing in Datalink (which is
essentially RDF disguised so nobody panics), and there's probably
more I forget about.

And now VO-DML's enumerations -- and let me quote that endless source
of wisdom that is the Zen of Python:

  $ python -c "import this" | grep way
  There should be one-- and preferably only one --obvious way to do it.
  Although that way may not be obvious at first unless you're Dutch.

(try it on your computer, I'm not making this up).

I hope I've managed to instill in you some sympathy for my qualms.

So, once I started pondering enums, given the experiences with
VOResource I first thought we shouldn't bake any word lists into data
model definitions any more.  But then -- the whole thing is just so
painful because with XML namespaces, parsing is an all-or-nothing
business: either your program knows the namespace, or it doesn't
understand a single thing.  Change the namespace, and all your
clients are broken (unless they employ ugly hacks, which admittedly
most do; but then the whole effort with namespaces is reduced to a
nuisance).  So, with our current way of changing the target namespace
on schema changes, there is no best-effort parsing (i.e., gobble up
what you understand, ignore the rest).
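
To make "best-effort parsing" concrete, here is a minimal Python sketch.
It is only an illustration under stated assumptions: the namespace URIs,
the element names and the best_effort helper are all made up for this
example and come from no IVOA schema.  The client picks up the child
elements in the one namespace it knows and notes the rest instead of
giving up on the whole document.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; it stands for "the schema version this client knows".
KNOWN_NS = "http://example.org/dm/v1"

# Hypothetical instance document mixing a known and an unknown namespace.
DOC = """<record xmlns="http://example.org/dm/v1"
        xmlns:new="http://example.org/dm/v2">
  <title>Some resource</title>
  <new:contentLevel>research</new:contentLevel>
  <relationshipType>servedBy</relationshipType>
</record>"""


def best_effort(xml_text, known_ns=KNOWN_NS):
    """Collect child elements from known_ns; record the tags of all others."""
    understood, skipped = {}, []
    prefix = "{%s}" % known_ns  # ElementTree spells qualified names as {uri}local
    for child in ET.fromstring(xml_text):
        if child.tag.startswith(prefix):
            understood[child.tag[len(prefix):]] = child.text
        else:
            skipped.append(child.tag)
    return understood, skipped


understood, skipped = best_effort(DOC)
print(understood)
print(skipped)
```

If instead the whole document moved to the v2 namespace, the same client
would understand nothing at all, which is the all-or-nothing effect
described above.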

With VO-DML, that hopefully changes -- at least in my mind, the most
attractive property of VO-DML within VOTable is that you get to do
partial "understanding", cherrypicking what parts of a DM you're
concerned with, safely ignoring everything else.  Conversely, that means
that DM changes should be much cheaper than our current schema changes.
And that means that for word lists, VO-DML and its enumerations may be
exactly the tool to use -- at least you then have them in a perfectly
computer-readable format at a well-known place, potentially rendered
as nicely formatted HTML or PDF.

Ok, so enums may be a good idea after all.  But then we have, in
addition to all the stuff listed above, yet one more
word-list-keeping tech around.  Can't we then do away with some of
them and use VO-DML enums wherever we actually have DMs?

There are two answers I could come up with.  One is that no mainstream
tools can do inference or whatever else on our VO-DML enumerations.  The
second is that our enumerations don't have anything to do inference on
in the first place, i.e., there are no semantic relations between the
terms in an enumeration.

Although I'm not aware of any VO tool or even VO research making use of
semantic relationships (other than the thesaurus browser and RE-matching
of UCDs), I believe that's actually a valid argument, as we *should* be
using such semantics, and quite possibly we're soon going to see the
first such applications when Datalink receives take-up (note the "when"
rather than "if").

After this, I figured there are three ways out:

(1) We could teach enumerations basic semantic relationships and say
"that's how we do semantics in the VO".  One disadvantage of that is
that we'd be an island in that respect.

(2) We could still throw out enumerations and only use external
vocabularies.  We'd have to make sure we exactly define what kind of
semantics clients are to understand (e.g., I'd certainly *not* want
to have to do semantics when using the <relationship type> from
VOResource).

(3) We keep both enums (simple word lists) and external vocabularies
(where semantic structure is/may be required).  In addition to that,
we'd still have UCDs and the Thesaurus, so keeping both is only a third
more technologies than throwing either one out, and at least enumerations
would remain a simple thing where nobody would ever have to bother with
hyponyms and hyperonyms, not to mention metonyms.

I have to say I'm strongly leaning towards (3), but that's mainly gut
feeling.  Assuming we went for (3), the question is: do we want to
specialise on SKOS for the semantics-enabled kind?  Here, my answer is
no.  As Norman has kindly explained to me, SKOS has surprising pitfalls;
in particular, its default narrower and broader relations aren't
transitive, which to me feels like an ugly trap waiting to snap.
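
A small Python sketch of that trap, under stated assumptions: the three
terms and the BROADER table are invented for illustration and taken from
no real vocabulary.  Only the directly asserted broader links are stored,
so a client that wants every ancestor of a term has to walk the chain
itself (SKOS offers skos:broaderTransitive as a separate property for
this purpose).

```python
# Hypothetical direct "broader" assertions: term -> set of directly broader terms.
BROADER = {
    "spiral galaxy": {"galaxy"},
    "galaxy": {"stellar system"},
    "stellar system": set(),
}


def direct_broader(term):
    """Only what was asserted directly; no inference happens here."""
    return set(BROADER.get(term, ()))


def all_broader(term):
    """Every ancestor of term, found by following direct links repeatedly."""
    seen = set()
    todo = list(BROADER.get(term, ()))
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(BROADER.get(current, ()))
    return seen


print(sorted(direct_broader("spiral galaxy")))
print(sorted(all_broader("spiral galaxy")))
```

A client that looks only at the direct links would never learn that a
spiral galaxy is also a stellar system in this toy vocabulary.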

This long and winding road leads me to a tentative suggestion: Wherever
there's SKOSconcept now, we should just have some generic URI.  And if
we're courageous, we'll say in VO-DML that people are encouraged to use
Datalink's pattern, and we'll have some place in VO-DML documents into
which people can stuff the DM's (or maybe just attribute's?) default
vocabulary URI.
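
To illustrate what a default vocabulary URI could buy us, here is a
minimal Python sketch.  It rests on assumptions: I read "Datalink's
pattern" as writing terms as fragment-only references such as "#Uses"
that get resolved against a vocabulary URI, and the vocabulary URI and
both helper functions below are hypothetical.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical default vocabulary URI, as a DM or attribute might declare it.
DEFAULT_VOCABULARY = "http://example.org/rdf/relationship"


def expand_term(value, default_vocabulary=DEFAULT_VOCABULARY):
    """Resolve "#term" against the default vocabulary; keep full URIs as they are."""
    return urljoin(default_vocabulary, value)


def split_term(uri):
    """Split a term URI into (vocabulary URI, term name)."""
    vocabulary, term = urldefrag(uri)
    return vocabulary, term


print(expand_term("#Uses"))
print(expand_term("http://other.example.org/vocab#cites"))
print(split_term(expand_term("#Uses")))
```

A client that only needs a word list can compare the fragment and stop
there; one that wants semantics can fetch the vocabulary URI and do
whatever inference it likes.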

Thoughts?  Counter-Rambles?

Thanks for your patience,

           Markus