VEP6: blurry definition for the term #calibration

Wed May 12 09:37:55 CEST 2021

Hi,

On Tue, May 11, 2021 at 10:03:26AM -0700, Patrick Dowler wrote:
> You can make the above ~sentence without worrying about the extra words a
> natural language uses and everyone knows what it means. Specifically,
> everyone naturally assumes that the first ... means "is a" (present).
> However, with calibration it isn't so clear:
> 
> {link} ... #calibration ... #this
> 
> The debate here is whether that ~sentence means "is a" (not yet applied) or
> "was a" (applied). The relationship is #calibration in both cases and while
> it is tempting to restrict the meaning to the actionable "is a" (present
> tense) I think that kind of approach means we more or less double the
> number of terms: isCalibration and wasCalibration? ugh.

Frankly, I'd not worry so much about these extra terms.  The
vocabulary will not explode either way, so the cost isn't terribly
high.  We have a hierarchy, and so the number of root terms (i.e.,
those without a parent) matters more than the total number of terms.
Also, I don't think #flat, #bias, and #dark will be needed in a
prospective #instrumental-signature (for "progenitor that's not the
raw data") concept.

Having the two concepts "Was used in calibration" and "Can be used
for calibration" separate in datalink, on the other hand, has a clear
benefit in terms of the main use case for the semantics column,
filtering.  You'll fold out links in the "Was used" concept if
you want to debug the data set, and you'll fold out links in
"Can be used" if you want to produce a more refined version of #this.

[In case you're wondering about "fold out", see again the
presentation of a datalink document as in
http://dc.g-vo.org/flare_survey/q/mdl/dlmeta?ID=ivo%3A//org.gavo.dc/~%3Fflare_survey/data/plates/ESO040_004362.fits;
to see the sort of difference that foldability makes, try viewing
this in TOPCAT or with javascript disabled]
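
To make the filtering use case concrete, here is a minimal Python
sketch.  The term names #calibration-used and #calibration-usable, their
placement in the tree, and the example URLs are all made up for
illustration; none of them come from a VEP or the current vocabulary.

```python
# Hypothetical vocabulary fragment: term -> parent (None marks a root term).
# "#calibration-used" and "#calibration-usable" are invented names.
PARENT = {
    "#progenitor": None,
    "#calibration-used": "#progenitor",
    "#auxiliary": None,
    "#calibration-usable": "#auxiliary",
    "#preview": None,
}


def is_a(term, ancestor):
    """Return True if term equals ancestor or descends from it."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def fold_out(links, concept):
    """Select the access URLs of links whose semantics fall under concept."""
    return [url for url, semantics in links if is_a(semantics, concept)]


# Made-up datalink rows: (access_url, semantics).
LINKS = [
    ("http://example.org/raw.fits", "#progenitor"),
    ("http://example.org/flat-2021-05-01.fits", "#calibration-used"),
    ("http://example.org/flat-2021-05-10.fits", "#calibration-usable"),
    ("http://example.org/thumb.png", "#preview"),
]

# Debugging the data set: what went into #this?
debug_links = fold_out(LINKS, "#calibration-used")
# Producing a more refined version of #this: what could still be applied?
refine_links = fold_out(LINKS, "#calibration-usable")
```

With a single tenseless #calibration term, both calls would have to
return both flats, and the client could not tell them apart.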

So, dropping this distinction hurts a lot.  Plus, as I said, it's
almost impossible conceptually, as I still cannot see a way to
organise #progenitor and #calibration into a tree then.  #calibration
⊂ #progenitor is out because the "Can be used" clearly isn't a
#progenitor, but #calibration clearly isn't distinct from #progenitor
either.

We won't fix this by saying #progenitor is "#this, only rawer", because
the concept "Is used in the provenance chain" still exists (even if
we've not labelled it yet) and quite likely will eventually get a
term.  Not caring now will give us a lot of headache later, because
we can't make that concept then.

> I have been convinced that the terms in a vocabulary shouldn't worry about
> present or past tense. The definition should be flexible enough to convey

I think the question of tense is confusing rather than clarifying
here, because the way we've built the vocabulary, the terms look like
nouns while they are really verb phrases (i.e., they contain some
sort of verb).  Taking out the verb may seem to simplify things, but
it really kills the semantics.

It's a lot more useful to think of the terms as labels for sets of
links (which nicely induces foldability when the sets are mutually
disjoint xor hierarchical).  Our future selves will be grateful if
we're not cavalier here.
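
The set view can be checked mechanically.  The following Python sketch
treats each term as a set of links and reports the pairs of terms that
are neither disjoint nor nested, i.e., the pairs that cannot be arranged
in a tree.  The file names and the split term names are invented for the
example; only #progenitor and #calibration are actual vocabulary terms.

```python
from itertools import combinations


def clashes(concepts):
    """Return the pairs of terms whose link sets overlap without nesting."""
    found = []
    for (a, links_a), (b, links_b) in combinations(concepts.items(), 2):
        overlap = links_a & links_b
        nested = links_a <= links_b or links_b <= links_a
        if overlap and not nested:
            found.append((a, b))
    return found


# Made-up links attached to some #this.
raw = "raw.fits"
flat_used = "flat-used.fits"      # already applied to #this
flat_usable = "flat-usable.fits"  # could still be applied to #this

# One tenseless #calibration covering both kinds of flat.
tenseless = {
    "#progenitor": {raw, flat_used},
    "#calibration": {flat_used, flat_usable},
}

# Two separate (hypothetically named) concepts instead.
split = {
    "#progenitor": {raw, flat_used},
    "#calibration-used": {flat_used},
    "#calibration-usable": {flat_usable},
}

tenseless_clashes = clashes(tenseless)
split_clashes = clashes(split)
```

Here tenseless_clashes contains the pair (#progenitor, #calibration),
while split_clashes is empty: only the split variant folds into a tree.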

> I think other commenters generally agreed that tense doesn't apply because
> we are not trying to reproduce the provenance. Can we modify VEP-006 to
> remove any sense of a tense restriction? I would propose a minor
> clarification, 2 options:
> 
> #calibration : resource to calibrate the primary data
> #calibration: resource that has or can be used to calibrate the primary data

The trouble is that we're not doing natural-language semantics here
but formal semantics.  So, before figuring out how to explain to
humans what we mean we need to figure out what we'll tell the
computer (who, really, is the consumer of this information).  This is
what I tried to express in my four alternatives.  We will just *have*
to choose among them; you can't cheat the computer.  I'll try again,
explaining the options a bit differently and adding what I think Paul
and Pat were driving at as (e):

(a) We forget about VEP-006 and the clarification of #calibration,
hoping things will work themselves out as people actually use the
stuff and there are more interfaces that better expose the trouble
with the status quo.

(b) #calibration is a child of #progenitor (and we can think about
adding the VEP-006 concept(s) as children of #auxiliary, i.e., files
aiding in the use of #this).  Again, this would work perfectly for me, but
I'll only make a VEP for that if we can approach consensus there.

(c) VEP-006 (which maybe could be improved by making #calibration a
child of #auxiliary).

(d) Deprecate #calibration and children.

(e) #calibration becomes a proper top-level term of its own, disjoint
from #progenitor and everything else.  I'm sure we'll regret that,
because calibration files simply are #progenitors (or at least belong
to the concept "earlier in the provenance chain").  Also, as I said,
top-level terms are intellectually a lot more expensive than child
terms.

> Datalink semantics is about the relationship between two things so this
> logic may apply to other terms. I would support removing "used" from child
> definitions as well (bias, dark, flat).

Of course all children of #calibration, by virtue of labeling subsets
of the concept, would share its properties.

So... how do we get on here?  Breakout meeting during the interop?

My plea: A few extra terms are a small price to pay for maintaining
proper formal semantics.

        -- Markus