VEP6: blurry definition for the term #calibration

Thu Apr 29 14:31:51 CEST 2021

Dear Semantics folks,

On Thu, Apr 22, 2021 at 04:19:52PM +0000, Paul Harrison wrote:
> > On 2021-03-27, at 12:01, Paul Harrison <paul.harrison at manchester.ac.uk> wrote:
> >> On 2021-03-26, at 15:30, Markus Demleitner <msdemlei at ari.uni-heidelberg.de> wrote:
> > 
> >> So, you're basically proposing in addition to the four options at the
> >> foot of
> >> 
> >>  http://mail.ivoa.net/pipermail/semantics/2021-March/002778.html
> >> 
[...]
> > 
> > No - I would go for a modification of your option b) and add
> > another child of #progenitor, perhaps #antecedent - though in
> > natural English I think that they are virtually exact synonyms -
> > that expresses that the file is a direct “less processed data”
> > #progenitor in the sense of my distinction that #calibration is a
> > modifier of rather than a “direct ancestor”, so that #calibration
> > and #antecedent are disjunct.

> I have thought about this proposal a little more, and it seems that
> when trying to satisfy the two use cases
> 
> * distinguish between “science” data and “calibration” data
> * distinguish between used calibration data and alternative calibration data

I think both these use cases are essentially in scope of datalink
semantics, albeit always with the proviso "in relation to #this", and
with the understanding that datalink semantics basically lets you
filter links but will *not* tell you what to *do* with what the row
talks about.  For that latter thing, there's description, or, if you
need it machine-readably, ProvDM.
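
To illustrate what "filter links" means here, a minimal sketch: the
rows and URLs below are made up, and a real client would read them
from the datalink VOTable rather than from a hard-coded list.

```python
# Hypothetical datalink rows for one #this; all URLs are invented.
ROWS = [
    {"semantics": "#this", "access_url": "http://example.org/spec.fits"},
    {"semantics": "#progenitor", "access_url": "http://example.org/raw.fits"},
    {"semantics": "#calibration", "access_url": "http://example.org/flat.fits"},
    {"semantics": "#preview", "access_url": "http://example.org/spec.png"},
]


def links_with(rows, term):
    """Return the access URLs of all rows whose semantics is exactly term."""
    return [row["access_url"] for row in rows if row["semantics"] == term]


calibration_links = links_with(ROWS, "#calibration")
```

The term only selects rows; what to do with the selected files is
left to the description column or to a provenance record.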

With the current datalink semantics, there is no #possible-progenitor
(say: "files you could use instead of calibration files actually used
in the provenance chain") that would let you define alternative
calibration data; and the concept seems so complex that I'd think
twice before writing a VEP for it.  

Instead, if a data provider wants to mark this up, I'd say the right
way is to have a #progenitor be a datalink document itself.  And in
this document, you'd have the various possible calibration files all
as #calibration (in the VEP-006 sense).

I guess what I'm trying to say is: Let's not stuff an entire
provenance tree into a single datalink document.  It'll make the
document extremely hard to work with, and will push the vocabulary
towards increasingly complex terms expressing increasingly complex
relationships (I'd call it the "Hobbit trap" for reasons readers of
the Lord of the Rings may understand). 

Having one datalink document per non-trivial #this works, I claim,
much better.  See, for instance, 

http://dc.g-vo.org/flashheros/q/sdl/dlmeta?ID=ivo://org.gavo.dc/~?flashheros/data/ca92/f0011.mt

where you can go between the split echelle orders and a merged
spectrum; if I had it, I'd even have another datalink document for
the CCD frame the split orders were extracted from (with all its
VEP-006 #calibration links) as the split orders' #progenitor.  This
will let people go as far as they need in re-calibration/debugging, or
just a single step, and at each point the datalink document lets them
pick whatever relevant artefacts there are *for the task at hand*.
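
A rough sketch of that step-by-step navigation, assuming each
#progenitor row points at another datalink document.  The document
identifiers, file names, and the in-memory DOCUMENTS table are all
hypothetical; a real client would fetch each document over HTTP.  The
content type is the one I understand Datalink to use for links to
further datalink documents.

```python
DATALINK_TYPE = "application/x-votable+xml;content=datalink"

# Hypothetical datalink documents, keyed by invented identifiers.
DOCUMENTS = {
    "merged": [
        {"semantics": "#this", "target": "merged.fits"},
        {"semantics": "#progenitor", "target": "orders",
         "content_type": DATALINK_TYPE},
    ],
    "orders": [
        {"semantics": "#this", "target": "orders.fits"},
        {"semantics": "#progenitor", "target": "frame",
         "content_type": DATALINK_TYPE},
    ],
    "frame": [
        {"semantics": "#this", "target": "frame.fits"},
        {"semantics": "#calibration", "target": "flat.fits"},
        {"semantics": "#calibration", "target": "bias.fits"},
    ],
}


def progenitor_documents(doc_id):
    """Return ids of datalink documents linked as #progenitor of doc_id."""
    return [
        row["target"]
        for row in DOCUMENTS[doc_id]
        if row["semantics"] == "#progenitor"
        and row.get("content_type") == DATALINK_TYPE
    ]


def walk_back(doc_id, max_steps):
    """Follow #progenitor links for at most max_steps steps.

    Returns the list of document ids visited, starting with doc_id.
    """
    path = [doc_id]
    for _ in range(max_steps):
        parents = progenitor_documents(path[-1])
        if not parents:
            break
        path.append(parents[0])  # this sketch follows the first one only
    return path
```

A user who only wants the split orders would stop after one step
(walk_back("merged", 1)); someone debugging the calibration would
keep going until they reach the frame and its #calibration rows.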

> The problem with the hierarchy approach for the advocates of
> wanting to distinguish between science data and calibration data is
> that it is perfectly possible for a data provider to just tag both
> #calibration and #antecedent data with #progenitor and then the
> distinction is lost. This possibility does make option e) above
> much more attractive as a way of forcing this distinction, and at

(where (e) was making #progenitor something like
"#science-progenitor").

> the moment I am wavering as to whether that is my favourite… It does
> have the advantage that if suitably defined it could encompass
> “used” or “alternative” as it is not a child of #progenitor.

Me, I'd very much like to keep the definition of existing terms
unless there's a clear need to make them more precise (as there is
for #calibration) or we're sure we need to adjust them to the broad
community's majority sentiment.  Hence, I'd surely prefer anything
that doesn't force us to change #progenitor.

But again, I think we shouldn't get too hung up on the term forms --
as long as there are clear labels and definitions, I expect in
general data providers will eventually do the right thing.

And if we feel there's a need (and clear definitions), we can always
add child terms below #progenitor for "#rawer-science-data" and
"#calibration-data-used".  Given that, can I perhaps again solicit the
pain levels (https://blog.g-vo.org/building-consensus/#scale) of the
people who have chimed in in this discussion on the four options from
http://mail.ivoa.net/pipermail/semantics/2021-March/002778.html:

(a) We keep things as they are and we just forget about datalink
semantics being a tree.  You'll understand that I'd be seriously
unhappy with that outcome.

(b) We make #calibration a child of #progenitor ("#calibration
⊂ #progenitor").  That's a fine solution, except I'd ask the
proponents of that to convince Pat, who has, in effect, proposed
VEP-006.

(c) We accept VEP-006, perhaps with some fixes to labels or
definitions (I'm totally open to suggestions); we can then have
additional terms to tell apart "science data" and "calibration files"
below #progenitor.

(d) We deprecate #calibration and children, saying the concepts
cannot be properly defined (and it'd take quite a bit of reasoning to
wear down my resistance against that).

(where of course I'll happily accept further suggestions for cleaning
this up).
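
To make the practical difference between (b) and (c) concrete, here
is a small sketch of what a client sees when it asks for #progenitor
and everything below it.  Only the fragment of the vocabulary under
discussion is modelled (the existing children of #calibration are
left out), and the two child terms under option (c) are just the
illustrative names used above, not adopted terms.

```python
# parent term -> list of child terms, for the fragment discussed here.
OPTION_B = {
    "#progenitor": ["#calibration"],
}
OPTION_C = {
    "#progenitor": ["#rawer-science-data", "#calibration-data-used"],
    "#calibration": [],  # stays outside the #progenitor branch
}


def expand(tree, term):
    """Return term followed by all of its descendants in tree."""
    found = [term]
    for child in tree.get(term, []):
        found.extend(expand(tree, child))
    return found


terms_b = expand(OPTION_B, "#progenitor")
terms_c = expand(OPTION_C, "#progenitor")
```

Under (b), a query for #progenitor also returns every #calibration
row; under (c), #calibration rows are only returned when asked for
explicitly, and the split between science data and calibration files
used lives in the child terms.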

Let's get VEP-006 off the table -- more VEPs are waiting...

Thanks,

          Markus