VEP6: blurry definition for the term #calibration

Fri Mar 26 11:54:29 CET 2021

Hi Stéphane, Dear Semantics WG,

On Thu, Mar 25, 2021 at 10:36:58AM +0100, Stéphane Erard wrote:
> The difference between #calibration and #progenitor in this context
> (calibration data for raw vs calibrated) seems unnecessarily
> complicated to me, and possibly misleading. 

Admittedly, (semi-) formal semantics is sometimes a bit tedious, but
that's because computers are tedious, and what we're doing here is
explaining things to computers.

So, the question we answer with VEP-006 (or an alternative) is, in
practical terms: should

  datalink_result.bysemantics("#progenitor")

in pyVO (with https://github.com/astropy/pyvo/pull/241 applied)
return #calibration links or not?  Since we're talking to computers,
the answer can't (usefully) be "maybe".
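
To make the question concrete, here is a minimal sketch of how a
client that also matches narrower terms would answer it under either
layout.  This is not pyVO's code; the parent maps, the helper names,
and the file names are made up for illustration.

```python
# Hypothetical stand-ins: child -> parent maps for the two layouts
# under discussion, and two fake Datalink rows.
DISJOINT = {"#progenitor": None, "#calibration": None}
NESTED = {"#progenitor": None, "#calibration": "#progenitor"}

LINKS = [
    {"id": "raw.fits", "semantics": "#progenitor"},
    {"id": "flat.fits", "semantics": "#calibration"},
]


def narrower(term, parent):
    """Return term plus all of its descendants in a child -> parent map."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, par in parent.items():
            if par in found and child not in found:
                found.add(child)
                changed = True
    return found


def by_semantics(links, term, parent):
    """Select the links whose semantics is term or a narrower term."""
    wanted = narrower(term, parent)
    return [link for link in links if link["semantics"] in wanted]


# Siblings: asking for #progenitor leaves the calibration file out.
print([l["id"] for l in by_semantics(LINKS, "#progenitor", DISJOINT)])
# Child: asking for #progenitor returns the calibration file, too.
print([l["id"] for l in by_semantics(LINKS, "#progenitor", NESTED)])
```

Either way the vocabulary alone determines the result; there is no
place in this lookup where "maybe" could go.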

This has a mathematical background.  As explained in the Vocabularies
spec (shameless plug: It's in RFC, review now!), our terms correspond
to concepts, that is, subsets of our universe of discourse (well,
actually, these subsets are called "extensions"; cf.
https://ivoa.net/documents/Vocabularies/20210114/PR-Vocabularies-2.0-20210114.html#tth_sEc5.2.4).

Datalink is a tree-like vocabulary, and that means that any two
concepts either need to be disjoint, or one needs to be a subset of
the other.

Hence, the root of the matter is to figure out whether

* #calibration is disjoint with #progenitor or
* #calibration ⊂ #progenitor

– or we'll have to scrap one of the concepts, since I'm sure
#calibration ⊃ #progenitor is not an option.
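
In set terms, the tree condition can be checked mechanically.  The
following is only a sketch with a made-up universe of links; the
extensions listed here are assumptions for illustration, not anything
taken from an actual service.

```python
from itertools import combinations


def is_tree_like(extensions):
    """True if every pair of extensions is disjoint or nested."""
    for a, b in combinations(extensions.values(), 2):
        if not (a.isdisjoint(b) or a <= b or b <= a):
            return False
    return True


# Hypothetical extensions (sets of links) for the two terms.
disjoint = {
    "#progenitor": {"raw1", "raw2"},
    "#calibration": {"flat", "dark"},
}
nested = {
    "#progenitor": {"raw1", "raw2", "flat", "dark"},
    "#calibration": {"flat", "dark"},
}
# A calibration file that is "sometimes" a progenitor: the sets
# overlap, but neither contains the other.
sometimes = {
    "#progenitor": {"raw1", "raw2", "flat"},
    "#calibration": {"flat", "dark"},
}

print(is_tree_like(disjoint), is_tree_like(nested), is_tree_like(sometimes))
```

The first two layouts pass; the third, where the sets merely overlap,
is the one a tree-like vocabulary cannot express.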

> In any case, I would certainly reserve #progenitor to identify
> calibrated products used to build a derived product. If this also

Did you mean "uncalibrated products used to build #this"?  If that is
true, that's a possibility, but we would then have to fix
#progenitor's definition (anyone up for a VEP?).

> I would also certainly expect the calibration process to be
> complex, instrument / experiment dependent, evolving with time,
> with inclusion of extra steps and alternative

Right -- but that's not in Datalink's purview any more, that's hard-core
Provenance.

> In short, I think datalink alone cannot always provide all the
> information needed to describe the actual calibration process, and
> I wouldn’t rely on that. 

Exactly.

> When accessing calibrated data, what I really need is a detailed
> description of the calibration, and this goes beyond a list of
> calibration files in the general case. Listing the calibration

Right.  It would be an interesting exercise to use ProvDM to annotate
a Datalink response with that extra information.  That's far beyond
our current question, though I'd consider it exceedingly useful as a
way to furnish things with provenance information without changing
them.

> Therefore, I don’t see any compelling reason to use anything else
> than #calibration in both cases, as the concept « can be used » vs
> « was used » is given by the status of the file (raw vs
> calibrated). Using different tags when a calibration file has been
> applied or not (eg for calibrated and raw data files) is not
> helpful, but is not a show-stopper to me either. But please don’t
> mix observations with calibration files under the same #progenitor
> tag - that would become difficult to entangle.

I understand (and half-heartedly support) this use case; but the
non-destructive (to the vocabulary) way to deal with this (unless we
want to defer it to ProvDM annotation) is to define terms that are
children of #progenitor that make this distinction.  I'm happy to
assist in writing a VEP to do that. 

But saying a #calibration file sometimes is a #progenitor and
sometimes not is subverting the whole scheme, and that's a big step
to take.

So, the situation as I see it is that we'll have to choose one of the
following options:

(a) We keep things as they are and we just forget about datalink
semantics being a tree.  You'll understand that I'd be seriously
unhappy with that outcome.

(b) We make #calibration a child of #progenitor ("#calibration
⊂ #progenitor").  That's a fine solution, except I'd ask the
proponents of that to convince Pat, who has, in effect, proposed
VEP-006.

(c) We accept VEP-006, perhaps with some fixes to labels or
definitions (I'm totally open to suggestions); we can then have
additional terms to tell apart "science data" and "calibration files"
below #progenitor.

(d) We deprecate #calibration and its children, saying the concepts
cannot be properly defined (and it'd take quite a bit of reasoning to
wear down my resistance against that).

I think that's about it -- or have I forgotten some additional option?

I'd be grateful if people could voice their preferences...

Thanks,

         Markus