about #calibration (VEP-006) : ----> IMPORTANT for DataLInk EXTENDED USAGE

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Tue Oct 12 09:16:25 CEST 2021


Laurent, All,

[A general plea below; please read that if you're considering
entering this fray]

On Mon, Oct 11, 2021 at 03:42:37PM +0200, Laurent Michel wrote:
> I don't want to add confusion to this discussion, but I don't see a
> situation where the DL client really needs to know if a calibration is
> applicable or already applied.

Well, these are two entirely different cases.  "Calibration applied"
would be necessary when debugging #this (once there are such links),
"calibration applicable" is helpful when using #this.

Hence, I'd turn this around and say: There are no cases when the
distinction might not matter.

> For me, saying that the calibration is 'relevant-for' should be enough.
> If it is not, if one need to specify the tense, then it looks reasonable to
> support both possible directions (applied and applicable)

Well, the "relevant-for" is already indicated by the inclusion in the
datalink document.  If that's all we want, we can drop the semantics
column.  The tree view in the datalink XSLT has convinced me that
that would be a shame.


GENERAL PLEA
============

Perhaps we should have sent around summaries of the off-mailing list
discussions we had on VEP-006; this might have saved some cycles in
these discussions.  Anyway, if you are considering entering the fray, please
carefully read the following exposition to avoid unnecessary
repetitions of arguments, and in particular make sure you state where
you disagree and what your dissenting position is.

You see, VEP-006 isn't a matter of taste, it's fixing a bug.  A bug
that's currently not biting us, but only because a certain class of
links hasn't been used yet.


Theoretical Background
----------------------

Our formal vocabularies (in the case of datalink, an RDF properties
vocabulary) are, mathematically, graphs of concepts.  A concept is a
subset of the universe of discourse, which in the case of Datalink is
the Cartesian product of datasets × (URI resources) [1]; said a bit
less abstractly: A datalink document assigns labels to pairs of
pubDIDs and generic URIs.

That Datalink concepts are relations rather than sets of things makes
things look a bit tricky, but don't let that distract you.  Think of
animal taxonomies if you're confused: The Alpacas are a subset of the
Camelids, which are a subset of the Mammals.  Yes, the world usually
isn't structured like that, but we're building *models* here to make
*computers* interact with the world in useful ways.  Semantics is
useful only insofar as it does this: let computers do useful things.

Anyway, within this graph of the datalink vocabulary, the main
relationship is rdfs:subPropertyOf.  Basically, this relationship
means that if A is a subproperty of B, then A is a subset of B.
This, in particular, means that A cannot have elements that are not
elements of B.
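
To make the subset reading concrete, here is a minimal sketch in
Python.  It models each concept as a plain set of (pubDID, URI) pairs;
the identifiers and URLs are invented for illustration, and
is_subproperty is a hypothetical helper, not part of any RDF library.

```python
# Toy model: a concept is the set of (pubDID, linked URI) pairs
# carrying that label.  All pairs below are made up.
bias = {
    ("ivo://example/raw1", "http://example.org/bias1.fits"),
}
calibration = bias | {
    ("ivo://example/raw1", "http://example.org/flat1.fits"),
}


def is_subproperty(a, b):
    """Return True if every pair labeled a is also labeled b."""
    return a <= b


print(is_subproperty(bias, calibration))   # True
print(is_subproperty(calibration, bias))   # False
```

The second check fails because of the flat: it is an element of
calibration that is not an element of bias.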


The Calibration Problem
-----------------------

#calibration, as defined pre-VEP-006, covers all kinds of files that
can somehow be used for calibration.  This, in particular, can
concern files coming with (relatively) raw data -- the classical
example is a raw CCD frame that comes with flats, bias frames, and
whatever else; but note that today's real cases tend to be a lot
trickier, and it's usually nowhere near as easy any more to tell "science"
from "calibration" data -- on the one hand, and similar files people
may want to attach to the reduced data to aid in debugging on the
other.

Meanwhile, we have a top concept #progenitor that, despite its
current identifier and label, really is "Part-of-Provenance".  We may
want to discuss whether it's a good idea to have the identifier
#progenitor for this, and I think I agree the label "Progenitor"
ought to be changed, but that's a different discussion.  The concept
"Part-of-Provenance" is there, and I don't think anyone disputes that
it's useful.

As soon as we have this concept, pre-VEP-006 #calibration is a
problem, because parts of it belong to Part-of-Provenance (although
nobody has yet spotted any of that in the wild), and other parts
do not.

As argued above, we can't have that.
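
A toy illustration of the conflict, assuming concepts are modelled as
sets of (pubDID, URI) pairs.  All identifiers are made up, and the two
flats merely stand in for "still to be applied" and "already applied"
calibration files.

```python
# Made-up (pubDID, URI) pairs; none of these resources exist.
flat_to_apply = ("ivo://example/raw1", "http://example.org/flat.fits")
flat_applied = ("ivo://example/red1", "http://example.org/flat-used.fits")
raw_input = ("ivo://example/red1", "http://example.org/raw1.fits")

part_of_provenance = {flat_applied, raw_input}
old_calibration = {flat_to_apply, flat_applied}   # pre-VEP-006 reading

inside = old_calibration & part_of_provenance
outside = old_calibration - part_of_provenance

# A subproperty of Part-of-Provenance must have nothing outside it.
print(old_calibration <= part_of_provenance)   # False
# Yet simply keeping the two unrelated ignores the pairs inside.
print(len(inside), len(outside))               # 1 1
```

Declaring the old concept a subproperty would be wrong because of
flat_to_apply; not declaring it one hides that flat_applied is
provenance.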


Separation of Concerns
----------------------

The obvious solution is to split up the current concept.  This is
what VEP-006 does, taking away anything that's part of the
provenance.
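
In a toy picture where concepts are sets of (pubDID, URI) pairs, this
split amounts to a set difference.  The following is only a sketch
under that assumption, with invented identifiers:

```python
# Made-up (pubDID, URI) pairs; none of these resources exist.
flat_to_apply = ("ivo://example/raw1", "http://example.org/flat.fits")
flat_applied = ("ivo://example/red1", "http://example.org/flat-used.fits")
raw_input = ("ivo://example/red1", "http://example.org/raw1.fits")

part_of_provenance = {flat_applied, raw_input}
old_calibration = {flat_to_apply, flat_applied}

# VEP-006, in this picture: drop everything that is provenance.
new_calibration = old_calibration - part_of_provenance

print(new_calibration == {flat_to_apply})              # True
print(new_calibration.isdisjoint(part_of_provenance))  # True
```

After the split, no pair is half in and half out of
Part-of-Provenance any more.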

One could do it the other way round, taking out all that's *not* part
of provenance, but there are two reasons why that's rather clearly
less desirable:

(a) there are links in the wild matching VEP-006's definitions, but
none that don't.

(b) #calibration has subproperties #bias, #flat, and #dark.  It is
conceivable that, with a bit of care, that semantics is marginally
enough to enable the "use data" use case for a certain class of raw
data ("harmless CCD frames", say).  The use case of the
Part-of-Provenance concept is debugging, and hence there's always a
human figuring out what is what.  For them, inspecting the
descriptions is easy, and hence there's no remotely plausible
scenario where the subproperties might come in handy.

There's a third option: deprecating #calibration and inventing
something else.

But that's really it.  We'll simply have to choose one of
these three options, or we'll knowingly keep a potentially harmful
bug in the vocabulary.


Blocking anything?
------------------

Based on these considerations, I'd say VEP-006 is obvious.  Again, if
you disagree, please state clearly what part of this derivation you
disagree with.  If there really is an error in this derivation, I
can't fix it if you just say "I don't believe you" or "I feel things
should be different".

In particular, VEP-006 explicitly leaves open the question of what to
do with calibration-type Part-of-Provenance links once somebody wants
to have them, in contrast to what François seems to occasionally
imply.

If they come along, we can do any of the following:

* Stick them into Part-of-Provenance (whatever this will then be
  labeled as) 
* Create a child of that Part-of-Provenance concept somehow trying to
  define what exactly makes calibration data calibration data if we
  find a use case where that's necessary
* Stick the whole concept somewhere entirely different because when
  we actually understand why someone creates such links we notice
  that's not about debugging at all but about... well, I don't know,
  but if it happens, VEP-006 will certainly not be our problem.

So... Feel free to discuss on.  But please do it on the basis of what
was already worked out regarding VEP-006 in the past 13 months
(phewy!), and please do not ignore that whatever we do must be
consistent with RDF and the wider world of semantics.

Thanks,

            Markus


[1] ok, this is a bit of a simplification, but bear with me here;
making this more careful wouldn't change the conclusions.

