VEP6: blurry definition for the term #calibration

Wed Mar 24 12:56:53 CET 2021

Hi Baptiste,

On Wed, Mar 24, 2021 at 11:22:37AM +0100, Baptiste Cecconi wrote:
> There should be a way to tell that a dataset is a #calibration
> product for #this. This is not related to the fact that the
> calibration has been applied or not.

Oh yes, it is, because that determines *when* (as in: while doing
what) you'd like to see the data.

You see, Datalink semantics is, I claim, about filtering out links
not relevant to you for a given task.  For instance, you want
calibration data that has already been applied when "debugging",
whereas calibration data not yet applied is (probably) necessary when
"trying to use".

Hence, I'm *very* confident that we want to tell the two cases apart
*in datalink*.
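
To make the filtering idea concrete, here is a minimal Python sketch.
It assumes the VEP-006 reading (#calibration for data not yet applied,
#progenitor for data already reflected in #this); the task-to-term
mapping, the URLs and the function name are made up for illustration
and are not part of any spec.

```python
# Hypothetical mapping from a task to the semantics terms relevant for it.
# The task names follow the mail; the mapping itself is an assumption.
TASK_TERMS = {
    "debugging": {"#progenitor"},
    "trying to use": {"#calibration"},
}

# Made-up datalink rows for a single #this: (access_url, semantics).
ROWS = [
    ("http://example.org/dark-applied.fits", "#progenitor"),
    ("http://example.org/flat-to-apply.fits", "#calibration"),
    ("http://example.org/preview.png", "#preview"),
]


def links_for(task, rows):
    """Return the access URLs whose semantics matter for the given task."""
    wanted = TASK_TERMS[task]
    return [url for url, semantics in rows if semantics in wanted]
```

With a single term covering both cases, a client could not make this
selection without fetching and inspecting the linked data.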

Whether outside of datalink the two cases ought to be dealt with in
parallel is another question we can discuss some other day.

> It seems to me that we're mixing the provenance (how the dataset
> has been produced) and the qualification of a linked dataset
> (what's the purpose of the dataset)… 

Yes we do -- datalink semantics is built like that, because it
doesn't ask what something *is* but what it can be *used for* in
relation to #this.

That's a design decision that I'd argue was an exceedingly good idea
for datalink -- and perhaps it should be stressed a bit further in
the spec.

And while I'm writing, let me also reply to Mireille:

> > On 23 March 2021 at 22:13, Mireille LOUYS <mireille.louys at unistra.fr> wrote:
> > Term: #calibration
> > Action: Modification
> > Description: Data products that can be used to remove instrumental
> >   signatures from #this.  
> > 
> > I agree with this first sentence. 
> > I suggest we could even say : "Data products relevant to remove
> > instrumental signatures from #this."

... which is a weaker claim, and by my logic above too weak a claim:
it doesn't tell you what you can use the linked data for with respect
to #this, as you wouldn't know whether you want it for using or for
debugging.

> > I disagree with the two following sentences here below: 
> > 
> > Note that the calibration steps such data products feed have not
> > been applied to #this yet.   To link calibration data already
> > reflected in #this, use #progenitor.
> > 
> > In my understanding, when a dataset is tagged as calibration,
> > this has nothing to do with the fact it has been applied or it is
> > recommended to apply on the dataset in consideration:  #this

This is exactly what we're trying to clarify here; and the fact that
there are these different ideas around clearly shows we need to
change *something*.  VEP-006 is one way to achieve this
clarification.  Others are of course possible.

> > How should I name the datalink semantic tag when I link a PSF
> > dataset to an observation datafile which was used and applied
> > already ? 
> > In case I have 10 progenitors , I don't want to have to sort
> > between all #progenitor-tagged datasets to be able to find which
> > is the PSF one, used for calibration. 

If "give me the assumed PSF" is a use case, we can still accept
VEP-006 and create a new term, perhaps #psf-assumed, that is a child
of #progenitor.  I'm not saying this is what I'd do, I'm just saying
this doesn't force us to have #calibration in #progenitor.
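
As a rough sketch of how a child term would address the "sort through
10 progenitors" worry: a client that understands the hierarchy can ask
for the narrow term or for everything under #progenitor.  Note that
#psf-assumed is only a term floated here, not an accepted one, and the
file names and the helper function are invented for this example.

```python
# Hypothetical "broader term" relation; #psf-assumed is NOT an accepted term.
BROADER = {"#psf-assumed": "#progenitor"}


def is_a(term, ancestor):
    """Return True if term equals ancestor or is a descendant of it."""
    while term is not None:
        if term == ancestor:
            return True
        term = BROADER.get(term)
    return False


# Made-up datalink rows for a merged cube: (access_url, semantics).
rows = [
    ("C1.fits", "#progenitor"),
    ("psf.fits", "#psf-assumed"),
    ("flat.fits", "#calibration"),
]

# Everything that went into #this, including the PSF.
progenitors = [url for url, sem in rows if is_a(sem, "#progenitor")]
# Only the PSF that was used.
psf_only = [url for url, sem in rows if is_a(sem, "#psf-assumed")]
```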

> > Example for merging IFU data cube: 
> > I consider a cube ( 2D+lambda) obtained in a Fusion
> > (recombination) operated on 10 observed cubes. The Fusion uses a
> > PSF cube as calibration within the process. This PSF's version
> > and properties are important to evaluate the quality of the
> > fusion operation and the final merged data product. 
> > In this case I would like to distinguish the two categories:
> > final-merged cube 
> > 	linked via #progenitor :C1, ... C10
> > 	linked via #calibration: PSFCube
> > Some fusion processing could also use 50 cubes, or more.

Can I paraphrase this as "I want to be able to tell apart 'data
progenitors' and 'calibration progenitors'"?  I suppose that's a use
case I find convincing.

> > One more reason : #progenitor should be reserved to designate the
> > data in transformation through various steps within a pipeline.
> > this applies to the data stream...  calibration, configuration,
> > parameter sets have a distinct nature with respect to the data
> > processing.  The two categories should not be mixed, in my view. 

Well, they currently are, as by our current descriptions, I'd be
absolutely justified to link all of them as #progenitor.  If there
are reasons to pick them apart (and the use case from my last
paragraph would count for me), let's make terms for that.

> > #calibration is not linked to the temporal aspect, life of a
> > dataset.  I see it as a contribution for the astronomer to
> > evaluate the pertinence of a discovered data set with respect to
> > scientific criteria.
> > 
> > I hope this helps to clarify the confusion between #progenitor and
> > #calibration.

Well... we'll have to clearly say what we want in the vocabulary, so
if we don't do VEP-006, we'll have to do something else.  Let's see
where we stand:

Do you agree in principle that it is desirable to tell apart dark
frames already applied from dark frames to be applied based on my
above argument on how datalink semantics should help filter things
necessary for some kind of action on #this?

In case you still have doubts there: Consider again that datalink
rows are intended to be read as RDF triples.  This means a file
master-flat.fits can be #this in one document.

In the next document, the datalink file for raw-data.fits, it would
be #flat by the current proposal.

Finally, in reduced-data.fits, the same file master-flat.fits now
receives the semantics #progenitor.

This is to illustrate that the datalink semantics is *not* a property
of a file in access_url nor one of #this.  It is a *relation* between
#this and the thing at access_url; it's absolutely ok for an
artefact's semantics to change from datalink file to datalink file.
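
The three documents above can be written down as triples in a small
Python sketch.  The tuple layout (#this, semantics, access_url) and the
helper name are my own choice for illustration; #flat is the term from
the current proposal, not an established one.

```python
from collections import namedtuple

# One datalink row read as a triple: subject, predicate, object.
Link = namedtuple("Link", "this semantics access_url")

LINKS = [
    # Datalink document for master-flat.fits itself.
    Link("master-flat.fits", "#this", "master-flat.fits"),
    # Datalink document for the raw data: flat still to be applied.
    Link("raw-data.fits", "#flat", "master-flat.fits"),
    # Datalink document for the reduced data: flat already applied.
    Link("reduced-data.fits", "#progenitor", "master-flat.fits"),
]


def semantics_of(access_url, links):
    """Map each #this to the semantics under which it links access_url."""
    return {l.this: l.semantics for l in links if l.access_url == access_url}
```

The same access_url shows up with three different semantics, one per
#this, which is exactly the point: the term belongs to the pair, not to
the file.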

Having said all that, the question is what to do with #calibration.
If you can't stand VEP-006, I'd be fine with making #calibration a
child of #progenitor -- from my brief survey a while ago it would
seem #calibration isn't so heavily used today that we'd break many
datalink documents.

But then it would be clear that #calibration is *not* "data you can use
to calibrate some raw data", and I think at some point we'll want
that as a concept, in particular as people bring raw data to the VO
(which they now can do a lot better, since there is Datalink).

On the other hand, given the choice I'd tend to follow what the
original author (Pat) had in mind as the meaning of the term -- and
that's what VEP-006 is proposing.

Accepting that, your additional use case ("distinguish 'science data'
and 'calibration data' among the progenitors") could sensibly be
addressed using children of #progenitor.  I'd even write the VEPs for
those myself if you help me with clear and testable definitions of
"science data" and "calibration data" -- and have an example where
these would then be used.

Thanks,

           Markus