VEP6: blurry definition for the term #calibration

Stéphane Erard stephane.erard at obspm.fr
Thu Mar 25 10:36:58 CET 2021


Hello


My 2 cents - I essentially concur with Mireille and Baptiste. 
When browsing a dataset I need to, in this order:
1- tell apart observations from calibration and housekeeping files
2- identify calibrated and raw observation files (and possibly derived ones)
3- possibly identify calibration files used/to be used in the calibration process for a given observation, if this work has already been done, or can be automated
4- access the detailed calibration history of a calibrated observation (successive calibration steps, reference to calibration files and other quantities used in the process)

I would expect 1-2 to be provided by some simple flags associated with the file; 4 to be described (if possible) in the calibrated files themselves, and I understand Provenance can provide this; 3 somewhere in the middle, depending on context, and actual data level (calibrated vs derived).

The difference between #calibration and #progenitor in this context (calibration data for raw vs calibrated) seems unnecessarily complicated to me, and possibly misleading. 
But I would certainly expect #progenitor to document derived data (made from multiple calibrated files), with #calibration relating to extra calibration steps in this process (the common PSF in Mireille’s example).
The use case I have in mind is a spectral parameter map built from many spectral cubes, but I think the conclusion is identical to Mireille’s example. 
In any case, I would certainly reserve #progenitor for identifying calibrated products used to build a derived product. If this also links to calibration files, it will become difficult to identify the building pieces - at this stage, I’m no longer interested in details of calibration.
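
To make the point about building pieces concrete, here is a minimal sketch in Python. It assumes the datalink rows of a derived product are available as simple (#this, access_url, semantics) tuples; the file names and the helper group_by_semantics are purely illustrative and do not come from any real service or library.

```python
from collections import defaultdict

# Hypothetical datalink rows for a derived product (a spectral parameter map).
# Each row is (ID of #this, access_url, semantics); all file names are made up.
ROWS = [
    ("param-map.fits", "cube-01.fits", "#progenitor"),
    ("param-map.fits", "cube-02.fits", "#progenitor"),
    ("param-map.fits", "cube-03.fits", "#progenitor"),
    ("param-map.fits", "common-psf.fits", "#calibration"),
]


def group_by_semantics(rows):
    """Group the linked files by the semantics term of each row."""
    groups = defaultdict(list)
    for _this, access_url, semantics in rows:
        groups[semantics].append(access_url)
    return dict(groups)


links = group_by_semantics(ROWS)

# The calibrated products the map was built from.
building_pieces = links.get("#progenitor", [])
# Extra calibration inputs of the derivation step, kept apart.
extra_calibration = links.get("#calibration", [])

print(building_pieces)
print(extra_calibration)
```

If calibration files were also tagged #progenitor, the first list would mix cubes and calibration inputs, and a client would need extra knowledge to separate them.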

I would also certainly expect the calibration process to be complex, instrument / experiment dependent, evolving with time, with inclusion of extra steps and alternative techniques/algorithms. In particular, some calibration files are only relevant to specific calibration steps, which may or may not be included in a pipeline; various algorithms for noise reduction or resampling can be applied; calibration may use data which are not included in the dataset (e.g. in my use case, a set of photometric correction coefficients computed on the fly, or spacecraft location / attitude files which are included in another dataset).
In short, I think datalink alone cannot always provide all the information needed to describe the actual calibration process, and I wouldn’t rely on that. 
When accessing raw data, I would appreciate being told: (this file) is a possible calibration input for (that dataset) [or in the best possible case for (that particular file), e.g. in a context where spectral registration changes over time]. But I also expect this situation to be unusual, as it requires some (unfinished) work on a raw dataset. 
When accessing calibrated data, what I really need is a detailed description of the calibration, and this goes beyond a list of calibration files in the general case. Listing the calibration files actually used is probably the most I would expect from datalink, but in general this wouldn’t encompass the complete calibration process. And if I want to modify the calibration process, I’ll start from the raw data. 

Therefore, I don’t see any compelling reason to use anything other than #calibration in both cases, as the concept « can be used » vs « was used » is given by the status of the file (raw vs calibrated). Using different tags when a calibration file has been applied or not (e.g. for calibrated and raw data files) is not helpful, but is not a show-stopper to me either. But please don’t mix observations with calibration files under the same #progenitor tag - that would become difficult to disentangle.

Similarly, I don’t think « calibration progenitor » is particularly useful. If a master dark is included, it seems that #progenitor is fit to link to the individual files used in the summation (this is actually a derived product, not a calibrated one), and the master dark can be described with #calibration when related to an observation file, be it calibrated or raw.
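
The reading proposed in the last two paragraphs can be sketched as follows, in Python. This is only an illustration under stated assumptions: the processing level of #this is taken to be known from elsewhere (the LEVELS table below is hypothetical), and the file names and the describe helper are invented for the example.

```python
# Hypothetical processing level of each file that appears as #this.
LEVELS = {
    "raw-obs.fits": "raw",
    "calibrated-obs.fits": "calibrated",
    "master-dark.fits": "derived",
}

# Hypothetical datalink rows: (ID of #this, access_url, semantics).
ROWS = [
    ("raw-obs.fits", "master-dark.fits", "#calibration"),
    ("calibrated-obs.fits", "master-dark.fits", "#calibration"),
    ("master-dark.fits", "dark-001.fits", "#progenitor"),
    ("master-dark.fits", "dark-002.fits", "#progenitor"),
]


def describe(row):
    """Spell out one link, reading #calibration according to the level of #this."""
    this, target, semantics = row
    if semantics == "#calibration":
        level = LEVELS[this]
        if level == "raw":
            verb = "can be used to calibrate"
        elif level == "calibrated":
            verb = "was used to calibrate"
        else:
            raise ValueError(f"no reading defined for level {level!r}")
        return f"{target} {verb} {this}"
    if semantics == "#progenitor":
        return f"{target} is a building piece of {this}"
    raise ValueError(f"unexpected semantics {semantics!r}")


descriptions = [describe(row) for row in ROWS]
for line in descriptions:
    print(line)
```

The same tag, #calibration, yields « can be used » for the raw file and « was used » for the calibrated one, while #progenitor only ever points at the frames summed into the master dark.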

Cheers
Stéphane


> On 24 March 2021 at 12:56, Markus Demleitner <msdemlei at ari.uni-heidelberg.de> wrote:
> 
> Hi Baptiste,
> 
> On Wed, Mar 24, 2021 at 11:22:37AM +0100, Baptiste Cecconi wrote:
>> There should be a way to tell that a dataset is a #calibration
>> product for #this. This is not related to the fact that the
>> calibration has been applied or not.
> 
> Oh yes, it is, because that determines *when* (as in: while doing
> what) you'd like to see the data.
> 
> You see, Datalink semantics is, I claim, about filtering out links
> not relevant to you for a given task.  For instance, calibration data
> already applied you want when "debugging", whereas calibration data
> not yet applied is (probably) necessary when "trying to use".
> 
> Hence, I'm *very* confident that we want to tell the two cases apart
> *in datalink*.
> 
> Whether outside of datalink the two cases ought to be dealt with in
> parallel is another question we can discuss some other day.
> 
>> It seems to me that we're mixing the provenance (how the dataset
>> has been produced) and the qualification of a linked dataset
>> (what's the purpose of the dataset)… 
> 
> Yes we do -- datalink semantics is built like that, because it
> doesn't ask what something *is* but what it can be *used for* in
> relation to #this.
> 
> That's a design decision that I'd argue was an exceedingly good idea
> for datalink -- and perhaps it should be stressed a bit further in
> the spec.
> 
> And while I'm writing, let me also reply to Mireille:
> 
>>> On 23 March 2021 at 22:13, Mireille LOUYS <mireille.louys at unistra.fr> wrote:
>>> Term: #calibration
>>> Action: Modification
>>> Description: Data products that can be used to remove instrumental
>>>  signatures from #this.  
>>> 
>>> I agree with this first sentence. 
>>> I suggest we could even say : "Data products relevant to remove
>>> instrumental signatures from #this."
> 
> ... which is a weaker claim, and by my logic above too weak a claim:
> it doesn't tell you what you can use the linked data for with respect
> to #this, as you wouldn't know whether you want it for using or for
> debugging.
> 
>>> I disagree with the two following sentences here below: 
>>> 
>>> Note that the calibration steps such data products feed have not
>>> been applied to #this yet.   To link calibration data already
>>> reflected in #this, use #progenitor.
>>> 
>>> In my understanding, when a dataset is tagged as calibration,
>>> this has nothing to do with the fact it has been applied or it is
>>> recommended to apply on the dataset in consideration:  #this
> 
> This is exactly what we're trying to clarify here; and the fact that
> there are these different ideas around clearly shows we need to
> change *something*.  VEP-006 is one way to achieve this
> clarification.  Others are of course possible.
> 
>>> How should I name the datalink semantic tag when I link a PSF
>>> dataset to an observation datafile which was used and applied
>>> already ? 
>>> In case I have 10 progenitors , I don't want to have to sort
>>> between all #progenitor-tagged datasets to be able to find which
>>> is the PSF one, used for calibration. 
> 
> If "give me the assumed PSF" is a use case, we can still accept
> VEP-006 and create a new term, perhaps #psf-assumed, that is a child
> of #progenitor.  I'm not saying this is what I'd do, I'm just saying
> this doesn't force us to have #calibration in #progenitor.
> 
>>> Example for merging IFU data cube: 
>>> I consider a cube ( 2D+lambda) obtained in a Fusion
>>> (recombination) operated on 10 observed cubes. The Fusion uses a
>>> PSF cube as calibration within the process. This PSF's version
>>> and properties are important to evaluate the quality of the
>>> fusion operation and the final merged data product. 
>>> In this case I would like to distinguish the two categories:
>>> final-merged cube 
>>> 	linked via #progenitor :C1, ... C10
>>> 	linked via #calibration: PSFCube
>>> Some fusion processing could also use 50 cubes, or more.
> 
> Can I paraphrase this as "I want to be able to tell apart 'data
> progenitors' and 'calibration progenitors'"?  I suppose that's a use
> case I find convincing.
> 
> 
>>> One more reason : #progenitor should be reserved to designate the
>>> data in transformation through various steps within a pipeline.
>>> this applies to the data stream...  calibration, configuration,
>>> parameter sets have a distinct nature with respect to the data
>>> processing.  The two categories should not be mixed, in my view. 
> 
> Well, they currently are, as by our current descriptions, I'd be
> absolutely justified to link all of them as #progenitor.  If there
> are reasons to pick them apart (and the use case from my last
> paragraph would count for me), let's make terms for that.
> 
> 
>>> #calibration is not linked to the temporal aspect, life of a
>>> dataset.  I see it as a contribution for the astronomer to
>>> evaluate the pertinence of a discovered data set with respect to
>>> scientific criteria.
>>> 
>>> I hope this helps to clarify the confusion between #progenitor and
>>> #calibration.
> 
> Well... we'll have to clearly say what we want in the vocabulary, so
> if we don't do VEP-006, we'll have to do something else.  Let's see
> where we stand:
> 
> Do you agree in principle that it is desirable to tell apart dark
> frames already applied from dark frames to be applied based on my
> above argument on how datalink semantics should help filter things
> necessary for some kind of action on #this?
> 
> In case you still have doubts there: Consider again that datalink
> rows are intended to be read as RDF triples.  This means a file
> master-flat.fits can be #this in one document.
> 
> In the next document, the datalink file for raw-data.fits, it would
> be #flat by the current proposal.
> 
> Finally, in reduced-data.fits, the same file master-flat.fits now
> receives the semantics #progenitor.
> 
> This is to illustrate that the datalink semantics is *not* a property
> of a file in access_url nor one of #this.  It is a *relation* between
> #this and the thing at access_url; it's absolutely ok for an
> artefact's semantics to change from datalink file to datalink file.
> 
> Having said all that, the question is what to do with #calibration.
> If you can't stand VEP-006, I'd be fine with making #calibration a
> child of #progenitor -- from my brief survey a while ago it would
> seem #calibration isn't so heavily used today that we'd break many
> datalink documents.
> 
> But then it would be clear that #calibration is *not* "data you can use
> to calibrate some raw data", and I think at some point we'll want
> that as a concept, in particular as people bring raw data to the VO
> (which they now can do a lot better, since there is Datalink).
> 
> On the other hand, given the choice I'd tend to follow what the
> original author (Pat) had in mind as the meaning of the term -- and
> that's what VEP-006 is proposing.
> 
> Accepting that, your additional use case ("distinguish 'science data'
> and 'calibration data' among the progenitors") could sensibly be addressed
> using children of #progenitor.  I'd even write the VEPs for those
> myself if you help me with clear and testable definitions of "science
> data" and "calibration data" -- and have an example where these would
> then be used.
> 
> 
> Thanks,
> 
>           Markus


