VEP-009: datalink/core#progenitor

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Mon Jul 19 10:55:20 CEST 2021


Dear Semantics, dear DAL,

Here is another VEP for datalink/core, also available at
http://volute.g-vo.org/svn/trunk/projects/semantics/veps/VEP-009.txt.

I'm including DAL as the WG maintaining the "hosting" standard.  I
suggest that replies go to semantics only.


Vocabulary: http://www.ivoa.net/rdf/datalink-core
Author: Mireille Louys <mireille.louys at unistra.fr>, François Bonnarel <francois.bonnarel at astro.unistra.fr> with slight editorial fixes by Markus Demleitner
Date: 2021-07-19

Term: #progenitor
Action: Modification
Label: Progenitor
Description: Pre-existing science data that were used to create this dataset

Used-in: http://archive.eso.org/datalink/links?ID=ivo://eso.org/ID?ADP.2020-06-16T18:05:22.868


Rationale: The modification involves only the term description. The
previous definition was "data resources that were used to create this
dataset (e.g. input raw data)", which was ambiguous because it could
encompass any dataset or resource used to produce #this, including
calibration data. Following the usage of terms in the workflow and
provenance domain, the new Description restricts progenitor to only the
less advanced science data used to produce #this. By "science data" we
mean a signal directly obtained from observations or simulations of
targeted parts of the sky, or derived from such a signal.  This is
distinguished from calibration(-applied) data, which allow science data
measurements to be transformed into absolute physical units. Distinguishing
progenitor from calibration(-applied) data lets us characterize specific
roles for datasets in the process that produces #this.

The usage example is taken from the ObsCore result
http://archive.eso.org/tap_obs/sync?REQUEST=doQuery&LANG=ADQL&MAXREC=1&QUERY=SELECT+*+FROM+ivoa.ObsCore+WHERE+obs_publisher_did+%3D+%27ivo%3A%2F%2Feso.org%2FID%3FADP.2020-06-16T18%3A05%3A22.868%27
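
The ADQL in that URL is percent-encoded and hard to read. As a minimal
sketch using only the Python standard library, the parameters can be
decoded as follows (the function name tap_parameters is made up for this
illustration and is not part of any TAP client):

```python
from urllib.parse import parse_qs, urlsplit

# The ObsCore URL quoted in the VEP, split over three string literals.
OBSCORE_URL = (
    "http://archive.eso.org/tap_obs/sync?REQUEST=doQuery&LANG=ADQL&MAXREC=1"
    "&QUERY=SELECT+*+FROM+ivoa.ObsCore+WHERE+obs_publisher_did+%3D+"
    "%27ivo%3A%2F%2Feso.org%2FID%3FADP.2020-06-16T18%3A05%3A22.868%27"
)


def tap_parameters(url):
    """Return the query-string parameters of a URL as a plain dict.

    Hypothetical helper: it keeps only the first value of each parameter.
    """
    # parse_qs turns '+' into spaces and undoes the %XX escapes.
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}


params = tap_parameters(OBSCORE_URL)
print(params["QUERY"])
```

The decoded query selects the single ObsCore row whose obs_publisher_did
matches the ID used in the Used-in link.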

In this case #this (the record in ObsCore) is an IFS cube. The
progenitors are KMOS observations
(https://www.eso.org/sci/facilities/paranal/instruments/kmos.html), as
shown in the DataLink response.
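
To make concrete how a client might pick progenitors out of such a
response, here is a rough sketch that groups link rows by their semantics
column. The rows below are invented for illustration (example.org URLs);
they are NOT taken from the ESO service, and a real client would first
parse the VOTable the links endpoint returns. The function name
links_by_semantics is likewise an assumption.

```python
from collections import defaultdict

# Hypothetical link rows, reduced to two of the Datalink columns.
# These are made-up values, not the content of the ESO response.
SAMPLE_LINKS = [
    {"semantics": "#this", "access_url": "https://example.org/cube.fits"},
    {"semantics": "#progenitor", "access_url": "https://example.org/raw-1.fits"},
    {"semantics": "#progenitor", "access_url": "https://example.org/raw-2.fits"},
    {"semantics": "#calibration", "access_url": "https://example.org/flat.fits"},
]


def links_by_semantics(rows):
    """Map each semantics value to the list of access URLs carrying it."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["semantics"]].append(row["access_url"])
    return dict(grouped)


grouped = links_by_semantics(SAMPLE_LINKS)
for url in grouped.get("#progenitor", []):
    print(url)
```

Whatever definition of #progenitor is adopted decides which rows a
publisher puts into that group; the client-side lookup stays the same.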



=======================================

In the interest of cutting down on the number of mails, let me say
inline here that I fairly strongly oppose this VEP (I'd give it an
8+ on my pain level scale
https://blog.g-vo.org/building-consensus/#scale).

The main reason is that current #progenitor has well-defined
semantics ("earlier in the provenance tree") and well-defined
pragmatics ("give me stuff I need for debugging").  Redefining
well-defined and plausibly useful concepts (that we will therefore
want to reintroduce later because the pragmatics don't go away)
simply is something I would really like to avoid.

Sure, people might not like the identifier #progenitor or the label
"Progenitor", saying they think that's something other than "Earlier
in the provenance tree".

As to the identifier, I'd say we'll just have to apologise with "it's
legacy" (if people *really* are unhappy with that form).  That's a lot
better than changing existing concepts and then juggling new ones
meaning what the old ones meant.  

As to the label, that's easily changeable ("Part of Provenance",
perhaps?).  I'll happily assist in drawing up a VEP on that.


Now, one *could* salvage the VEP by changing it into introducing a
concept "Rawer data in the provenance tree specifically obtained for
#this" (or something like that). I'd be a lot less concerned about
that (I'd give it a 6- at this point).  I'd still like to see some
improvements to this VEP turned into an addition request for
something like *#rawer:

(a) *much* better definition.  The main trouble here is to say what
"science data" is.  Is a background simulation in particle
astrophysics science data?  What about, say, a best fitting model in
a gravitational wave experiment?  When I'm making a superflat, are
the flats entering into it "science data" for that superflat?  And
what if I, as I think I'll do in general if I have the raw data, have
a datalink for the raw data with its #calibration?  Is that
#progenitor although it contains what I think VEP-009 could call
non-science data?

In the existing definition, I'm also not sure what the "Pre-existing"
is intended to mean -- does it include or exclude certain subsets of
Datalink's universe of discourse?  If not, let's drop it.

(b) clear pragmatics.  I suspect that's closely related to (a): What
should *a machine* do with such *#rawer items as opposed to other
things in the provenance tree?   The use case here is debugging, and
when I'm doing that I'll probably manually go through the
descriptions of the items in #progenitor anyway.  How could a
computer better assist me in that if it knows the difference between
"science data" and... well, other things?  Or are there separate use
cases to consider (attribution?)?

(c) improved rationale/used-in: The used-in here indeed illustrates
the usage of #progenitor with what appears to be "science data" in
some intuitive sense.  For something as profound as changing a
concept, however, I'd prefer to see a case where the change actually
makes a difference, i.e., where some links drop out of #progenitor,
or where their current concept intersects with #progenitor, or
whatever, in short: I'd like to understand what exactly this is
trying to fix.


Actually, I'm starting to suspect we're quarrelling about things that
nobody has even found useful yet.  I went through my own uses of
#progenitor, and in ~10 different data collections not one actually
cared to include what I think François and Mireille would consider
non-science data.  I really suspect that when people have rawer data
and calibration files, they'd rather stick all of that into a
datalink document of its own and then reference that from the reduced
datalink.

So... does anyone even publish "non-science data from the provenance
tree" in the calibrated data's datalink at this point?  What do they
think?

           -- Markus

