WasDerivedFrom vs. WasGeneratedBy

Fri Oct 20 12:44:09 CEST 2017

Hi Markus, Kristin, rest,

Thanks for the working draft! A provenance model is very important and
exactly something that the IVOA can/should help standardize.

The 'wasDerivedFrom' discussion triggered my interest (among many other
things), so here are some thoughts about that. I have not yet read the full
document, but hope to do so soon, so I may misunderstand some things.

It seems that the main problem 'wasDerivedFrom' tries to solve is
distinguishing 'the main progenitor' from 'auxiliary progenitors'.
Ultimately this is an impossible problem, as Markus indicated, but it can
be made a bit more tractable with domain knowledge.

Here are my 2 cents as someone who spent quite some time thinking about
this w.r.t. the Kilo Degree Survey (KiDS) processing in Astro-WISE. I'm
struggling a bit with how to organize my thoughts in this mail; let me try
short sections on what I learned in the past.

1) Separate derivation and application of calibration parameters.

Attached is a version of Kristin's astrometry example that is similar in
idea to Markus' suggestion: there is an extra entity containing the
astrometric solution. The draw.io version:
https://drive.google.com/file/d/0BzoBp7N7YV9JZzVJOW9qVmlrWjA/view?usp=sharing

It makes sense to see the derivation of a calibration parameter as a
separate activity from its application, and consider the calibration
parameter as a separate entity. This separation was very useful for KiDS
for many reasons, e.g. reusing the calibration parameters. Splitting up
such calibration steps in two would also provide a practical resolution to
many problems that wasDerivedFrom was introduced for.

A (semi-)automated tool that traverses the provenance graph could for
example follow 'the pixels' and ignore non-pixel entities. Or the other way
around: it could ignore entities that are merely some parameters. This will
not solve the flat-field example, but the problem is much more ill-defined
there; see e.g. Markus' arguments.
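
A minimal sketch of such a traversal, assuming the astrometry step is split
into a 'derive' and an 'apply' activity. All entity and activity names and
the has_pixels flag are made up for illustration; none of them come from
the working draft.

```python
# Hypothetical mini provenance graph for an astrometry step split in two:
# deriving the solution and applying it. Names are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    has_pixels: bool  # assumed flag; 'having pixels' is domain knowledge


raw = Entity("raw_image", True)
refcat = Entity("reference_catalog", False)
solution = Entity("astrometric_solution", False)
calibrated = Entity("calibrated_image", True)

# wasGeneratedBy: entity -> activity; used: activity -> input entities
was_generated_by = {
    solution: "derive_astrometry",
    calibrated: "apply_astrometry",
}
used = {
    "derive_astrometry": [raw, refcat],
    "apply_astrometry": [raw, solution],
}


def progenitors(entity, pixels_only=False):
    """Return all upstream entities, optionally following only pixel data."""
    found = []
    stack = [entity]
    while stack:
        current = stack.pop()
        activity = was_generated_by.get(current)
        for parent in used.get(activity, []):
            if pixels_only and not parent.has_pixels:
                continue  # skip parameter-like entities such as the solution
            if parent not in found:
                found.append(parent)
                stack.append(parent)
    return found


pixel_chain = [e.name for e in progenitors(calibrated, pixels_only=True)]
everything = sorted(e.name for e in progenitors(calibrated))
print(pixel_chain)
print(everything)
```

With pixels_only=True the tool walks from the calibrated image straight to
the raw image; without it, the solution and the reference catalog show up
as well.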

2) Add some domain knowledge to the model and the tools.

Much of the provenance DM working draft is not specific to astronomy at
all, and rightly so. However, this is an astronomy document, and the
question of 'what is the main progenitor' cannot be answered without
astronomical knowledge.

One could add a bit of domain knowledge to the data model and the tool:
include in the entity descriptions that the raw entity and the WCS and
flat entities are of different kinds, e.g. 'science' and 'calibration'.
Then the tool could follow only the 'science' entities.

We used this mechanism in KiDS where it was successful. Our provenance
graphs for a single coadd have literally millions of entities, but we can
still navigate them easily by ignoring 'calibration' data by default. That
is, tools will consider a flat as a progenitor, but will not traverse the
progenitors of the flat itself unless explicitly asked to.
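
A rough sketch of that default behaviour, under the assumption that every
entity carries a 'kind' label. The entity names and the KIND/PARENTS
tables are invented here and do not reflect the actual Astro-WISE
implementation.

```python
# Hypothetical sketch: entities tagged 'science' or 'calibration'.
KIND = {
    "coadd": "science",
    "reduced_1": "science",
    "reduced_2": "science",
    "raw_1": "science",
    "raw_2": "science",
    "master_flat": "calibration",
    "raw_flat_1": "calibration",
    "raw_flat_2": "calibration",
}

# Direct progenitors (wasGeneratedBy + used collapsed for brevity).
PARENTS = {
    "coadd": ["reduced_1", "reduced_2"],
    "reduced_1": ["raw_1", "master_flat"],
    "reduced_2": ["raw_2", "master_flat"],
    "master_flat": ["raw_flat_1", "raw_flat_2"],
}


def ancestors(name, expand_calibration=False):
    """List progenitors; calibration entities are reported, but their own
    progenitors are only traversed when expand_calibration is True."""
    seen = []
    todo = [name]
    while todo:
        current = todo.pop(0)
        is_calibration = KIND[current] == "calibration"
        if current != name and is_calibration and not expand_calibration:
            continue  # keep the flat in the result, do not look behind it
        for parent in PARENTS.get(current, []):
            if parent not in seen:
                seen.append(parent)
                todo.append(parent)
    return seen


default_view = ancestors("coadd")
full_view = ancestors("coadd", expand_calibration=True)
print(default_view)
print(full_view)
```

The default view lists the master flat but none of the raw flats behind
it; asking for the flat itself, or setting expand_calibration=True, opens
up that branch.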

This knowledge does not have to be part of the provenance data model itself
though. Related to the above, 'having pixels' is already domain knowledge.
Caveat: one person's calibration data is another person's science data.

3) The main-auxiliary distinction will become incredibly messy.

Here are some other examples where it is hard to define the main and
auxiliary progenitor.

Forced photometry: say you have a deep r-band image with perfect astrometry
and a shallow u-band image and want r-u colors. Then you can use the r-band
source positions to measure the flux in the u-band. Now what is the main
progenitor? My conclusion is the r-band image (or catalog) because you've
added knowledge to that main dataset by adding information from the
auxiliary dataset (similar to flat-fielding). However, one could also
argue the other way around: the u-band image is the progenitor because most
of the information comes from that image.

Environment quantification (similar to the above): say one has a catalog of
interesting galaxies and another catalog with 'all' galaxies. Now this
second catalog is used to quantify the environment of the first set of
galaxies (e.g. by counting near neighbors or so). Now what is the main
progenitor? Again the first catalog in my opinion.

I'm sure many people disagree with my assessments; that's the point.

4) There are no unimportant activities.

The problem of indicating the 'main' progenitor will not be solved by
wasDerivedFrom, as indicated above. But it does introduce a problem: a
tool will now have to follow both wasGeneratedBy /and/ wasDerivedFrom,
since 'empty' activities apparently mean that wasDerivedFrom is not a
subset of wasGeneratedBy + Used.

The other reason for wasDerivedFrom is to hide/bypass unimportant
activities. This doesn't make sense to me. Every action should be in the
model, even if it is just a transformation of the data. Even the most
unimportant step can turn out to be very relevant but impossible to
reproduce if not properly modeled.

It's trivial to add those extra steps and to navigate them using proper
tools. The benefit of wasDerivedFrom does not seem to outweigh the extra
complexity in the document, at least for this particular goal.
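
To illustrate the subset argument, here is a small sketch that composes
wasGeneratedBy with used and checks which asserted wasDerivedFrom links
are not backed by a recorded activity. The relation tables and entity
names are invented for this example; only the relation names follow W3C
PROV.

```python
# Hypothetical relation tables, named after the W3C PROV relations.
was_generated_by = {("im2001", "Photoproc1"), ("im_copy", "Copy")}
used = {
    ("Photoproc1", "rawim2001"),
    ("Photoproc1", "flat"),
    ("Copy", "im2001"),
}
# Asserted shortcuts; the second one has no recorded activity behind it.
was_derived_from = {("im2001", "rawim2001"), ("thumb", "im_copy")}


def implied_derivations(generated, used_rel):
    """Compose wasGeneratedBy with used: entity <- activity <- input."""
    return {
        (entity, source)
        for entity, activity in generated
        for act, source in used_rel
        if act == activity
    }


implied = implied_derivations(was_generated_by, used)
unbacked = was_derived_from - implied
print(sorted(implied))
print(sorted(unbacked))
```

Once the 'empty' activity that produced the thumbnail is modelled
explicitly, unbacked becomes empty and a tool only has to follow
wasGeneratedBy and used.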

*) Conclusion

In a direct Dutch way: From my perspective 'wasDerivedFrom' is often not
necessary (point 1, 2, 4), impossible to get right (1, 3), cannot be
trusted (3) and introduces complexity (4).

It seems my mail, and especially the conclusion, can be interpreted
negatively; that was not the intent. The goal was to be constructive by
sharing experiences, so we can have a great provenance model. Your idea
behind provenance and your experiences might differ from mine, so please
use the information above however it best suits you and proceed as you
think best.

I'll read the entire document soon because it is a heroic effort to model
provenance.

Hugo

On Thu, Oct 19, 2017 at 12:53 AM, Markus Demleitner <
msdemlei at ari.uni-heidelberg.de> wrote:

> Hi DM,
>
> On Tue, Oct 17, 2017 at 02:04:44PM -0700, Tim Jenness wrote:
> > As another data point, LSST will have the ability to attach a WCS
> > to a raw image that is derived by looking at 1000 processed images.
> > We will be tracking the provenance of that WCS and its inputs and
> > have to attach it to the raw data as provenance. If someone asks
> > for "all the inputs" they are not really going to want all 1000
> > processed images. They need those to exactly reproduce the
> > processed image they will generate from that updated raw image but
> > it's clearly distinct in the provenance tree.
> >
> > To be more concrete, if you now coadd two images that came from raw
> > data that had WCS derived from 1000 other images, when someone says
> > "what went into that coadd" they probably mean the two parent
> > images and possibly the two raw data files.
>
> But isn't the provenance structure in this case something like (notation
> contrived, roles suppressed in this graph -- imagine labels on the
> vertices if you will)
>
> rawim2001 -- Photoproc ----- im2001 -,
>               /                       \
>   Flatfield and such                   \
>               \                         \
> rawim2002 -- Photoproc ----- im2002 ---- Coaddition --- coadd10001
>                                         /
> im1   --,                              /
> ...   ----- Calibration -- wcs -------/
> im1000--/     /
>         sectractor conf
>
> So, if you just look at the immediate operation of the co-addition,
> you'll succinctly see that there were two reduced images and a WCS
> calibration coming in.  Only when you're interested in where that
> calibration comes from you see the 1000 images, as it should be, and
> just as you don't see the raw images as sources in the coaddition if
> the stacking was performed on flatfielded and darkframed images.
>
> Similarly, in Ole's example:
>
> On Tue, 17 Oct 2017 11:24:57 +0200, Ole Streicher wrote:
>
> > To give you a real-world use case, which is kind-of debugging: Someone
> > detects an "interesting structure" on a science-ready exposure, and to
> > be sure he wants to process the raw image with his own, alternative
> > pipeline (which may or may not need the same kind of calibration). Then
> > he has to find out "which is *the* raw image that I need to process?",
> > and the answer is wasDerivedFrom (maybe recursively).
>
> I argue it's more straightforward to inspect the photo processing
> activity and figure out what the input with the role "raw image" was.
> After all, you might just as well suspect that the flat for this day
> was flawed and you'd just like to drop in yesterday's flat, or that
> any other gear in the provenance chain is at fault, and you might
> just as well want to replace that.
>
> Sure, you'll have to define roles in this world for all inputs to all
> activities, but I'm sure you want that anyway.
>
>           -- Markus
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lsst-coadd-wcs-explicit.png
Type: image/png
Size: 44650 bytes
Desc: not available
URL: <http://mail.ivoa.net/pipermail/dm/attachments/20171020/85025744/attachment-0001.png>