<div dir="ltr"><div><div>Hi Markus, Kristin, rest,<br><br></div><div>Thanks for the working draft! A provenance model is very important and exactly something that the IVOA can/should help standardize.<br><br>The &#39;wasDerivedFrom&#39; discussion triggered my interest (as one of many things), so here are some thoughts about it. I have not yet read the full document, but hope to do so soon, so I may misunderstand some things.<br></div><div><br></div><div>It seems that the main problem &#39;wasDerivedFrom&#39; tries to solve is distinguishing &#39;the main progenitor&#39; from &#39;auxiliary progenitors&#39;. Ultimately this is an impossible problem, as Markus indicated, but it can be made somewhat tractable with domain knowledge.<br><br>Here are my 2 cents as someone who spent quite some time thinking about

this w.r.t. the Kilo Degree Survey (KiDS) processing in Astro-WISE. I&#39;m struggling a bit with how to organize my thoughts in this mail; let me try short sections on what I learned in the past.<br><br></div><div><br></div><div>1) Separate derivation and application of calibration parameters.<br></div><div><br>Attached is a version of Kristin&#39;s astrometry example that is similar in idea to Markus&#39; suggestion: there is an extra entity containing the astrometric solution. The <a href="http://draw.io">draw.io</a> version: <a href="https://drive.google.com/file/d/0BzoBp7N7YV9JZzVJOW9qVmlrWjA/view?usp=sharing">https://drive.google.com/file/d/0BzoBp7N7YV9JZzVJOW9qVmlrWjA/view?usp=sharing</a><br><br>It makes sense to see the derivation of a calibration parameter as a separate activity from its application, and to consider the calibration parameter as a separate entity. This separation was very useful for KiDS for many reasons, e.g. it allowed reusing the calibration parameters. Splitting such calibration steps in

two would also provide a practical resolution to many problems that wasDerivedFrom was introduced for.<br><br></div><div>A (semi-)automated tool that traverses the provenance graph could for example follow &#39;the pixels&#39; and ignore non-pixel entities. Or the other way around: it could ignore entities that are merely parameters. This will not solve the flat-field example, but the problem is much more ill-defined there; see e.g. Markus&#39; arguments.<br><br><br></div><div>2) Add some domain knowledge to the model and the tools.<br></div><div><br><div>Much of the provenance DM working draft is not specific to astronomy at all, and rightly so. However, this is an astronomy document, and the question of &#39;what is the main progenitor&#39; cannot be answered without astronomical knowledge.<br></div><br>One could add a bit of domain knowledge to the data model and the tool: include in the entity descriptions that the raw entity and the WCS and flat entities are of &#39;different&#39; kinds, e.g. &#39;science&#39; and &#39;calibration&#39;. Then the tool could follow only the &#39;science&#39; entities.<br><br></div><div>We used this mechanism in KiDS, where it was successful. Our provenance graphs for a single coadd have literally millions of entities, but we can still navigate them easily by ignoring &#39;calibration&#39; data by default. That is, tools will consider a flat as a progenitor, but will not traverse the progenitors of the flat itself unless explicitly asked to.<br></div><div><br>This knowledge does not have to be part of the provenance data model itself though. Related to the above, &#39;having pixels&#39; is already domain knowledge. 
Caveat: one person&#39;s calibration data is another person&#39;s science data.<br><br></div><div><br></div><div>3) The main-auxiliary distinction will become incredibly messy.<br></div><div><br></div><div>Here are some other examples where it is hard to define the main and auxiliary progenitor.<br><br></div><div>Forced photometry: say you have a deep r-band image with perfect astrometry and a shallow u-band image, and you want r-u colors. Then you can use the r-band source positions to measure the flux in the u-band. Now what is the main progenitor? My conclusion is the r-band image (or catalog), because you&#39;ve added knowledge to that main dataset by adding information from the auxiliary dataset (similar to flat-fielding). However, one could also argue the other way around: the u-band image is the main progenitor because most of the information comes from that image.<br></div><div><br></div><div>Environment quantification (similar to the above): say one has a catalog of interesting galaxies and another catalog with &#39;all&#39; galaxies. This second catalog is used to quantify the environment of the first set of galaxies (e.g. by counting near neighbors). Now what is the main progenitor? Again the first catalog, in my opinion.<br><br>I&#39;m sure many people will disagree with my assessments; that&#39;s the point.<br><br><br></div>4) There are no unimportant activities.<br><div><br></div><div>The problem of indicating the &#39;main&#39; progenitor will not be solved by wasDerivedFrom, as indicated above. But it does introduce a problem: a tool will now have to follow both wasGeneratedBy /and/ wasDerivedFrom, because apparently wasDerivedFrom is not a subset of wasGeneratedBy + Used, due to &#39;empty&#39; activities. <br><br></div><div>The other reason for wasDerivedFrom is to hide/bypass unimportant activities. This doesn&#39;t make sense to me. Every action 

should be in the model, even if it is just a transformation of the data. Even the most unimportant step can turn out to be very relevant, but impossible to reproduce if not properly modeled. <br><br>It&#39;s trivial to add those extra steps and to navigate them using proper tools. The benefit of wasDerivedFrom does not seem to outweigh the extra complexity in the document, at least for this particular goal.<br></div><br><div><br></div><div>*) Conclusion<br></div><div><br>In a direct Dutch way: from my perspective &#39;wasDerivedFrom&#39; is often not necessary (points 1, 2, 4), impossible to get right (1, 3), cannot be trusted (3) and introduces complexity (4).<br><br>My mail, and especially the conclusion, could be interpreted negatively; that was not the intent. The goal was to be constructive by sharing experiences, so that we can have a great provenance model. Your ideas about provenance and your experiences might differ from mine, so please use the information above however it best suits you and proceed as you think best.<br></div><br></div><div>I&#39;ll read the entire document soon because it is a heroic effort to model provenance.<br></div><div><br></div><div>Hugo<br></div><div><br><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Oct 19, 2017 at 12:53 AM, Markus Demleitner <span dir="ltr">&lt;<a href="mailto:msdemlei@ari.uni-heidelberg.de" target="_blank">msdemlei@ari.uni-heidelberg.de</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi DM,<br>

<br>

On Tue, Oct 17, 2017 at 02:04:44PM -0700, Tim Jenness wrote:<br>

&gt; As another data point, LSST will have the ability to attach a WCS<br>

&gt; to a raw image that is derived by looking at 1000 processed images.<br>

&gt; We will be tracking the provenance of that WCS and its inputs and<br>

&gt; have to attach it to the raw data as provenance. If someone asks<br>

&gt; for &quot;all the inputs&quot; they are not really going to want all 1000<br>

&gt; processed images. They need those to exactly reproduce the<br>

&gt; processed image they will generate from that updated raw image but<br>

&gt; it&#39;s clearly distinct in the provenance tree.<br>

&gt;<br>

&gt; To be more concrete, if you now coadd two images that came from raw<br>

&gt; data that had WCS derived from 1000 other images, when someone says<br>

&gt; &quot;what went into that coadd&quot; they probably mean the two parent<br>

&gt; images and possibly the two raw data files.<br>

<br>

But isn&#39;t the provenance structure in this case something like (notation<br>

contrived, roles suppressed in this graph -- imagine labels on the<br>

vertices if you will)<br>

<br>

rawim2001 -- Photoproc ----- im2001 -,<br>

              /                       \<br>

  Flatfield and such                   \<br>

              \                         \<br>

rawim2002 -- Photoproc ----- im2002 ---- Coaddition --- coadd10001<br>

                                        /<br>

im1   --,                              /<br>

...   ----- Calibration -- wcs -------/<br>

im1000--/     /<br>

        sextractor conf<br>

<br>

So, if you just look at the immediate operation of the co-addition,<br>

you&#39;ll succinctly see that there were two reduced images and a WCS<br>

calibration coming in.  Only when you&#39;re interested in where that<br>

calibration comes from you see the 1000 images, as it should be, and<br>

just as you don&#39;t see the raw images as sources in the coaddition if<br>

the stacking was performed on flatfielded and darkframed images.<br>

<br>

Similarly, in Ole&#39;s example:<br>

<br>

On Tue, 17 Oct 2017 11:24:57 +0200, Ole Streicher wrote:<br>

<br>

&gt; To give you a real-world use case, which is kind-of debugging: Someone<br>

&gt; detects an &quot;interesting structure&quot; on a science-ready exposure, and to<br>

&gt; be sure he wants to process the raw image with his own, alternative<br>

&gt; pipeline (which may or may not need the same kind of calibration). Then<br>

&gt; he has to find out &quot;which is *the* raw image that I need to process?&quot;,<br>

&gt; and the answer is wasDerivedFrom (maybe recursively).<br>

<br>

I argue it&#39;s more straightforward to inspect the photo processing<br>

activity and figure out what the input with the role &quot;raw image&quot; was.<br>

After all, you might just as well suspect that the flat for this day<br>

was flawed and you&#39;d just like to drop in yesterday&#39;s flat, or that<br>

any other gear in the provenance chain is at fault, and you might<br>

just as well want to replace that.<br>

<br>

Sure, you&#39;ll have to define roles in this world for all inputs to all<br>

activities, but I&#39;m sure you want that anyway.<br>

<span class="HOEnZb"><font color="#888888"><br>

          -- Markus<br>

</font></span></blockquote></div><br></div>