New ProvenanceDM working draft released, part I

Tue Oct 17 23:04:44 CEST 2017

As another data point, LSST will have the ability to attach a WCS to a raw image, where that WCS is derived by looking at 1000 processed images. We will be tracking the provenance of that WCS and its inputs, and we have to attach it to the raw data as provenance. If someone asks for “all the inputs” they are not really going to want all 1000 processed images. They need those only to reproduce exactly the processed image they will generate from that updated raw image, and that branch is clearly distinct in the provenance tree.

To be more concrete, if you now coadd two images that came from raw data that had WCS derived from 1000 other images, when someone says “what went into that coadd” they probably mean the two parent images and possibly the two raw data files.
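
A minimal sketch of that distinction, not anything LSST actually implements: the in-memory store, the role labels ("main", "calibration"), the entity names and the `progenitors` helper are all made up for illustration. It only shows that "what went into that coadd" gives very different answers depending on whether the query follows every link or only some of them.

```python
from collections import defaultdict, deque

# Hypothetical in-memory provenance store: entity -> list of (input_entity, role).
# The role labels are illustrative assumptions, not part of any data model.
inputs = defaultdict(list)

def record(output, input_, role):
    """Record that `output` was produced using `input_` in the given role."""
    inputs[output].append((input_, role))

def progenitors(entity, roles=None, max_depth=None):
    """Walk back through the inputs of `entity`, breadth first.

    roles: if given, only follow links whose role is in this set.
    max_depth: if given, stop after this many generations.
    """
    seen = set()
    frontier = deque([(entity, 0)])
    while frontier:
        current, depth = frontier.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for parent, role in inputs[current]:
            if roles is not None and role not in roles:
                continue
            if parent not in seen:
                seen.add(parent)
                frontier.append((parent, depth + 1))
    return seen

# The coadd example: two processed images, each made from a raw image
# whose WCS was derived from 1000 other processed images.
N_WCS_INPUTS = 1000
for k in (1, 2):
    record("coadd", f"processed_{k}", "main")
    record(f"processed_{k}", f"raw_{k}", "main")
    record(f"raw_{k}", f"wcs_{k}", "calibration")
    for j in range(N_WCS_INPUTS):
        record(f"wcs_{k}", f"wcs_input_{k}_{j}", "main")

parents = progenitors("coadd", max_depth=1)          # the two parent images
main_track = progenitors("coadd", roles={"main"})    # parents plus raw files
everything = progenitors("coadd")                    # includes all WCS inputs
```

Here `parents` holds the two processed images, `main_track` adds the two raw files, and `everything` also pulls in both WCS solutions and their 2000 inputs.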

-- 
Tim Jenness

> On Oct 17, 2017, at 13:57 , Kristin Riebe <kriebe at aip.de> wrote:
> 
> Hi Markus, DM,
> 
> so in summary: you argue that all output entities are derived from all input entities, and it doesn't matter if they are just log files, config files, whatever.
> 
> Well, I imagined the use case that I have an image and just would like to get all the progenitor images without any other logs/configs/auxiliary files. Because I just want to reprocess the image from a certain step onwards, regardless of the previously used parameters.
> 
> But yes, even for this case it wouldn't hurt to get the additional auxiliary files etc. as well. So maybe you're right and wasDerivedFrom is not needed for this.
> 
> You argue from the provenance discovery point, which becomes more difficult if there are multiple ways to get something.
> Especially concerning the "short-cut" relationship "wasInformedBy" (between activities), I usually considered it from the provenance recording point of view.
> WasInformedBy was introduced to facilitate creating provenance metadata. If it doesn't take too much effort, then people are more willing to actually do it.
> 
> I can hear you already arguing that it's easier for everyone if there is just one well-defined way to write provenance. ;-)
> And I guess even if this looks then more complicated than necessary (with empty activities/entities here and there), a provenance recording tool may help to hide this from the user.
> 
> So in that sense a pipeline author who wants to record steps of a pipeline, but does not want or need to record any intermediate entities, could write a function "wasInformedBy" for optimization. And this function then in fact transforms the wasInformedBy relationship into a used relation, a dummy entity and a wasGeneratedBy relation.
> I'm not sure if I really want to have such dummy entities and activities (for which we even need to come up with unique identifiers) creeping in everywhere (in codes, serializations and even in TAP responses.)
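
The expansion described in the quoted paragraph above could look roughly like the following. This is only a sketch under assumptions: the `ProvenanceRecord` class, its field names and the `dummy-<uuid>` identifier scheme are hypothetical and not taken from the ProvenanceDM draft or from any existing recording tool.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """Hypothetical recorder that stores only used / wasGeneratedBy relations."""
    entities: set = field(default_factory=set)
    used: list = field(default_factory=list)              # (activity, entity)
    was_generated_by: list = field(default_factory=list)  # (entity, activity)

    def was_informed_by(self, informed, informant):
        """Convenience call: expand an activity-to-activity link into
        informant -> dummy entity -> informed, using the two basic relations."""
        dummy = f"dummy-{uuid.uuid4()}"  # assumed identifier scheme
        self.entities.add(dummy)
        self.was_generated_by.append((dummy, informant))
        self.used.append((informed, dummy))
        return dummy

    def informants(self, activity):
        """Recover the activity-to-activity links from the basic relations alone."""
        consumed = {e for a, e in self.used if a == activity}
        return {a for e, a in self.was_generated_by if e in consumed}
```

A pipeline author would call `rec.was_informed_by("coadd_step", "calibration_step")` and never see the placeholder, but the placeholder and its generated identifier still end up in the stored record and in anything serialized from it, which is the concern raised in the quoted text.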
> 
> Maybe we should ask some more people who write pipelines how they would imagine recording provenance. In fact, I also do not need wasDerivedFrom or wasInformedBy in my use case, but I'd like to have the opinion of some more people before we really decide to drop them.
> 
> Cheers,
> Kristin
> 
> 
> On 17.10.2017 at 09:25, Markus Demleitner wrote:
>> Hi Kristin, Hi DM,
>> I think I'd like to dwell on the relationship between
>> wasGeneratedBy/Used and WasDerivedFrom a bit more in that:
>> On Mon, Oct 16, 2017 at 10:50:13PM +0200, Kristin Riebe wrote:
>>>>>> happen.  Are you absolutely sure you can't fix WasGeneratedBy/Used to
>>>>>> cover what WasDerivedFrom is designed to do and then drop
>>>>>> WasDerivedFrom?
>>>>> [...]
>>>>> The main difference is: not every input of an activity that generated an
>>>>> entity will automatically have a "WasDerivedFrom", it's semantically
>>>>> different. E.g. an image is usually "derived from" another image, but not
>>>>> from "auxiliary" input like a configuration file or a parameter (which were
>>>>> also used as input from the generating activity).
>>>> 
>>>> Hm -- why not?  I don't really see a use case to treat them
>>>> differently: Dependency modeling, debugging, giving credit -- isn't
>>>> for all of these the configuration file pretty much the same as the
>>>> raw instrument output (say)?
>>> 
>>> Well, I think that 'wasDerivedFrom' is meant to be used to just give you the
>>> main track, i.e. the main progenitors. So I expect a wasDerivedFrom
>>> relationship only to those input files of the generating activity that are
>>> the main inputs. E.g. if an image is corrected using a dark frame, then the
>>> image was derived from the raw image, not from the dark frame. But the raw
>>> image and dark frame are both inputs.
>> Well, but how exactly are they different?  When using Provenance,
>> * if it's about debugging, a problem might equally well result from
>>   an issue in the dark frame or the raw image (or perhaps an
>>   interesting interaction between border cases in both).
>> * if it's about dependency modeling, the output will have to be
>>   re-made whether it's the dark frame or the raw image that's
>>   changed.
>> * if it's about giving credit, I'd argue if a dark frame (or, say, a
>>   superflat) is done with enough deliberation that credit is given on
>>   it in the first place, then this should be preserved in further
>>   products, too.
>> So, I'd argue the notion of "main inputs" and "main outputs" is not
>> only *really* hard to define reproducibly, I also don't see a use
>> case in which that distinction actually helps.  And hence I'd propose
>> it should be dropped, with the useful by-product of removing one
>> thing that brings in wasDerivedFrom.
>>> Here's another use case for wasDerivedFrom:
>>> Imagine that you have two input images i1, i2 and two result images o1 and
>>> o2 for an activity. The wasDerivedFrom relationship can then tell you that
>>> o1 was derived from i1 and o2 from i2 (and not o1 from i2 or so). So it's
>>> adding more information.
>> Well, if i1 and i2 are both inputs to the activity, I'd expect both
>> of them to contribute to both o1 and o2, and so it would be wrong to
>> say that o2 only depends on i2, no?  If, on the other hand, the
>> activity can be decomposed so that o2 really only depends on i2 and
>> o1 only on i1, then I submit you should model it as two activities.
>> Either way, wasDerivedFrom is either misleading or superfluous.
>> Sorry for being a bit obnoxious here, but I expect once we get
>> Provenance right, it's going to crop up everywhere.  Hence, if we do
>> too much premature optimization and special-casing, that's going to
>> hurt in all these places.  But granted, so will a lack of pragmatism.
>>       -- Markus
> 
> -- 
> -------------------------------------------------------
> Dr. Kristin Riebe
> Press and Public Outreach
> 
> Email: kriebe at aip.de, webmaster at aip.de
> Phone: +49 331 7499-377
> Room:  Bib/3
> -------------------------------------------------------
> Leibniz-Institut für Astrophysik Potsdam (AIP)
> An der Sternwarte 16, D-14482 Potsdam
> Vorstand: Prof. Dr. Matthias Steinmetz, Matthias Winker
> Stiftung bürgerlichen Rechts
> Stiftungsverzeichnis Brandenburg: 26 742-00/7026
> -------------------------------------------------------