New ProvenanceDM working draft released, part I
Kristin Riebe
kriebe at aip.de
Tue Oct 17 22:57:48 CEST 2017
Hi Markus, DM,
so in summary: you argue that all output entities are derived from all
input entities, and it doesn't matter if they are just log files, config
files, whatever.
Well, I imagined the use case that I have an image and just would like
to get all the progenitor images without any other
logs/configs/auxiliary files. Because I just want to reprocess the image
from a certain step onwards, regardless of the previously used parameters.
But yes, even for this case it wouldn't hurt to get the additional
auxiliary files etc. as well. So maybe you're right and wasDerivedFrom
is not needed for this.
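
A minimal sketch of that use case, assuming provenance is available as plain wasGeneratedBy/used records and that every entity carries some type label. The dictionaries, identifiers and the "kind" label are hypothetical and not part of the draft. It only illustrates that the progenitor images can be found by walking used/wasGeneratedBy and filtering, without wasDerivedFrom:

```python
# Hypothetical in-memory provenance records; all names are illustrative only.
was_generated_by = {          # entity -> activity that generated it
    "img_final": "act_stack",
    "img_cal": "act_calibrate",
}
used = {                      # activity -> entities it used as input
    "act_stack": ["img_cal", "stack.cfg"],
    "act_calibrate": ["img_raw", "dark_frame", "calib.log"],
}
kind = {                      # assumed type label attached to each entity
    "img_final": "image",
    "img_cal": "image",
    "img_raw": "image",
    "dark_frame": "calibration",
    "stack.cfg": "config",
    "calib.log": "log",
}


def progenitors(entity, wanted_kind=None):
    """Walk wasGeneratedBy/used backwards and collect all upstream entities.

    If wanted_kind is given, only entities with that label are returned,
    but the walk itself still follows every input.
    """
    found = []
    seen = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        activity = was_generated_by.get(current)
        if activity is None:
            continue  # no recorded generating activity: end of this branch
        for inp in used.get(activity, []):
            if inp in seen:
                continue
            seen.add(inp)
            stack.append(inp)
            if wanted_kind is None or kind.get(inp) == wanted_kind:
                found.append(inp)
    return found


images_only = progenitors("img_final", wanted_kind="image")
everything = progenitors("img_final")
```

Whether such a type label is a reliable substitute for a "main input" notion is exactly the open question; the sketch only shows the mechanics.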
You argue from the provenance discovery point of view: discovery becomes
more difficult if there are multiple ways to get something.
Especially concerning the "short-cut" relationship "wasInformedBy"
(between activities), I usually considered it from the provenance
recording point of view.
WasInformedBy was introduced to facilitate creating provenance metadata.
If it doesn't take too much effort, then people are more willing to
actually do it.
I can hear you already arguing that it's easier for everyone if there is
just one well-defined way to write provenance. ;-)
And I guess even if this looks then more complicated than necessary
(with empty activities/entities here and there), a provenance recording
tool may help to hide this from the user.
So in that sense a pipeline author who wants to record steps of a
pipeline but does not want or need to record any intermediate entities,
could write a function "wasInformedBy" as a shortcut. This function would
then in fact transform the wasInformedBy relationship into a used
relation, a dummy entity and a wasGeneratedBy relation.
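
A rough sketch of what such a helper could look like, assuming provenance records are kept as simple lists of dictionaries. The record layout, the "placeholder" flag and the "dummy:" identifier prefix are made up for illustration and do not come from the draft:

```python
import uuid


def was_informed_by(informed, informant, records):
    """Expand a wasInformedBy shortcut into explicit relations.

    Instead of storing 'informed wasInformedBy informant', this creates a
    placeholder entity that the informant generated and the informed
    activity used. Returns the identifier of the placeholder entity.
    """
    # The dummy entity still needs a unique identifier.
    dummy_id = f"dummy:{uuid.uuid4()}"
    records["entity"].append({"id": dummy_id, "placeholder": True})
    records["wasGeneratedBy"].append({"entity": dummy_id, "activity": informant})
    records["used"].append({"activity": informed, "entity": dummy_id})
    return dummy_id


# Usage: a pipeline author only states that step2 was informed by step1.
records = {"entity": [], "wasGeneratedBy": [], "used": []}
dummy = was_informed_by("step2", "step1", records)
```

The pipeline author never sees the dummy entity, but it does end up in whatever is serialized from the records.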
I'm not sure if I really want to have such dummy entities and activities
(for which we even need to come up with unique identifiers) creeping in
everywhere (in code, serializations and even in TAP responses).
Maybe we should ask some more people writing pipelines how they'd
imagine recording provenance. In fact, I also do not need wasDerivedFrom
or wasInformedBy in my use case, but I'd like to have the opinion of
some more people before we really decide to drop them.
Cheers,
Kristin
On 17.10.2017 at 09:25, Markus Demleitner wrote:
> Hi Kristin, Hi DM,
>
> I think I'd like to dwell on the relationship between
> wasGeneratedBy/Used and WasDerivedFrom a bit more in that:
>
> On Mon, Oct 16, 2017 at 10:50:13PM +0200, Kristin Riebe wrote:
>>>>> happen. Are you absolutely sure you can't fix WasGeneratedBy/Used to
>>>>> cover what WasDerivedFrom is designed to do and then drop
>>>>> WasDerivedFrom?
>>>> [...]
>>>> The main difference is: not every input of an activity that generated an
>>>> entity will automatically have a "WasDerivedFrom", it's semantically
>>>> different. E.g. an image is usually "derived from" another image, but not
>>>> from "auxiliary" input like a configuration file or a parameter (which were
>>>> also used as input from the generating activity).
>>>
>>> Hm -- why not? I don't really see a use case to treat them
>>> differently: Dependency modeling, debugging, giving credit -- isn't
>>> for all of these the configuration file pretty much the same as the
>>> raw instrument output (say)?
>>
>> Well, I think that 'wasDerivedFrom' is meant to be used to just give you the
>> main track, i.e. the main progenitors. So I expect a wasDerivedFrom
>> relationship only to those input files of the generating activity that are
>> the main inputs. E.g. if an image is corrected using a dark frame, then the
>> image was derived from the raw image, not from the dark frame. But the raw
>> image and dark frame are both inputs.
>
> Well, but how exactly are they different? When using Provenance,
>
> * if it's about debugging, a problem might equally well result from
> an issue in the dark frame or the raw image (or perhaps an
> interesting interaction between border cases in both).
> * if it's about dependency modeling, the output will have to be
> re-made whether it's the dark frame or the raw image that's
> changed.
> * if it's about giving credit, I'd argue if a dark frame (or, say, a
> superflat) is done with enough deliberation that credit is given on
> it in the first place, then this should be preserved in further
> products, too.
>
> So, I'd argue the notion of "main inputs" and "main outputs" is not
> only *really* hard to define reproducibly, I also don't see a use
> case in which that distinction actually helps. And hence I'd propose
> it should be dropped, with the useful by-product of removing one
> thing that brings in wasDerivedFrom.
>
>> Here's another use case for wasDerivedFrom:
>> Imagine that you have two input images i1, i2 and two result images o1 and
>> o2 for an activity. The wasDerivedFrom relationship can then tell you that
>> o1 was derived from i1 and o2 from i2 (and not o1 from i2 or so). So it's
>> adding more information.
>
> Well, if i1 and i2 are both inputs to the activity, I'd expect both
> of them to contribute to both o1 and o2, and so it would be wrong to
> say that o2 only depends on i1, no? If, on the other hand, the
> activity can be decomposed so that o2 really only depends on i2 and
> o1 only on i1, then I submit you should model it as two activities.
> Either way, wasDerivedFrom is either misleading or superfluous.
>
> Sorry for being a bit obnoxious here, but I expect once we get
> Provenance right, it's going to crop up everywhere. Hence, if we do
> too much premature optimization and special-casing, that's going to
> hurt in all these places. But granted, so will a lack of pragmatism.
>
>
> -- Markus
>
--
-------------------------------------------------------
Dr. Kristin Riebe
Press and Public Outreach
Email: kriebe at aip.de, webmaster at aip.de
Phone: +49 331 7499-377
Room: Bib/3
-------------------------------------------------------
Leibniz-Institut für Astrophysik Potsdam (AIP)
An der Sternwarte 16, D-14482 Potsdam
Vorstand: Prof. Dr. Matthias Steinmetz, Matthias Winker
Stiftung bürgerlichen Rechts
Stiftungsverzeichnis Brandenburg: 26 742-00/7026
-------------------------------------------------------