New ProvenanceDM working draft released, part I

Tue Oct 17 09:25:03 CEST 2017

Hi Kristin, Hi DM,

I'd like to dwell a bit more on the relationship between
wasGeneratedBy/Used and WasDerivedFrom:

On Mon, Oct 16, 2017 at 10:50:13PM +0200, Kristin Riebe wrote:
> > > > happen.  Are you absolutely sure you can't fix WasGeneratedBy/Used to
> > > > cover what WasDerivedFrom is designed to do and then drop
> > > > WasDerivedFrom?
> > > [...]
> > > The main difference is: not every input of an activity that generated an
> > > entity will automatically have a "WasDerivedFrom", it's semantically
> > > different. E.g. an image is usually "derived from" another image, but not
> > > from "auxiliary" input like a configuration file or a parameter (which were
> > > also used as input from the generating activity).
> > 
> > Hm -- why not?  I don't really see a use case to treat them
> > differently: Dependency modeling, debugging, giving credit -- isn't
> > for all of these the configuration file pretty much the same as the
> > raw instrument output (say)?
> 
> Well, I think that 'wasDerivedFrom' is meant to be used to just give you the
> main track, i.e. the main progenitors. So I expect a wasDerivedFrom
> relationship only to those input files of the generating activity that are
> the main inputs. E.g. if an image is corrected using a dark frame, then the
> image was derived from the raw image, not from the dark frame. But the raw
> image and dark frame are both inputs.

Well, but how exactly are they different?  When using Provenance,

* if it's about debugging, a problem might equally well result from
  an issue in the dark frame or the raw image (or perhaps an
  interesting interaction between border cases in both).
* if it's about dependency modeling, the output will have to be
  re-made whether it's the dark frame or the raw image that's
  changed.
* if it's about giving credit, I'd argue if a dark frame (or, say, a
  superflat) is done with enough deliberation that credit is given on
  it in the first place, then this should be preserved in further
  products, too.

So, I'd argue the notion of "main inputs" and "main outputs" is not
only *really* hard to define reproducibly, I also don't see a use
case in which that distinction actually helps.  And hence I'd propose
it should be dropped, with the useful by-product of removing one
of the reasons for having wasDerivedFrom in the first place.
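
To make the dependency-modeling point concrete, here is a minimal
sketch in Python.  It is not ProvenanceDM syntax; the dictionaries
and the entity and activity names (calibrate, raw_image, dark_frame,
config) are made up for illustration.  It only assumes that used and
wasGeneratedBy are recorded, and shows that a rebuild check treats
every input alike:

```python
# Hypothetical toy provenance store; all names are illustrative only.
# used: activity -> set of entities it used
used = {"calibrate": {"raw_image", "dark_frame", "config"}}
# was_generated_by: entity -> activity that generated it
was_generated_by = {"calibrated_image": "calibrate"}


def inputs_of(entity):
    """Return every entity used by the activity that generated `entity`."""
    activity = was_generated_by.get(entity)
    if activity is None:
        return set()
    return set(used.get(activity, ()))


def needs_rebuild(entity, changed):
    """Return True if any direct or transitive input of `entity` changed."""
    seen = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for inp in inputs_of(current):
            if inp in changed:
                return True
            if inp not in seen:
                seen.add(inp)
                stack.append(inp)
    return False
```

In this sketch needs_rebuild("calibrated_image", {"dark_frame"}) and
needs_rebuild("calibrated_image", {"raw_image"}) both come out true;
nothing in the check needs to know which input was the "main" one.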

> Here's another use case for wasDerivedFrom:
> Imagine that you have two input images i1, i2 and two result images o1 and
> o2 for an activity. The wasDerivedFrom relationship can then tell you that
> o1 was derived from i1 and o2 from i2 (and not o1 from i2 or so). So it's
> adding more information.

Well, if i1 and i2 are both inputs to the activity, I'd expect both
of them to contribute to both o1 and o2, and so it would be wrong to
say that o2 only depends on i2, no?  If, on the other hand, the
activity can be decomposed so that o2 really only depends on i2 and
o1 only on i1, then I submit you should model it as two activities.
Either way, wasDerivedFrom is either misleading or superfluous.
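
A small sketch of the "two activities" variant, again in Python with
invented names (process_1, process_2) and a toy data layout rather
than anything the draft prescribes.  Under the assumption that each
output gets its own activity, what wasDerivedFrom would state can be
computed from used and wasGeneratedBy alone:

```python
# Hypothetical toy records; activity names are made up for illustration.
# One activity per output, each using only the input it really depends on.
used = {"process_1": {"i1"}, "process_2": {"i2"}}
was_generated_by = {"o1": "process_1", "o2": "process_2"}


def derived_from(entity):
    """Compute the would-be wasDerivedFrom targets of `entity`.

    These are simply the entities used by the activity that
    generated `entity`; no separate relation is stored.
    """
    activity = was_generated_by.get(entity)
    if activity is None:
        return set()
    return set(used.get(activity, ()))
```

Here derived_from("o1") yields only i1 and derived_from("o2") only
i2, so the extra information comes from the finer-grained activities
and not from a separate relation.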

Sorry for being a bit obnoxious here, but I expect once we get
Provenance right, it's going to crop up everywhere.  Hence, if we do
too much premature optimization and special-casing, that's going to
hurt in all these places.  But granted, so will a lack of pragmatism.

      -- Markus