IVOA Provenance DM -RFC- answers to comments

Ole Streicher ole at aip.de
Mon Nov 26 14:14:42 CET 2018


Hi Markus, Mathieu,

On 22.11.18 14:59, Markus Demleitner wrote:
> On Wed, Nov 21, 2018 at 05:45:56PM +0100, Mathieu Servillat wrote:
>> you are right, technically they are both progenitors. Now, for a more
>> precise example, let's define an activity that takes as inputs A, B and C
>> and returns as results D and E. The internal calculations are in fact D=A+B
>> and E=C+D. In that case D is *not* derived from C... You may say that the
>> activity is then not well defined, that D is an intermediate result, but
>> this is the real world, and one cannot be forced to do fine grained
>> provenance. This was summarized in the PR by "If there is more than one
> 
> But is it realistic to expect that a data provider who doesn't care
> to take apart these two activities will properly add wasDerivedFrom
> relationships?  And even if they do: Is what we gain from this type
> of information worth the cost in added complexity?
> 
> I guess what I'm saying is: In the interest of our future client
> writers, we should try an implementation of the "is A a progenitor of
> B?" functionality (which I'd say is undisputed as an important use
> case) with and without wasDerivedFrom, also testing it with a
> provenance graph with cycles (ideally a way that we can embed it into
> SQL/ADQL engines).  
> 
> If we have that and find wasDerivedFrom is still worth it, I'm fine
> with keeping it in.

In many astronomical use cases, there is an indicator that the inputs of
an activity are not equivalent: the "DATE-OBS" keyword (and similar ones)
in the FITS header. For many activities there is one input whose DATE-OBS
is preserved in the output, while for the others it is dropped. My guess
is that this one input (or these few inputs) is what people usually refer
to as the "main progenitor".

However, IMO this is a specific attribute of the "usage" for that input
and not a special relation. If we can agree on expressing this as a usage
category (no idea for a good name, however), then we could solve the use
case "I want to trace back my exposure to its raw input".
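
To make this concrete, here is a minimal sketch of how a client could
trace back along such a usage category. Everything in it is an
assumption for illustration: the category value "main", the entity and
activity names, and the plain dictionaries standing in for the used and
wasGeneratedBy relations. It also assumes exactly one "main" input per
activity and simply stops otherwise.

```python
# Hypothetical provenance records; all names are illustrative only.
# used: activity -> list of (entity, usage category)
USED = {
    "calibrate": [("raw_exposure", "main"),
                  ("flat", "calibration"),
                  ("bias", "calibration")],
    "stack": [("calibrated_exposure", "main"),
              ("mask", "auxiliary")],
}
# wasGeneratedBy: entity -> activity
WAS_GENERATED_BY = {
    "calibrated_exposure": "calibrate",
    "stacked_image": "stack",
}


def trace_main_progenitors(entity):
    """Follow wasGeneratedBy and the 'main' usages back towards the raw input."""
    chain = []
    seen = {entity}  # guards against cycles in a malformed graph
    current = entity
    while current in WAS_GENERATED_BY:
        activity = WAS_GENERATED_BY[current]
        mains = [e for e, cat in USED.get(activity, []) if cat == "main"]
        # Stop if the main input is missing, ambiguous, or already visited.
        if len(mains) != 1 or mains[0] in seen:
            break
        current = mains[0]
        seen.add(current)
        chain.append(current)
    return chain


print(trace_main_progenitors("stacked_image"))
```

With these records, the trace from "stacked_image" passes through
"calibrated_exposure" and ends at "raw_exposure", without touching
wasDerivedFrom at all.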

Independently of this, both wasDerivedFrom and wasInformedBy have some
uses. A standard one is when the activity is a "Selector": imagine an
Activity that selects (e.g. based on time) the right calibration data
set, and you want to document that in prov. Then the selected Entity
"was derived from" one of its input entities. You can't put that into a
role; it is additional information.

The danger, however, is that this gets mixed up with the "main input"
category for the usage. If we don't make a clear statement here, a client
may have to follow both paths, and may fail even then: think of a science
activity that uses one of two input calibrations for one output, and one
wants to describe that -- how should a client distinguish this from the
"main progenitor" case?

One use could be a shortcut on the description side: assume you have an
ActivityDescription (so, basically a piece of software) that has several
releases. Then the new version "was derived from" the old version.
Of course, you could model an activity ("Update") that took the old
version as input and produced the new version -- but do you really want
that? It would make a query like "give me all Activities that
correspond to scipost/1.0 or its progenitors" much more complicated,
without any gain.
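
As a rough sketch of why the shortcut keeps that query simple: with a
direct wasDerivedFrom link between descriptions, the client only walks
one chain. The earlier version names (scipost/0.9, scipost/0.8), the
activity names and the dictionary layout are made up for illustration.

```python
# Hypothetical records; only "scipost/1.0" appears in the discussion itself.
# wasDerivedFrom between descriptions: newer version -> older version
WAS_DERIVED_FROM = {
    "scipost/1.0": "scipost/0.9",
    "scipost/0.9": "scipost/0.8",
}
# activity -> the ActivityDescription it corresponds to
ACTIVITIES = {
    "act1": "scipost/0.8",
    "act2": "scipost/1.0",
    "act3": "other/2.0",
}


def description_lineage(description):
    """Return the description followed by all its earlier versions."""
    lineage = []
    seen = set()  # guards against a cyclic wasDerivedFrom chain
    while description is not None and description not in seen:
        seen.add(description)
        lineage.append(description)
        description = WAS_DERIVED_FROM.get(description)
    return lineage


def activities_for(description):
    """All activities that correspond to the description or its progenitors."""
    wanted = set(description_lineage(description))
    return sorted(a for a, d in ACTIVITIES.items() if d in wanted)


print(activities_for("scipost/1.0"))
```

With an explicit "Update" activity in between, each step of the walk
would need two hops (wasGeneratedBy, then used) instead of one.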

But this also comes with a danger: one may apply the same logic to
manually maintained calibration files. Removing the "edit" activity
there would require the client to try both the used/wasGeneratedBy path
and the wasDerivedFrom path to find out what the progenitor is, with the
same ambiguities as above (was it a selection or a progenitor?).
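
To illustrate the two paths a client would have to try, here is a small
sketch using Mathieu's example from above (inputs A, B, C; outputs D and
E; D=A+B and E=C+D). The activity name "act", the dictionary layout and
the function names are assumptions; the visited set is what keeps the
walk finite if the graph contains cycles.

```python
from collections import deque

# Mathieu's example; the activity name "act" is made up.
USED = {"act": ["A", "B", "C"]}
WAS_GENERATED_BY = {"D": "act", "E": "act"}
WAS_DERIVED_FROM = {"D": ["A", "B"], "E": ["C", "D"]}


def parents_via_activity(entity):
    """Direct progenitors via wasGeneratedBy followed by used."""
    activity = WAS_GENERATED_BY.get(entity)
    return USED.get(activity, []) if activity else []


def parents_via_derivation(entity):
    """Direct progenitors via wasDerivedFrom only."""
    return WAS_DERIVED_FROM.get(entity, [])


def progenitors(entity, parents):
    """Breadth-first walk collecting every progenitor of the entity."""
    seen = set()
    queue = deque([entity])
    while queue:
        for parent in parents(queue.popleft()):
            if parent not in seen:  # also protects against cycles
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(progenitors("D", parents_via_activity)))
print(sorted(progenitors("D", parents_via_derivation)))
```

The first call reports A, B and C as progenitors of D, the second only
A and B. Neither result tells the client whether a wasDerivedFrom link
was meant as a selection or as the main progenitor.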

--> IMO we should not remove wasDerivedFrom from the model, but make
clear that it provides *additional* information, such as a selection.
And instead of using it to mark the main progenitor, we should put a
category on the "use"(Description) relation.

Best

Ole

