WasDerivedFrom vs. WasGeneratedBy

Tue Oct 24 22:35:09 CEST 2017

Hi all,

thanks for the ongoing discussions; it's really good for the model to have
comments based on experience and additional use cases / examples to think
about. I note that beyond the model, there is also a "good practice" that
should go with the model. We had long discussions on this during our
provenance meetings, but they were definitely not conclusive enough.

I see a great benefit in having a provenance model with fewer
"options" that points users to one main good way to track provenance.
Developing an astronomy-specific vocabulary is something we discussed too, and
from this discussion it seems we now have more elements to do that.

On the WasDerivedFrom relation, it never really occurred to me that it was
a way to point to the "main" progenitor(s). As you said in this discussion,
this is impossible to get right: selecting the main progenitor depends on
astronomy-specific roles, and on the user (calibration products can
be science products for someone else). I see it more as a way to hide an
activity. Of course, we identified the redundancy of this relation in our
discussions. In the end, the reason to keep this relation was not based on
strong arguments; it is simply that it exists in the W3C model and it seemed
cheap to implement. However, I find it a bit messy and definitely
misleading. Good practice would be to expose and decompose all activities,
even a simple conversion or copy activity, so we are still lacking
a good use case for WasDerivedFrom that would justify keeping it in the
model.
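
To make the redundancy concrete, here is a minimal sketch in plain Python
(no provenance library). All identifiers (raw_image, convert_to_fits,
fits_image) and the helper function are hypothetical, chosen only for
illustration. It exposes a simple conversion as an explicit activity and
shows that the candidate "derived from" pairs can be read off the
Used/WasGeneratedBy relations. Note the simplification: in W3C PROV a
Used/WasGeneratedBy pair does not by itself assert a derivation, so these
are only candidates.

```python
# Relations stored as plain (subject, object) pairs; all names are hypothetical.
used = {("convert_to_fits", "raw_image")}               # activity Used entity
was_generated_by = {("fits_image", "convert_to_fits")}  # entity WasGeneratedBy activity


def candidate_derivations(used, was_generated_by):
    """Return (derived_entity, source_entity) pairs linked through one activity."""
    return {
        (entity, source)
        for entity, activity in was_generated_by
        for using_activity, source in used
        if using_activity == activity
    }


derivations = candidate_derivations(used, was_generated_by)
print(derivations)
```

With the activity exposed like this, a separate WasDerivedFrom link between
fits_image and raw_image would carry no extra information.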

For the WasInformedBy relation, you explained that it is simply a shortcut
for Used/WasGeneratedBy, possibly to hide intermediate entities... but from
this discussion, and from earlier discussions in our meetings, I think this
would be "bad practice". One should clearly define the activities; if the
intermediate entity is not relevant, then the flow of activities may not be
right. However, there is a possible use case for this relation: imagine an
activity that simply has no generated entities, but that is necessary to
start another activity. For example, before observing, we first initialize
the camera of a telescope (or say we have a set_filter activity), and only
then can we run the acquisition. We could say that the set_filter activity
informed the acquisition activity that it can start. We should thus decide
whether a dummy entity (result_status of the set_filter activity) should exist
or whether we keep the WasInformedBy relation.
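
The two options can be sketched side by side in plain Python (again no
provenance library; the entity name set_filter_status and the helper
function are hypothetical, and the helper assumes each entity has at most
one generating activity). Option B, with the dummy entity, contains enough
information to recover the direct link of option A, but not the other way
around.

```python
# Option A: a direct link between the two activities.
was_informed_by = {("acquisition", "set_filter")}  # (informed, informant)

# Option B: a dummy status entity sits between the two activities.
was_generated_by = {("set_filter_status", "set_filter")}  # entity WasGeneratedBy activity
used = {("acquisition", "set_filter_status")}             # activity Used entity


def implied_communication(used, was_generated_by):
    """Return (informed, informant) activity pairs implied by a shared entity."""
    generator = dict(was_generated_by)  # entity -> generating activity
    return {
        (activity, generator[entity])
        for activity, entity in used
        if entity in generator
    }


recovered = implied_communication(used, was_generated_by)
print(recovered == was_informed_by)
```

So the choice is mainly about whether we accept an entity with no real
content in exchange for having one relation fewer in the model.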

More on those relations here, by the way, with a sentence defining the
meaning of each:
https://www.w3.org/ns/prov#W

Note that there are relations like WasEndedBy or wasStartedBy (with a
trigger entity), and also a wasInfluencedBy relation and a wasRevisionOf
relation, that we don't cover but that could add useful features (and
complexity!)

Cheers,
Mathieu

2017-10-23 16:04 GMT-03:00 Kristin Riebe <kriebe at aip.de>:

> Hi Hugo, DM,
>
> thanks a lot for your use case and explanations! It's so great that people
> from different projects are joining in the discussion. That's really
> helpful.
>
> 1) Separate derivation and application of calibration parameters.
>>
>> Attached is a version of Kristin's astrometry example; it is similar in idea
>> to Markus' suggestion: there is an extra entity containing the astrometric
>> solution. The draw.io <http://draw.io> version:
>> https://drive.google.com/file/d/0BzoBp7N7YV9JZzVJOW9qVmlrWjA
>> /view?usp=sharing
>>
>
> It makes sense to see the derivation of a calibration parameter as a
>> separate activity from its application, and consider the calibration
>> parameter as a separate entity. This separation was very useful for KiDS
>> for many reasons, e.g. reusing the calibration parameters.
>>
>
> Oh right, reusing the calibration parameters is a good idea. I hadn't
> thought that far.
>
> Splitting up such calibration steps in two would also provide a practical
>> resolution to many problems that wasDerivedFrom was introduced for.
>>
>
> A (semi-)automated tool that traverses the provenance graph could for
>> example follow 'the pixels' and ignore non-pixel entities.
>>
>
> So the tool would need to have the possibility to distinguish between
> entities of different kinds (image/log/...), e.g. by using the attribute
> "category" (of EntityDescription).
>
> 2) Add some domain knowledge to the model and the tools.
>>
>> Much of the provenance DM working draft is not specific to astronomy at
>> all, and rightly so. However, this is an astronomy document, and the
>> question of 'what is the main progenitor' cannot be answered without
>> astronomical knowledge.
>>
>> One could add a bit of domain knowledge to the data model and the tool:
>> include in the entity-descriptions that the raw-entity and WCS- and
>> flat-entities are of 'different' kinds, e.g. 'science' and 'calibration'.
>> Then the tool could just follow only the 'science' entities.
>>
>
> Yeah, I guess that's the point where a common vocabulary to define what
> kind of entities exist would be really useful.
>
> We used this mechanism in KiDS where it was successful. Our provenance
>> graphs for a single coadd have literally millions of entities, but we can
>> still navigate them easily by ignoring 'calibration' data by default. That
>> is, tools will consider a flat as a progenitor, but will not traverse the
>> progenitors of the flat itself unless explicitly asked to.
>>
>
> That's interesting. We invented ProvDAL in order to have a service that
> can return (serialized) provenance information for a given entity. We were
> trying to make some sensible choices about what data users expect to get back
> when asking for the provenance. Ignoring 'calibration' (in the sense of not
> tracking progenitors of a flat field or other auxiliary data) would be very
> useful indeed.
>
> This knowledge does not have to be part of the provenance data model
>> itself though. Related to the above, 'having pixels' is already domain
>> knowledge. Caveat: one person's calibration data is another person's science
>> data.
>>
>
> True enough. I think at least the distinction between an 'image' and
> parameters can be made safely.
>
> 3) The main-auxiliary distinction will become incredibly messy.
>>
>> Here are some other examples where it is hard to define the main and
>> auxiliary progenitor.
>>
>> Forced photometry: say you have a deep r-band image with perfect
>> astrometry and a shallow u-band image and want r-u colors. Then you can use
>> the r-band source positions to measure the flux in the u-band. Now what is
>> the main progenitor? My conclusion is the r-band image (or catalog) because
>> you've added knowledge to that main dataset by adding information from the
>> auxiliary dataset (similar as with flat-fielding). However, one could also
>> argue the other way around: the u-band image is the progenitor because most
>> of the information comes from that image.
>>
>>
>
>> Environment quantification (similar to the above): say one has a catalog
>> of interesting galaxies and another catalog with 'all' galaxies. Now this
>> second catalog is used to quantify the environment of the first set of
>> galaxies (e.g. by counting near neighbors or so). Now what is the main
>> progenitor? Again the first catalog in my opinion.
>>
>> I'm sure many people disagree with my assessments, that's the point.
>>
>
> It is allowed to have more than one 'main progenitor'; i.e. wasDerivedFrom
> can point back to more than just one progenitor entity. A very simple
> example is the composition of three images into an RGB image: here all
> three input images are equally important, and thus the composite is derived
> from each of them.
>
> 4) There are no unimportant activities.
>>
>> The problem of indicating the 'main' progenitor will not be solved by
>> wasDerivedFrom, as indicated above. But it does introduce a problem: now a
>> tool will have to follow both wasGeneratedBy /and/ wasDerivedFrom, because
>> apparently wasDerivedFrom is not a subset of wasGeneratedBy + Used because
>> of 'empty' activities.
>>
>
> The other reason for wasDerivedFrom is to hide/bypass unimportant
>> activities. This doesn't make sense to me. Every action should be in the
>> model, even if it is just a transformation of the data. Even the most
>> unimportant step can turn out to be very relevant but impossible to
>> reproduce if not properly modeled.
>>
>
> Okay, we could decide that wasDerivedFrom is only allowed to be used on
> top of an existing used/wasGeneratedBy relationship to improve this.
> But then it's really just an optional addition, and then Markus's argument
> comes into play: don't use optional stuff if you don't have to.
>
> So, well, if no one else is having a use case where wasDerivedFrom is
> desperately needed, I think we can remove it for now. We could still
> include it in a version 1.1 of the model, if the need arises.
>
> *) Conclusion
>>
>> In a direct Dutch way: From my perspective 'wasDerivedFrom' is often not
>> necessary (point 1, 2, 4), impossible to get right (1, 3), cannot be
>> trusted (3) and introduces complexity (4).
>>
>> It seems my mail and especially the conclusion can be interpreted
>> negatively, that was not the intent. The goal was to be constructive, by
>> sharing experiences, so we can have a great provenance model. Your idea
>> behind provenance and experiences might differ from mine, so please use the
>> information above however it best suits you and proceed as you think best.
>>
>
> I'm curious and I'd like to make use of your experience and ask some more
> questions:
> What does the provenance look like when you retrieve it via your tools?
> I.e. for a given processed image, using your tools and Astrowise, what does
> the user get? Just a list of entities? Or parameters for the activities?
> It's all stored in a database, right? But users don't do direct database
> queries, do they?
>
> Would it be useful for you to exchange the retrieved provenance metadata
> with other tools/services? What kind of exchange format would you prefer?
> (E.g. one of the W3C serialisation formats PROV-JSON etc. or would you
> prefer something else?)
>
> Hmmm... maybe we should have one of the next provenance work group
> meetings in the Netherlands. :-)
>
> One more question for one of your points:
> You are saying "There are no unimportant activities." and I get your point
> here. Would you say the same for entities?
> Or are there activities for which the intermediate entities are
> unimportant?
> For example, imagine a pipeline where you want to mention the substeps and
> all their parameters explicitly, but the intermediate image is not stored
> (permanently) and thus it does not make much sense to create an entity for it.
> How do you model this?
>
>
> Cheers,
>
> Kristin
>
> --
> -------------------------------------------------------
> Dr. Kristin Riebe
> Press and Public Outreach
>
> Email: kriebe at aip.de, webmaster at aip.de
> Phone: +49 331 7499-377
> Room:  Bib/3
> -------------------------------------------------------
> Leibniz-Institut für Astrophysik Potsdam (AIP)
> An der Sternwarte 16, D-14482 Potsdam
> Vorstand: Prof. Dr. Matthias Steinmetz, Matthias Winker
> Stiftung bürgerlichen Rechts
> Stiftungsverzeichnis Brandenburg: 26 742-00/7026
> -------------------------------------------------------
>

-- 
Dr. Mathieu Servillat
Laboratoire Univers et Théories, Bât 18, Bur. 221
Observatoire de Paris-Meudon
5 place Jules Janssen
92195 Meudon, France
Tél. +33 1 45 07 74 32
--