WasDerivedFrom vs. WasGeneratedBy

Mon Oct 23 21:04:17 CEST 2017

Hi Hugo, DM,

thanks a lot for your use case and explanations! It's so great that 
people from different projects are joining in the discussion. That's 
really helpful.

> 1) Separate derivation and application of calibration parameters.
> 
> Attached is a version of Kristin's astrometry example that is similar in 
> idea to Markus' suggestion: there is an extra entity containing the 
> astrometric solution. The draw.io version: 
> https://drive.google.com/file/d/0BzoBp7N7YV9JZzVJOW9qVmlrWjA/view?usp=sharing

> It makes sense to see the derivation of a calibration parameter as a 
> separate activity from its application, and consider the calibration 
> parameter as a separate entity. This separation was very useful for KiDS 
> for many reasons, e.g. reusing the calibration parameters. 

Oh right, reusing the calibration parameters is a good idea. I hadn't 
thought that far.

> Splitting up 
> such calibration steps in two would also provide a practical resolution 
> to many problems that wasDerivedFrom was introduced for.

> A (semi-)automated tool that traverses the provenance graph could for 
> example follow 'the pixels' and ignore non-pixel entities.

So the tool would need to have the possibility to distinguish between 
entities of different kinds (image/log/...), e.g. by using the attribute 
"category" (of EntityDescription).

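To make this a bit more concrete, here is a minimal sketch in Python of 
how such a traversal might look. The dictionary layout, the entity names 
and the category values are made up for illustration; only the idea of a 
"category" attribute comes from the draft.

```python
# Hypothetical toy provenance graph: each entity has a category and a list
# of progenitor entities (what the generating activity used).
ENTITIES = {
    "coadd": {"category": "image", "progenitors": ["calibrated"]},
    "calibrated": {"category": "image", "progenitors": ["raw", "astrom_solution"]},
    "raw": {"category": "image", "progenitors": []},
    "astrom_solution": {"category": "parameters", "progenitors": ["raw", "ref_catalog"]},
    "ref_catalog": {"category": "catalog", "progenitors": []},
}


def follow_category(entities, start, category):
    """Collect progenitors of `start`, stepping only through entities of `category`."""
    found = []
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for name in entities[current]["progenitors"]:
            # Skip entities already visited and entities of another kind.
            if name in seen or entities[name]["category"] != category:
                continue
            seen.add(name)
            found.append(name)
            stack.append(name)
    return found


# "Follow the pixels": only image entities are traversed.
pixel_chain = follow_category(ENTITIES, "coadd", "image")
print(pixel_chain)
```

The astrometric solution and the reference catalogue are skipped here, 
because neither is of category "image".
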
> 2) Add some domain knowledge to the model and the tools.
> 
> Much of the provenance DM working draft is not specific to astronomy at 
> all, and rightly so. However, this is an astronomy document, and the 
> question of 'what is the main progenitor' cannot be answered without 
> astronomical knowledge.
> 
> One could add a bit of domain knowledge to the data model and the tool: 
> include in the entity-descriptions that the raw-entity and WCS- and 
> flat-entities are of 'different' kinds, e.g. 'science' and 
> 'calibration'. Then the tool could just follow only the 'science' entities.

Yeah, I guess that's the point where a common vocabulary to define what 
kind of entities exist would be really useful.

> We used this mechanism in KiDS where it was successful. Our provenance 
> graphs for a single coadd have literally millions of entities, but we 
> can still navigate them easily by ignoring 'calibration' data by 
> default. That is, tools will consider a flat as a progenitor, but will 
> not traverse the progenitors of the flat itself unless explicitly asked to.

That's interesting. We invented ProvDAL in order to have a service that 
can return (serialized) provenance information for a given entity. We 
were trying to make some sensible choices about what data users expect 
to get back when asking for the provenance. Ignoring 'calibration' (in the 
sense of not tracking progenitors of a flat field or other auxiliary 
data) would be very useful indeed.
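
As a rough sketch of that behaviour (not how ProvDAL or the KiDS tools 
actually work, which I don't know in detail): calibration entities are 
reported as progenitors, but their own progenitors are only followed on 
request. The graph, the names and the `expand_calibration` flag are 
assumptions for illustration.

```python
from collections import deque

# Hypothetical graph: entity name -> (kind, list of progenitor names)
GRAPH = {
    "coadd": ("science", ["reduced"]),
    "reduced": ("science", ["raw", "master_flat"]),
    "raw": ("science", []),
    "master_flat": ("calibration", ["raw_flat_1", "raw_flat_2"]),
    "raw_flat_1": ("calibration", []),
    "raw_flat_2": ("calibration", []),
}


def progenitors(graph, start, expand_calibration=False):
    """List progenitors of `start`; calibration entities are listed but
    not traversed further unless `expand_calibration` is True."""
    result = []
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for name in graph[current][1]:
            if name in seen:
                continue
            seen.add(name)
            result.append(name)
            kind = graph[name][0]
            if kind == "science" or expand_calibration:
                queue.append(name)
    return result


default_view = progenitors(GRAPH, "coadd")
full_view = progenitors(GRAPH, "coadd", expand_calibration=True)
print(default_view)
print(full_view)
```

In the default view the master flat shows up, but the raw flats it was 
built from do not.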

> This knowledge does not have to be part of the provenance data model 
> itself though. Related to the above, 'having pixels' is already domain 
> knowledge. Caveat: one person's calibration data is another person's 
> science data.

True enough. I think at least the distinction between an 'image' and 
parameters can be made safely.

> 3) The main-auxiliary distinction will become incredibly messy.
> 
> Here are some other examples where it is hard to define the main and 
> auxiliary progenitor.
> 
> Forced photometry: say you have a deep r-band image with perfect 
> astrometry and a shallow u-band image and want r-u colors. Then you can 
> use the r-band source positions to measure the flux in the u-band. Now 
> what is the main progenitor? My conclusion is the r-band image (or 
> catalog) because you've added knowledge to that main dataset by adding 
> information from the auxiliary dataset (similar as with flat-fielding). 
> However, one could also argue the other way around: the u-band image is 
> the progenitor because most of the information comes from that image.
>
> Environment quantification (similar to the above): say one has a catalog 
> of interesting galaxies and another catalog with 'all' galaxies. Now 
> this second catalog is used to quantify the environment of the first set 
> of galaxies (e.g. by counting near neighbors or so). Now what is the 
> main progenitor? Again the first catalog in my opinion.
> 
> I'm sure many people disagree with my assessments, that's the point.

It is allowed to have more than one 'main progenitor'; i.e. 
wasDerivedFrom can point back to more than just one progenitor entity. A 
very simple example is the composition of three images into an RGB 
image: here all three input images are equally important, and thus the 
composite is derived from each of them.

> 4) There are no unimportant activities.
> 
> The problem of indicating the 'main' progenitor will not be solved by 
> wasDerivedFrom, as indicated above. But it does introduce a problem: now 
> a tool will have to follow both wasGeneratedBy /and/ wasDerivedFrom, 
> because apparently wasDerivedFrom is not a subset of wasGeneratedBy + 
> Used because of 'empty' activities.

> The other reason for wasDerivedFrom is to hide/bypass unimportant 
> activities. This doesn't make sense to me. Every action should be in the 
> model, even if it is just a transformation of the data. Even the most 
> unimportant step can turn out to be very relevant but impossible to 
> reproduce if not properly modeled.

Okay, we could decide that wasDerivedFrom is only allowed to be used on 
top of an existing used/wasGeneratedBy relationship to improve this.
But then it's really just an optional addition, and then Markus's 
argument comes into play: don't use optional stuff if you don't have to.
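
Such a rule could at least be checked mechanically. A minimal sketch, 
assuming the relations are available as plain (hypothetical) tuples: a 
wasDerivedFrom link counts as supported only if some activity generated 
the derived entity and used the source entity. The RGB composite serves 
as the example, with one deliberately unsupported link ("img_x").

```python
# Relations as plain tuples; all names are illustrative only.
was_generated_by = {("rgb", "compose")}  # (entity, activity)
used = {  # (activity, entity)
    ("compose", "img_r"),
    ("compose", "img_g"),
    ("compose", "img_b"),
}
was_derived_from = {  # (derived entity, source entity)
    ("rgb", "img_r"),
    ("rgb", "img_g"),
    ("rgb", "img_b"),
    ("rgb", "img_x"),  # no matching used/wasGeneratedBy pair
}


def unsupported_derivations(was_derived_from, was_generated_by, used):
    """Return wasDerivedFrom links not backed by a used/wasGeneratedBy pair."""
    bad = []
    for derived, source in sorted(was_derived_from):
        # All activities that generated the derived entity.
        activities = {a for e, a in was_generated_by if e == derived}
        if not any((a, source) in used for a in activities):
            bad.append((derived, source))
    return bad


problems = unsupported_derivations(was_derived_from, was_generated_by, used)
print(problems)
```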

So, well, if no one else has a use case where wasDerivedFrom is 
desperately needed, I think we can remove it for now. We could still 
include it in a version 1.1 of the model, if the need arises.

> *) Conclusion
> 
> In a direct Dutch way: From my perspective 'wasDerivedFrom' is often not 
> necessary (point 1, 2, 4), impossible to get right (1, 3), cannot be 
> trusted (3) and introduces complexity (4).
> 
> It seems my mail and especially the conclusion can be interpreted 
> negatively, that was not the intent. The goal was to be constructive, by 
> sharing experiences, so we can have a great provenance model. Your idea 
> behind provenance and experiences might differ from mine, so please use 
> the information above how it best suits you and proceed how you think 
> is best.

I'm curious and I'd like to make use of your experience and ask some 
more questions:
What does the provenance look like when you retrieve it via your tools? 
I.e. for a given processed image, using your tools and Astrowise, what 
does the user get? Just a list of entities? Or parameters for the 
activities?
It's all stored in a database, right? But users don't do direct database 
queries, do they?

Would it be useful for you to exchange the retrieved provenance metadata 
with other tools/services? What kind of exchange format would you 
prefer? (E.g. one of the W3C serialisation formats PROV-JSON etc. or 
would you prefer something else?)
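
To illustrate what I mean by PROV-JSON, here is a small sketch that builds 
such a document as a Python dictionary and serialises it. The "ex" 
namespace and all identifiers are made up, and the layout follows my 
understanding of the PROV-JSON submission, so please check it against 
the W3C document before relying on it.

```python
import json

# Minimal PROV-JSON-style document; namespace and identifiers are hypothetical.
doc = {
    "prefix": {"ex": "http://example.org/"},
    "entity": {
        "ex:raw_image": {"prov:label": "raw image"},
        "ex:calibrated_image": {"prov:label": "calibrated image"},
    },
    "activity": {
        "ex:calibration": {"prov:label": "apply astrometric solution"},
    },
    "used": {
        "_:u1": {"prov:activity": "ex:calibration", "prov:entity": "ex:raw_image"},
    },
    "wasGeneratedBy": {
        "_:g1": {
            "prov:entity": "ex:calibrated_image",
            "prov:activity": "ex:calibration",
        },
    },
}

# Serialise for exchange with another tool or service.
serialized = json.dumps(doc, indent=2, sort_keys=True)
print(serialized)
```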

Hmmm... maybe we should have one of the next provenance work group 
meetings in the Netherlands. :-)

One more question for one of your points:
You are saying "There are no unimportant activities." and I get your 
point here. Would you say the same for entities?
Or are there activities for which the intermediate entities are unimportant?
For example, imagine a pipeline where you want to mention the substeps 
and all their parameters explicitly, but the intermediate image is not 
stored (permanently), so it does not make much sense to create an 
entity for it. How do you model this?

Cheers,

Kristin

-- 
-------------------------------------------------------
Dr. Kristin Riebe
Leibniz-Institut für Astrophysik Potsdam (AIP)