<div dir="ltr"><span style="font-size:12.8px">Hi all,</span><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">thanks for the ongoing discussion; it's really good for the model to have comments based on experience and additional use cases / examples to think about. I note that beyond the model, there is also a "good practice" that should go with the model. We had long discussions on this during our provenance meetings, but they were definitely not conclusive enough.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">I see a great benefit in having a provenance model with fewer "options" that points users to one main good way to track the provenance. Developing an astronomy-specific vocabulary is something we discussed too, and from the discussion it seems we have more elements to do that now.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">On the WasDerivedFrom relation, it never really occurred to me that it was a way to point to the "main" progenitor(s). As you said in this discussion, this is impossible to get right: selecting the main progenitor depends on astronomy-specific roles, and depends on the user (calibration products can be science products for someone else). I see it more as a way to hide an activity. Of course, we identified the redundancy of this relation in our discussion. In the end, the reason to keep this relation was not based on strong arguments; it is simply because it exists in the W3C PROV model and it seemed cheap to implement. However, I find it a bit messy and definitely misleading. 
Good practice would be to expose and decompose all activities, even if it is a simple conversion or copy activity, so we are still lacking a good use case for WasDerivedFrom that would justify keeping it in the model.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">For the WasInformedBy relation, you explained that it was simply a shortcut for Used/WasGeneratedBy, possibly to hide intermediate entities... but from this discussion, and from earlier discussions in our meetings, I think this would be "bad practice". One should clearly define the activities; if the intermediate entity is not relevant, then the flow of activities may not be right. However, there is a possible use case for this relation: imagine an activity that simply has no generated entities, but that is necessary to start another activity. For example, before observing, we first initialize the camera of a telescope (or say we have a set_filter activity), and only then can we run the acquisition. We could say that the set_filter activity informed the acquisition activity that it can start. 
We should thus decide if a dummy entity (result_status of the set_filter activity) should exist or if we keep the WasInformedBy relation.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">More on those relations here by the way, with a sentence defining their meaning:</div><div style="font-size:12.8px"><a href="https://www.w3.org/ns/prov#W" target="_blank">https://www.w3.org/ns/prov#W</a><br></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Note that there are relations like WasEndedBy or wasStartedBy (with a trigger entity), and also a wasInfluencedBy relation and a wasRevisionOf relation, that we don't cover but that could add useful features (and complexity!)</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Cheers,</div><div style="font-size:12.8px">Mathieu</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px"><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">2017-10-23 16:04 GMT-03:00 Kristin Riebe <span dir="ltr">&lt;<a href="mailto:kriebe@aip.de" target="_blank">kriebe@aip.de</a>&gt;</span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Hugo, DM,<br>
<br>
thanks a lot for your use case and explanations! It's so great that people from different projects are joining in the discussion. That's really helpful.<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">
1) Separate derivation and application of calibration parameters.<br>
<br></span>
Attached is a version of Kristin's astrometry example that is similar in idea to Markus' suggestion: there is an extra entity containing the astrometric solution. The <a href="http://draw.io" rel="noreferrer" target="_blank">draw.io</a> version: <a href="https://drive.google.com/file/d/0BzoBp7N7YV9JZzVJOW9qVmlrWjA/view?usp=sharing" rel="noreferrer" target="_blank">https://drive.google.com/file/<wbr>d/0BzoBp7N7YV9JZzVJOW9qVmlrWjA<wbr>/view?usp=sharing</a><br>
</blockquote><span class="">
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
It makes sense to see the derivation of a calibration parameter as a separate activity from its application, and consider the calibration parameter as a separate entity. This separation was very useful for KiDS for many reasons, e.g. reusing the calibration parameters. <br>
</blockquote>
<br></span>
Oh right, reusing the calibration parameters is a good idea. I hadn't thought that far.<span class=""><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Splitting up such calibration steps in two would also provide a practical resolution to many problems that wasDerivedFrom was introduced for.<br>
</blockquote>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
A (semi-)automated tool that traverses the provenance graph could for example follow 'the pixels' and ignore non-pixel entities.<br>
</blockquote>
<br></span>
So the tool would need to be able to distinguish between entities of different kinds (image/log/...), e.g. by using the attribute "category" (of EntityDescription).<span class=""><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
2) Add some domain knowledge to the model and the tools.<br>
<br>
Much of the provenance DM working draft is not specific to astronomy at all, and rightly so. However, this is an astronomy document, and the question of 'what is the main progenitor' cannot be answered without astronomical knowledge.<br>
<br>
One could add a bit of domain knowledge to the data model and the tool: include in the entity descriptions that the raw entity and the WCS and flat entities are of 'different' kinds, e.g. 'science' and 'calibration'. Then the tool could follow only the 'science' entities.<br>
</blockquote>
<br></span>
Yeah, I guess that's the point where a common vocabulary to define what kind of entities exist would be really useful.<span class=""><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
We used this mechanism in KiDS where it was successful. Our provenance graphs for a single coadd have literally millions of entities, but we can still navigate them easily by ignoring 'calibration' data by default. That is, tools will consider a flat as a progenitor, but will not traverse the progenitors of the flat itself unless explicitly asked to.<br>
</blockquote>
<br></span>
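The traversal rule described in the quote above can be sketched roughly as follows. This is only an illustrative sketch, not the KiDS or Astro-WISE implementation: the toy graph, the entity names and the <code>trace</code> function are all hypothetical, and the 'science'/'calibration' labels stand in for an entity category attribute.<br>
<pre>
```python
from collections import deque

# Hypothetical toy provenance graph: entity -> direct progenitors.
PROGENITORS = {
    "coadd": ["reduced_1", "reduced_2"],
    "reduced_1": ["raw_1", "flat"],
    "reduced_2": ["raw_2", "flat"],
    "flat": ["raw_flat_1", "raw_flat_2"],
}

# Hypothetical category per entity (assumed vocabulary: science / calibration).
CATEGORY = {
    "coadd": "science",
    "reduced_1": "science",
    "reduced_2": "science",
    "raw_1": "science",
    "raw_2": "science",
    "flat": "calibration",
    "raw_flat_1": "calibration",
    "raw_flat_2": "calibration",
}


def trace(entity, progenitors, category, expand_calibration=False):
    """Collect the progenitors of an entity, breadth-first.

    Calibration entities are reported as progenitors, but their own
    progenitors are only followed when expand_calibration is True.
    """
    seen = []
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        for parent in progenitors.get(current, []):
            if parent in seen:
                continue
            seen.append(parent)
            # Stop here for calibration data unless explicitly asked to go on.
            if category.get(parent) == "calibration" and not expand_calibration:
                continue
            queue.append(parent)
    return seen


default_view = trace("coadd", PROGENITORS, CATEGORY)
full_view = trace("coadd", PROGENITORS, CATEGORY, expand_calibration=True)
```
</pre>
In this sketch the default view lists the flat as a progenitor of the coadd but leaves out the raw flats; they only show up when the expansion is requested.<br>
<br>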
That's interesting. We invented ProvDAL in order to have a service that can return (serialized) provenance information for a given entity. We were trying to make some sensible choices about what data users expect to get back when asking for the provenance. Ignoring 'calibration' (in the sense of not tracking progenitors of a flat field or other auxiliary data) would be very useful indeed.<span class=""><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
This knowledge does not have to be part of the provenance data model itself though. Related to the above, 'having pixels' is already domain knowledge. Caveat: one person's calibration data is another person's science data.<br>
</blockquote>
<br></span>
True enough. I think at least the distinction between an 'image' and parameters can be made safely.<span class=""><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
3) The main-auxiliary distinction will become incredibly messy.<br>
<br>
Here are some other examples where it is hard to define the main and auxiliary progenitor.<br>
<br>
Forced photometry: say you have a deep r-band image with perfect astrometry and a shallow u-band image and want r-u colors. Then you can use the r-band source positions to measure the flux in the u-band. Now what is the main progenitor? My conclusion is the r-band image (or catalog), because you've added knowledge to that main dataset by adding information from the auxiliary dataset (similar to flat-fielding). However, one could also argue the other way around: the u-band image is the progenitor because most of the information comes from that image.<br>
</blockquote>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Environment quantification (similar to the above): say one has a catalog of interesting galaxies and another catalog with 'all' galaxies. Now this second catalog is used to quantify the environment of the first set of galaxies (e.g. by counting near neighbors or so). Now what is the main progenitor? Again the first catalog in my opinion.<br>
<br>
I'm sure many people disagree with my assessments, that's the point.<br>
</blockquote>
<br></span>
It is allowed to have more than one 'main progenitor'; i.e. wasDerivedFrom can point back to more than just one progenitor entity. A very simple example is the composition of three images into an RGB image: here all three input images are equally important, and thus the composite is derived from each of them.<span class=""><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
4) There are no unimportant activities.<br>
<br>
The problem of indicating the 'main' progenitor will not be solved by wasDerivedFrom, as indicated above. But it does introduce a problem: now a tool will have to follow both wasGeneratedBy /and/ wasDerivedFrom, because apparently wasDerivedFrom is not a subset of wasGeneratedBy + Used because of 'empty' activities.<br>
</blockquote>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
The other reason for wasDerivedFrom is to hide/bypass unimportant activities. This doesn't make sense to me. Every action should be in the model, even if it is just a transformation of the data. Even the most unimportant step can turn out to be very relevant but impossible to reproduce if not properly modeled.<br>
</blockquote>
<br></span>
Okay, we could decide that wasDerivedFrom is only allowed to be used on top of an existing used/wasGeneratedBy relationship to improve this.<br>
But then it's really just an optional addition, and then Markus's argument comes into play: don't use optional stuff if you don't have to.<br>
<br>
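The restriction suggested above could be checked mechanically. A minimal sketch, assuming the relations are available as simple pairs; the function name and all records below are hypothetical and not part of the model:<br>
<pre>
```python
# Hypothetical relation records, stored as plain pairs.
WAS_GENERATED_BY = {("reduced", "flatfield_step")}  # (entity, activity)
USED = {
    ("flatfield_step", "raw"),   # (activity, entity)
    ("flatfield_step", "flat"),
}
WAS_DERIVED_FROM = {
    ("reduced", "raw"),           # backed by flatfield_step
    ("reduced", "archive_copy"),  # no activity links these two
}


def unsupported_derivations(was_derived_from, was_generated_by, used):
    """Return wasDerivedFrom pairs that are not backed by an activity.

    A pair (derived, source) counts as backed if some activity both
    generated the derived entity and used the source entity.
    """
    unsupported = []
    for derived, source in sorted(was_derived_from):
        activities = {a for (e, a) in was_generated_by if e == derived}
        if not any((a, source) in used for a in activities):
            unsupported.append((derived, source))
    return unsupported


problems = unsupported_derivations(WAS_DERIVED_FROM, WAS_GENERATED_BY, USED)
```
</pre>
With such a check in place, a tool would only ever need to follow used/wasGeneratedBy, and wasDerivedFrom would act purely as an optional annotation.<br>
<br>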
So, well, if no one else has a use case where wasDerivedFrom is desperately needed, I think we can remove it for now. We could still include it in a version 1.1 of the model, if the need arises.<span class=""><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
*) Conclusion<br>
<br>
In a direct Dutch way: From my perspective 'wasDerivedFrom' is often not necessary (point 1, 2, 4), impossible to get right (1, 3), cannot be trusted (3) and introduces complexity (4).<br>
<br>
It seems my mail, and especially the conclusion, can be interpreted negatively; that was not the intent. The goal was to be constructive, by sharing experiences, so we can have a great provenance model. Your idea behind provenance and your experiences might differ from mine, so please use the information above however it best suits you and proceed as you think best.<br>
</blockquote>
<br></span>
I'm curious and I'd like to make use of your experience and ask some more questions:<br>
What does the provenance look like when you retrieve it via your tools? I.e. for a given processed image, using your tools and Astrowise, what does the user get? Just a list of entities? Or parameters for the activities?<br>
It's all stored in a database, right? But users don't do direct database queries, do they?<br>
<br>
Would it be useful for you to exchange the retrieved provenance metadata with other tools/services? What kind of exchange format would you prefer? (E.g. one of the W3C serialisation formats PROV-JSON etc. or would you prefer something else?)<br>
<br>
Hmmm... maybe we should have one of the next provenance work group meetings in the Netherlands. :-)<br>
<br>
One more question for one of your points:<br>
You are saying "There are no unimportant activities." and I get your point here. Would you say the same for entities?<br>
Or are there activities for which the intermediate entities are unimportant?<br>
For example, imagine a pipeline where you want to mention the substeps and all their parameters explicitly, but the intermediate image is not stored (permanently) and thus it does not make much sense to create an entity for it. How do you model this?<div class="HOEnZb"><div class="h5"><br>
<br>
Cheers,<br>
<br>
Kristin<br>
<br>
-- <br>
------------------------------<wbr>-------------------------<br>
Dr. Kristin Riebe<br>
Press and Public Outreach<br>
<br>
Email: <a href="mailto:kriebe@aip.de" target="_blank">kriebe@aip.de</a>, <a href="mailto:webmaster@aip.de" target="_blank">webmaster@aip.de</a><br>
Phone: <a href="tel:%2B49%20331%207499-377" value="+493317499377" target="_blank">+49 331 7499-377</a><br>
Room: Bib/3<br>
------------------------------<wbr>-------------------------<br>
Leibniz-Institut für Astrophysik Potsdam (AIP)<br>
An der Sternwarte 16, D-14482 Potsdam<br>
Vorstand: Prof. Dr. Matthias Steinmetz, Matthias Winker<br>
Stiftung bürgerlichen Rechts<br>
Stiftungsverzeichnis Brandenburg: 26 742-00/7026<br>
------------------------------<wbr>-------------------------<br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">Dr. Mathieu Servillat<div>Laboratoire Univers et Théories, Bât 18, Bur. 221</div><div>Observatoire de Paris-Meudon<br><div>5 place Jules Janssen</div><div>92195 Meudon, France</div><div>Tél. +33 1 45 07 74 32<br></div><div>--</div></div></div></div></div></div></div></div>
</div>