Astro-WISE perspective on provenance, Was: WasDerivedFrom vs. WasGeneratedBy

Hugo Buddelmeijer hugo at buddelmeijer.nl
Fri Nov 3 15:11:37 CET 2017


Dear Kristin, DM,

Some replies to your questions, showing my general view on provenance (and
data models); maybe it is useful to you as another perspective. It is not
really applicable to the Provenance DM draft directly; I might get back to
that later.

On Mon, Oct 23, 2017 at 9:04 PM, Kristin Riebe <kriebe at aip.de> wrote:

> Hi Hugo, DM,
>
> thanks a lot for your use case and explanations! It's so great that people
> from different projects are joining in the discussion. That's really
> helpful.
>

> I'm curious and I'd like to make use of your experience and ask some more
> questions:
>
> What does the provenance look like when you retrieve it via your tools?
> I.e. for a given processed image, using your tools and Astrowise, what does
> the user get? Just a list of entities? Or parameters for the activities?
> It's all stored in a database, right? But users don't do direct database
> queries, do they?
>

The provenance is an integral part of the system (Astro-WISE) so normally
there is no specific action a user takes to 'get' the provenance. The
default interface is through Python: every data product corresponds to a
Python object with (lazy) properties that refer to its dependencies. Also,
the activity is implicit in the entity: that is, each Python object that
represents a data product has a make() method that (re)creates the data
that corresponds to the product by (re)processing the dependencies. All
parameters are also properties of the object.
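
To make that concrete, here is a minimal sketch of the pattern. All class,
attribute and id names are made up for illustration and are not the actual
Astro-WISE API; the dictionary merely stands in for the database.

```python
from functools import cached_property

# Stand-in for the database: object id -> stored attributes (hypothetical).
DATABASE = {
    "raw-1": {"pixels": [110.0, 120.0, 130.0]},
    "bias-1": {"pixels": [100.0, 100.0, 100.0]},
}
LOOKUPS = []  # records which dependencies were actually fetched


class Frame:
    """A stored frame, fetched from the fake database by id."""

    def __init__(self, object_id):
        LOOKUPS.append(object_id)
        self.object_id = object_id
        self.pixels = DATABASE[object_id]["pixels"]


class ReducedFrame:
    """Data product whose dependencies and parameters are plain attributes."""

    def __init__(self, raw_id, bias_id, gain=1.0):
        self.raw_id = raw_id
        self.bias_id = bias_id
        self.gain = gain      # processing parameter, kept on the object
        self.pixels = None    # filled in by make()

    @cached_property
    def raw(self):
        # Lazy: the dependency is only resolved on first access.
        return Frame(self.raw_id)

    @cached_property
    def bias(self):
        return Frame(self.bias_id)

    def make(self):
        """(Re)create the pixel data by (re)processing the dependencies."""
        self.pixels = [
            self.gain * (r - b)
            for r, b in zip(self.raw.pixels, self.bias.pixels)
        ]
        return self.pixels


reduced = ReducedFrame("raw-1", "bias-1", gain=2.0)
print(LOOKUPS)         # [] because nothing has been fetched yet
print(reduced.make())  # [20.0, 40.0, 60.0]
print(reduced.raw.object_id, reduced.bias.object_id, reduced.gain)
```

The activity never appears as a separate object here: it is simply the
make() method of the entity, and its parameters are attributes.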

The other main interface we have is a web-based database viewer that also
links all objects to their dependencies through normal html hyperlinks.
Users can enter free-form SQL there as well, chaining dependencies through
table joins. (Normally, users would at most alter SQL that was
automatically generated, not type it from scratch.)
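
As a rough illustration of chaining dependencies through joins, the sketch
below uses an in-memory SQLite database. The table and column names are
invented for this example and do not reflect the real Astro-WISE schema
(or its database backend).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: each reduced frame points at its dependencies.
conn.executescript("""
    CREATE TABLE raw_frame (id INTEGER PRIMARY KEY, filename TEXT);
    CREATE TABLE bias_frame (id INTEGER PRIMARY KEY, filename TEXT);
    CREATE TABLE reduced_frame (
        id INTEGER PRIMARY KEY,
        filename TEXT,
        raw_id INTEGER REFERENCES raw_frame(id),
        bias_id INTEGER REFERENCES bias_frame(id)
    );
    INSERT INTO raw_frame VALUES (1, 'raw_001.fits');
    INSERT INTO bias_frame VALUES (1, 'bias_001.fits');
    INSERT INTO reduced_frame VALUES (1, 'reduced_001.fits', 1, 1);
""")

# Follow the dependency links of one product with ordinary joins.
rows = conn.execute(
    """
    SELECT red.filename, sci.filename, cal.filename
    FROM reduced_frame AS red
    JOIN raw_frame AS sci ON sci.id = red.raw_id
    JOIN bias_frame AS cal ON cal.id = red.bias_id
    WHERE red.filename = ?
    """,
    ("reduced_001.fits",),
).fetchall()

print(rows)  # [('reduced_001.fits', 'raw_001.fits', 'bias_001.fits')]
```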

> Would it be useful for you to exchange the retrieved provenance metadata
> with other tools/services? What kind of exchange format would you prefer?
> (E.g. one of the W3C serialisation formats PROV-JSON etc. or would you
> prefer something else?)
>

That is actually why I'm on this list :-). I've written some crappy XML
serializations for some proof of concept work (using SAMP), but that was
not sufficient and didn't follow any real standard. So I'm here to learn
more.

There is one preference I have though. In Astro-WISE there is no real
difference between a workflow to create a new data product and the
provenance of an existing data product. A to-be-created data product is
just like a created one, except that its make() method has not been called
yet (recursively if necessary). So what I'd like is a mechanism that
(somehow) supports this workflow-provenance duality. For example, you could
then easily reuse the provenance of an existing data product to create a
new data product (after changing a parameter or so).
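
A small sketch of what I mean by this duality, under the assumption that a
product is described by its dependencies and parameters plus optional data.
The Product class, its fields and the toy "scaled sum" processing are
hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Product:
    """Hypothetical product: one description serves as workflow and provenance."""
    name: str
    dependencies: Tuple["Product", ...] = ()
    scale: float = 1.0            # example processing parameter
    data: Optional[float] = None  # None means "not made yet"

    def make(self) -> "Product":
        """Return a made copy, making unmade dependencies recursively."""
        if self.data is not None:
            return self  # already made: nothing to do
        if not self.dependencies:
            raise ValueError(f"{self.name} has no data and no dependencies")
        made = tuple(dep.make() for dep in self.dependencies)
        # Toy processing step: a scaled sum of the inputs.
        value = self.scale * sum(dep.data for dep in made)
        return replace(self, dependencies=made, data=value)


raw = Product("raw", data=10.0)
bias = Product("bias", data=2.0)

plan = Product("reduced", dependencies=(raw, bias))  # a workflow
existing = plan.make()                               # now it is provenance

# Reuse the provenance as a new workflow after changing one parameter.
retry = replace(existing, scale=2.0, data=None).make()

print(plan.data, existing.data, retry.data)  # None 12.0 24.0
```

The only difference between 'plan' and 'existing' is whether the data is
there, which is the whole point.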

(If it were up to me, I would not use past tense like 'wasDerivedFrom',
'wasGeneratedBy' and 'used', but nouns like 'progenitor', 'generator' and
'dependency'; that way the same terminology can be used for provenance as
well as workflows. But this is just cosmetics and philosophy.)



> Hmmm... maybe we should have one of the next provenance work group
> meetings in the Netherlands. :-)
>

That would be great. We are not that active in the IVOA at the moment, so
on the one hand such a meeting would be a good opportunity to get us more
involved, but on the other hand it might be hard to create the momentum to
actually organize it.



> One more question for one of your points:
> You are saying "There are no unimportant activities." and I get your point
> here. Would you say the same for entities?
> Or are there activities for which the intermediate entities are
> unimportant?
> For example, imagine a pipeline, where you want to mention the substeps and
> all its parameters explicitly, but the intermediate image is not stored
> (permanently) and thus it makes not much sense to create an entity for it.
> How do you model this?


What we did with KiDS (and other data in Astro-WISE) is to combine steps
together if we didn't want to keep intermediate data. E.g. we have a
'ReducedScienceFrame' that is created with a single activity, that has as
input the RawScienceFrame and all relevant calibration data,
MasterFlatFrame, BiasFrame, IlluminationCorrection, etc. That is, the
activities are scoped such that we'd always want to keep the resulting
entities (and we do so).

My personal opinion is that this approach of combining things together is a
mistake, exactly because of the provenance. I'd prefer to have separate
activities and entities for all the intermediate steps, and have those
objects stored in the database, but only store the actual pixels when
desirable. The pixels can be (re)generated if necessary because full
provenance is available: storing pixels simply becomes a useful
optimization.
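
A minimal sketch of that idea, assuming every intermediate step is kept as
an object while the pixels are only cached when asked for. The Step class
and the toy processing functions are hypothetical.

```python
class Step:
    """Hypothetical intermediate product: always recorded, pixels optional."""

    def __init__(self, name, func, inputs=(), keep_pixels=False):
        self.name = name
        self.func = func             # the activity that produces the pixels
        self.inputs = tuple(inputs)  # upstream Step objects (the provenance)
        self.keep_pixels = keep_pixels
        self._pixels = None          # optional cache of the actual data
        self.runs = 0                # how often the activity was executed

    def pixels(self):
        """Return the pixels, regenerating them from provenance if needed."""
        if self._pixels is not None:
            return self._pixels
        self.runs += 1
        result = self.func(*(step.pixels() for step in self.inputs))
        if self.keep_pixels:
            self._pixels = result
        return result


raw = Step("raw", lambda: [110.0, 120.0], keep_pixels=True)
debiased = Step("debiased", lambda p: [x - 100.0 for x in p], inputs=[raw])
flattened = Step(
    "flattened", lambda p: [x / 2.0 for x in p],
    inputs=[debiased], keep_pixels=True,
)

print(flattened.pixels())  # [5.0, 10.0]
print(flattened.pixels())  # same result, served from the cache
print(debiased.runs, flattened.runs)  # 1 1
```

The 'debiased' step stays a first-class object with full provenance even
though its pixels are never stored; asking for them later just reruns it.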

What I ultimately would like is to have the tools be able to combine/split
entities/activities automatically. E.g. that zoomed out you'd only see the
major branches of the provenance graph, and that branches split into
smaller and smaller activities and entities if you zoom in. (Where this
'zooming' and 'splitting/combining' would not just be a representational
thing, but actually represents how the system works internally.) Some day
I'll write this down, it doesn't have to be hard :-).
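
Just to sketch the viewing half of this (not the part where the system
itself works this way internally), assuming activities can be nested. The
Activity class and the step names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Activity:
    """Hypothetical activity that may be composed of smaller activities."""
    name: str
    substeps: List["Activity"] = field(default_factory=list)

    def view(self, zoom: int) -> List[str]:
        """List the activities visible at a zoom level (0 = coarsest)."""
        if zoom == 0 or not self.substeps:
            return [self.name]
        names: List[str] = []
        for step in self.substeps:
            names.extend(step.view(zoom - 1))
        return names


pipeline = Activity("Reduce", [
    Activity("Calibrate", [
        Activity("SubtractBias"),
        Activity("DivideFlat"),
    ]),
    Activity("CorrectIllumination"),
])

print(pipeline.view(0))  # ['Reduce']
print(pipeline.view(1))  # ['Calibrate', 'CorrectIllumination']
print(pipeline.view(2))  # ['SubtractBias', 'DivideFlat', 'CorrectIllumination']
```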

Cheers,