New ProvenanceDM working draft released, part I

Kristin Riebe kriebe at aip.de
Mon Oct 16 22:50:13 CEST 2017


Hi Markus, DM,

> I think that would be a good thing altogether;  declaring proper
> licenses might seem a fairly un-academic exercise and pretty much is
> until you want to revive an orphaned dataset and get into trouble
> with your legal department.  Or until you want a part of the data be
> included with <popular software package of your choice>.
> 
> ProvDM would be the natural place to have that.  VOResource 1.1
> already has some language on this, and it'd be great if we could sync
> this between Provenance and Registry.

Okay, we could give it a try. I've put it on the TODO list at the wiki 
page. (Just a reminder, it's 
http://wiki.ivoa.net/twiki/bin/view/IVOA/ObservationProvenanceDataModel.)

>>> happen.  Are you absolutely sure you can't fix WasGeneratedBy/Used to
>>> cover what WasDerivedFrom is designed to do and then drop
>>> WasDerivedFrom?
>> [...]
>> The main difference is: not every input of an activity that generated an
>> entity will automatically have a "WasDerivedFrom", it's semantically
>> different. E.g. an image is usually "derived from" another image, but not
>> from "auxiliary" input like a configuration file or a parameter (which were
>> also used as input from the generating activity).
> 
> Hm -- why not?  I don't really see a use case to treat them
> differently: Dependency modeling, debugging, giving credit -- isn't
> for all of these the configuration file pretty much the same as the
> raw instrument output (say)?

Well, I think that 'wasDerivedFrom' is meant to be used to just give you 
the main track, i.e. the main progenitors. So I expect a wasDerivedFrom 
relationship only to those input files of the generating activity that 
are the main inputs. E.g. if an image is corrected using a dark frame, 
then the image was derived from the raw image, not from the dark frame. 
But the raw image and dark frame are both inputs.
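
As a rough illustration of that distinction (a sketch only; the identifiers are made up and plain tuples stand in for the ProvDM relations, this is not taken from the draft):

```python
# Hypothetical identifiers; tuples stand in for ProvDM relation instances.
used = [("calib", "raw_image"), ("calib", "dark_frame")]   # (activity, entity)
was_generated_by = [("calibrated_image", "calib")]         # (entity, activity)
was_derived_from = [("calibrated_image", "raw_image")]     # (derived, progenitor)


def inputs_of(entity):
    """All entities used by the activity that generated `entity`."""
    activities = {a for e, a in was_generated_by if e == entity}
    return sorted(x for a, x in used if a in activities)


def progenitors_of(entity):
    """Only the entities explicitly declared as progenitors."""
    return sorted(p for d, p in was_derived_from if d == entity)


print(inputs_of("calibrated_image"))       # ['dark_frame', 'raw_image']
print(progenitors_of("calibrated_image"))  # ['raw_image']
```

Both frames show up as inputs, but only the raw image is on the "main track".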

Here's another use case for wasDerivedFrom:
Imagine that you have two input images i1, i2 and two result images o1 
and o2 for an activity. The wasDerivedFrom relationship can then tell 
you that o1 was derived from i1 and o2 from i2 (and not, say, o1 from 
i2). So it adds information that Used/WasGeneratedBy alone do not carry.
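
A small sketch of why the extra relation helps here (names i1, i2, o1, o2 as above; the tuple representation is an assumption made for illustration):

```python
from itertools import product

inputs = ["i1", "i2"]    # entities used by the activity
outputs = ["o1", "o2"]   # entities generated by the activity

# Used/WasGeneratedBy alone only say "all of these went in, all of these
# came out", so every (output, input) pair is a possible derivation.
candidate_pairs = sorted(product(outputs, inputs))

# Explicit wasDerivedFrom links: (derived entity, progenitor)
was_derived_from = {("o1", "i1"), ("o2", "i2")}

# Pairs that the explicit links rule out.
ruled_out = [p for p in candidate_pairs if p not in was_derived_from]

print(len(candidate_pairs))  # 4
print(ruled_out)             # [('o1', 'i2'), ('o2', 'i1')]
```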

>> In principle (i.e. I think it should work, but haven't really tried it in an
>> implementation with realistic data) the "role" attribute to Used and
>> WasGeneratedBy, together with the corresponding links to description classes
>> can be used to express which entity was derived from which progenitor
>> entity, even without the explicit WasDerivedFrom link. But doing it this way
>> would be a huge overhead for those use cases where description classes are
>> not needed.
> 
> I'm not sure I understand how that overhead comes about -- is it
> because you can define dependencies in bulk in the description?
> 
> Perhaps an example might help here?

I need more time to work out a good example. The idea is that you can 
predefine the expected input and output data types for each activity 
using ActivityDescription, EntityDescription and their relations. 
You then know which of the input data is auxiliary data, a config 
file, a dark frame, a raw image, ..., because this is indicated by the 
role attribute of the corresponding used relation. This can help to find 
out which data entity is the progenitor of another, even without the 
wasDerivedFrom relationship.
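
Pending a realistic example, here is a very rough sketch of the idea. Everything in it is hypothetical: the role names, the identifiers, and in particular the `progenitor_role` mapping, which stands in for whatever the description classes would actually declare.

```python
# Assumed to come from the description classes: for a given output role,
# which input role marks the progenitor (hypothetical mapping).
progenitor_role = {"calibrated image": "raw image"}

# (activity, entity, role) for Used; (entity, activity, role) for WasGeneratedBy
used = [
    ("calib", "raw_1", "raw image"),
    ("calib", "dark_1", "dark frame"),
    ("calib", "cfg_1", "config file"),
]
was_generated_by = [("cal_1", "calib", "calibrated image")]


def infer_progenitors(entity):
    """Derive progenitors from roles only, without any wasDerivedFrom link."""
    result = []
    for ent, activity, out_role in was_generated_by:
        if ent != entity:
            continue
        wanted = progenitor_role.get(out_role)
        result += [e for a, e, r in used if a == activity and r == wanted]
    return result


print(infer_progenitors("cal_1"))  # ['raw_1']
```

The lookup works, but only for a client that knows about the role mapping, and it needs the description side to be filled in first.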

>> Similarly: what if you are not interested in the actual processing step, but
>> just want to record that one image was derived from another, without any
>> further information? (e.g. copying process, simple format conversion). If we
>> insist on using the Used/WasGeneratedBy construct always, then even for
>> those simple cases one needs to define "empty" activities, which then
>> blow-up the serialisations.
> 
> True.  But that may be a price worth paying if it streamlines client
> code (that presumably would have to re-introduce the empty activities
> when parsing such declarations, or their code will be rife with
> special cases) and, in particular, query patterns in ProvTAP (where
> two different ways to do the same thing usually require UNION, which
> ADQL 2.0 doesn't have and that's a long way from becoming mandatory
> in any form).

Yeah, see, I always had only the serialization formats like PROV-JSON in 
mind, which could be put into the header of a file to keep the 
provenance information with the data. And I can imagine how ugly that 
serialization would look with empty entities/activities all over the 
place and all the additional relation entries that are then required.
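
To give a feeling for the difference, here is a sketch of the two variants for a simple copy. The identifiers are made up, the "prefix" block is left out for brevity, and the key names follow my reading of the PROV-JSON submission, so please treat them as an assumption rather than a validated serialization.

```python
import json

# Variant A: direct wasDerivedFrom link, no activity recorded.
short = {
    "entity": {"ex:original": {}, "ex:copy": {}},
    "wasDerivedFrom": {
        "_:d1": {"prov:generatedEntity": "ex:copy",
                 "prov:usedEntity": "ex:original"}
    },
}

# Variant B: the same statement via a placeholder activity without attributes.
long = {
    "entity": {"ex:original": {}, "ex:copy": {}},
    "activity": {"ex:copy_step": {}},
    "used": {
        "_:u1": {"prov:activity": "ex:copy_step", "prov:entity": "ex:original"}
    },
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:copy", "prov:activity": "ex:copy_step"}
    },
}


def record_count(doc):
    """Number of entity, activity and relation records in a document."""
    return sum(len(records) for records in doc.values())


print(record_count(short), record_count(long))  # 3 5
print(json.dumps(short, indent=2))
```

Two extra records per derivation is not dramatic for one file, but it adds up in a header that describes a longer history.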

But yes, having multiple ways to do the same thing makes TAP queries 
much harder (UNION, *sigh* ...).

Hm - just another thought: imagine we drop wasDerivedFrom and 
wasInformedBy from the model. Could we then just re-introduce 
wasDerivedFrom/wasInformedBy only in the serializations (for those cases 
with empty in-between activities/entities), since wasDerivedFrom is a 
valid construct in W3C serializations? So we just use it in order to 
optimize serializations? Would that make any sense or confuse everyone 
completely?

>> Also, W3C tools can interpret WasDerivedFrom-relations (since it's borrowed
>> from W3C), but wouldn't be able to "understand" it, if it's hidden in the
>> roles and description classes.
> 
> That's a serious issue.  The problem with the description classes I
> understand, but that I think is a minor issue; whatever progenitor is
> declared in a description class is probably largely formal in the
> first place (as all instances depend on it).
> 
> The problem with the roles I don't understand, but probably because I
> don't really know much about the different models.  

You could specify in the descriptions that the input data with 
used.role='r1' is the progenitor of the output data with 
wasGeneratedBy.role='r2'. A VO client may gain knowledge about this, but 
a W3C tool wouldn't know about the special meanings of these roles, and 
thus couldn't give any information about direct progenitors of the 
output data.

> But doesn't the
> W3C have similar problems, too?  Why can't we do as they do?

If we do it as in the W3C model, then we need to keep wasDerivedFrom and 
wasInformedBy.

>> We introduced WasInformedBy (again borrowed from W3C) based on use cases
>> that describe pipelines, chains of activities, where defining and recording
>> the intermediate entities is not needed. In that sense, WasInformedBy is a
>> short-cut to Used/WasGeneratedBy again, but in contrast to WasDerivedFrom it
>> does not provide any further insights. It's really just meant to be used as
>> a short-cut when intermediate entities are unimportant.
> 
> But can't this be replaced by a single Activity then, taking the
> inputs of the first pipeline element as inputs and producing the
> output(s) of the last pipeline element?  How would that be more
> complicated?

Yes, that can be done and that's what the "ActivityFlow" is used for. 
Its individual steps (activities) may be important, however, in order to 
know what has been done to the input dataset and in which order. Here 
the hadStep relations are used to link member-activities with their 
activityFlow and wasInformedBy is used to chain the member-activities 
together in the correct order.
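
A minimal sketch of how a client could recover the step order from these two relations. The step names are invented, and the code assumes a simple linear chain with no branching:

```python
# (activityFlow, member activity)
had_step = [
    ("calibration", "flat_fielding"),
    ("calibration", "bias_subtraction"),
    ("calibration", "dark_frame_correction"),
]
# (informed activity, informant activity): the informed one runs later
was_informed_by = [
    ("dark_frame_correction", "bias_subtraction"),
    ("flat_fielding", "dark_frame_correction"),
]


def ordered_steps(flow):
    """Return the member activities of `flow` in execution order."""
    steps = {s for f, s in had_step if f == flow}
    informant_of = {
        informed: informant
        for informed, informant in was_informed_by
        if informed in steps and informant in steps
    }
    successor = {informant: informed for informed, informant in informant_of.items()}
    # The first step is the one that was not informed by any other step.
    current = next(s for s in steps if s not in informant_of)
    order = [current]
    while current in successor:
        current = successor[current]
        order.append(current)
    return order


print(ordered_steps("calibration"))
# ['bias_subtraction', 'dark_frame_correction', 'flat_fielding']
```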

E.g. if I want to retrieve all images from a database from which a dark 
frame has been subtracted, then I could search for all entities which at 
some point in their history had an activity of type 'dark frame 
correction' or similar. But if there is just one activity 'calibration', 
then I am missing the finer details (which calibration steps were done).
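
Sketched in Python, such a search could look like the following. The identifiers and the in-memory dictionaries are hypothetical; a real service would run this as a query against its provenance store.

```python
# Hypothetical provenance records
activity_type = {
    "a_dark": "dark frame correction",
    "a_flat": "flat fielding",
    "a_copy": "copy",
}
was_generated_by = {"img_dark": "a_dark", "img_flat": "a_flat", "img_other": "a_copy"}
used = {
    "a_dark": ["img_raw", "dark_frame"],
    "a_flat": ["img_dark"],
    "a_copy": ["img_raw2"],
}


def history_types(entity, seen=None):
    """Collect the types of all activities in the history of `entity`."""
    seen = set() if seen is None else seen
    activity = was_generated_by.get(entity)
    if activity is None or activity in seen:
        return set()
    seen.add(activity)
    types = {activity_type[activity]}
    for inp in used.get(activity, []):
        types |= history_types(inp, seen)
    return types


matches = sorted(
    e for e in was_generated_by if "dark frame correction" in history_types(e)
)
print(matches)  # ['img_dark', 'img_flat']
```

With a single 'calibration' activity in place of the individual steps, the same search would find nothing unless the step type were recorded some other way.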

Ok, maybe we could also come up with a standard way to put all the 
attributes and parameters of the individual steps into one big activity 
(for cases where the minor steps are unimportant), so that it is still 
visible what happened to the data without explicitly modelling these 
steps ... I think we need more examples from use cases to decide this.

Cheers,
Kristin

-- 
-------------------------------------------------------
Dr. Kristin Riebe
Press and Public Outreach

Email: kriebe at aip.de, webmaster at aip.de
Phone: +49 331 7499-377
Room:  Bib/3
-------------------------------------------------------
Leibniz-Institut für Astrophysik Potsdam (AIP)
An der Sternwarte 16, D-14482 Potsdam
Vorstand: Prof. Dr. Matthias Steinmetz, Matthias Winker
Stiftung bürgerlichen Rechts
Stiftungsverzeichnis Brandenburg: 26 742-00/7026
-------------------------------------------------------

