New ProvenanceDM working draft released, part I

Tue Oct 17 23:19:30 CEST 2017

Hi Arnold,

I haven't looked at that yet.

Cheers,
Kristin

On 16.10.2017 at 23:33, Arnold Rots wrote:
> I have to apologize for not having followed the discussion, being too
> busy with other subjects.
> But it occurred to me, as we are starting to use DataCite DOIs for data
> citation, that the DataCite metadata contains some provenance items, too.
> Did anyone look to see whether the WD proposal is consistent with those
> items?
> 
> Cheers,
> 
>    - Arnold
> 
> On Oct 16, 2017 4:50 PM, "Kristin Riebe" <kriebe at aip.de> wrote:
> 
>     Hi Markus, DM,
> 
>         I think that would be a good thing altogether;  declaring proper
>         licenses might seem a fairly un-academic exercise and pretty much is
>         until you want to revive an orphaned dataset and get into trouble
>         with your legal department.  Or until you want a part of the data to
>         be included with <popular software package of your choice>.
> 
>         ProvDM would be the natural place to have that.  VOResource 1.1
>         already has some language on this, and it'd be great if we could
>         sync this between Provenance and Registry.
> 
> 
>     Okay, we could give it a try. I've put it on the TODO list on the
>     wiki page. (Just a reminder, it's
>     http://wiki.ivoa.net/twiki/bin/view/IVOA/ObservationProvenanceDataModel.)
> 
>                 happen.  Are you absolutely sure you can't fix
>                 WasGeneratedBy/Used to cover what WasDerivedFrom is
>                 designed to do and then drop WasDerivedFrom?
> 
>             [...]
>             The main difference is: not every input of an activity that
>             generated an entity will automatically have a
>             "WasDerivedFrom"; it's semantically different. E.g. an image
>             is usually "derived from" another image, but not from
>             "auxiliary" input like a configuration file or a parameter
>             (which were also used as input by the generating activity).
> 
> 
>         Hm -- why not?  I don't really see a use case to treat them
>         differently: Dependency modeling, debugging, giving credit -- for
>         all of these, isn't the configuration file pretty much the same as
>         the raw instrument output (say)?
> 
> 
>     Well, I think that 'wasDerivedFrom' is meant to be used to just give
>     you the main track, i.e. the main progenitors. So I expect a
>     wasDerivedFrom relationship only to those input files of the
>     generating activity that are the main inputs. E.g. if an image is
>     corrected using a dark frame, then the image was derived from the
>     raw image, not from the dark frame. But the raw image and dark frame
>     are both inputs.
> 
>     Here's another use case for wasDerivedFrom:
>     Imagine that you have two input images i1, i2 and two result images
>     o1 and o2 for an activity. The wasDerivedFrom relationship can then
>     tell you that o1 was derived from i1 and o2 from i2 (and not o1 from
>     i2 or so). So it's adding more information.
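
A minimal Python sketch of the two-image case above. The record layout (plain tuples in lists named used, was_generated_by and was_derived_from) and the activity name a1 are assumptions made for illustration; they are not taken from the working draft or from any W3C serialization.

```python
# Hypothetical in-memory provenance records; all names are illustrative only.
used = [("a1", "i1"), ("a1", "i2")]              # (activity, input entity)
was_generated_by = [("o1", "a1"), ("o2", "a1")]  # (output entity, activity)
was_derived_from = [("o1", "i1"), ("o2", "i2")]  # (derived entity, progenitor)


def candidate_progenitors(entity):
    """All inputs of the activities that generated `entity`."""
    activities = {a for (e, a) in was_generated_by if e == entity}
    return sorted(i for (a, i) in used if a in activities)


def direct_progenitors(entity):
    """Progenitors recorded explicitly via wasDerivedFrom."""
    return sorted(p for (e, p) in was_derived_from if e == entity)


print(candidate_progenitors("o1"))  # both inputs: ['i1', 'i2']
print(direct_progenitors("o1"))     # only the recorded progenitor: ['i1']
```

With used and wasGeneratedBy alone, both inputs remain candidates for o1; only the explicit wasDerivedFrom record narrows it down to i1.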
> 
>             In principle (i.e. I think it should work, but haven't
>             really tried it in an implementation with realistic data)
>             the "role" attribute of Used and WasGeneratedBy, together
>             with the corresponding links to description classes, can be
>             used to express which entity was derived from which
>             progenitor entity, even without the explicit WasDerivedFrom
>             link. But doing it this way would be a huge overhead for
>             those use cases where description classes are not needed.
> 
> 
>         I'm not sure I understand how that overhead comes about -- is it
>         because you can define dependencies in bulk in the description?
> 
>         Perhaps an example might help here?
> 
> 
>     I need more time to work out a good example. The idea is that you
>     can predefine the expected input and output data types for each
>     activity using ActivityDescription, EntityDescription and their
>     relations. Thus you know which of the input data is auxiliary
>     data/config file/dark frame/raw image/..., as indicated by the
>     role attribute of the corresponding used relation. This can help to
>     find out which data entity is the progenitor of another, even
>     without the wasDerivedFrom relationship.
> 
>             Similarly: what if you are not interested in the actual
>             processing step, but just want to record that one image was
>             derived from another, without any further information (e.g.
>             a copying process or a simple format conversion)? If we
>             insist on always using the Used/WasGeneratedBy construct,
>             then even for those simple cases one needs to define
>             "empty" activities, which then blow up the serialisations.
> 
> 
>         True.  But that may be a price worth paying if it streamlines client
>         code (that presumably would have to re-introduce the empty
>         activities when parsing such declarations, or their code will be
>         rife with special cases) and, in particular, query patterns in
>         ProvTAP (where two different ways to do the same thing usually
>         require UNION, which ADQL 2.0 doesn't have and that's a long way
>         from becoming mandatory in any form).
> 
> 
>     Yeah, see, I always had only the serialization formats like
>     PROV-JSON in mind, which could be put into the header of a file to
>     keep the provenance information with the data. And I imagine how
>     ugly that serialization looks with empty entities/activities all
>     over the place and all the additional relation entries which are
>     then required.
> 
>     But yes, multiple ways to do things are making TAP queries much
>     harder (UNION, *sigh* ...).
> 
>     Hm - just another thought: imagine we drop wasDerivedFrom and
>     wasInformedBy from the model. Could we then just re-introduce
>     wasDerivedFrom/wasInformedBy only in the serializations (for those
>     cases with empty in-between activities/entities), since
>     wasDerivedFrom is a valid construct in W3C serializations? So we
>     just use it in order to optimize serializations? Would that make any
>     sense or confuse everyone completely?
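
A rough sketch of what "re-introducing" the empty activities on the client side could look like, if wasDerivedFrom were kept only as a serialization shortcut. The function name expand_derivations, the tuple layout and the generated activity identifiers are all hypothetical, chosen only for this illustration.

```python
import itertools


def expand_derivations(was_derived_from):
    """Replace each (derived, progenitor) pair by an empty activity
    with one used and one wasGeneratedBy relation."""
    counter = itertools.count(1)
    activities, used, was_generated_by = [], [], []
    for derived, progenitor in was_derived_from:
        # Hypothetical naming scheme for the synthetic activities.
        activity_id = f"_empty_activity_{next(counter)}"
        activities.append(activity_id)
        used.append((activity_id, progenitor))
        was_generated_by.append((derived, activity_id))
    return activities, used, was_generated_by


# File names are made up for the example.
activities, used, was_generated_by = expand_derivations(
    [("copy.fits", "orig.fits")]
)
print(activities)
print(used)
print(was_generated_by)
```

One shortcut record becomes three records after expansion, which is the size difference discussed above.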
> 
>             Also, W3C tools can interpret WasDerivedFrom relations
>             (since it's borrowed from W3C), but wouldn't be able to
>             "understand" it if it's hidden in the roles and description
>             classes.
> 
> 
>         That's a serious issue.  The problem with the description classes I
>         understand, but that I think is a minor issue; whatever
>         progenitor is declared in a description class is probably largely
>         formal in the first place (as all instances depend on it).
> 
>         The problem with the roles I don't understand, but probably
>         because I don't really know much about the different models.
> 
> 
>     You could specify in the descriptions that the input data with
>     used.role='r1' is the progenitor of the output data with
>     wasGeneratedBy.role='r2'. A VO client may gain knowledge about this,
>     but a W3C tool wouldn't know about the special meanings of these
>     roles, and thus couldn't give any information about direct
>     progenitors of the output data.
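
A minimal sketch of the role-based lookup, assuming the description can be reduced to a mapping from an output role to the input role of its progenitor. The dictionary progenitor_roles, the record layout and the entity names are hypothetical and only serve to illustrate the idea.

```python
# Hypothetical description: outputs generated with role 'r2' have as
# progenitor the input that was used with role 'r1'.
progenitor_roles = {"r2": "r1"}  # output role -> input role

used = [
    {"activity": "a1", "entity": "raw_image", "role": "r1"},
    {"activity": "a1", "entity": "dark_frame", "role": "aux"},
]
was_generated_by = [
    {"entity": "calibrated_image", "activity": "a1", "role": "r2"},
]


def progenitors_from_roles(entity):
    """Infer progenitors of `entity` from roles, without wasDerivedFrom."""
    result = []
    for gen in was_generated_by:
        if gen["entity"] != entity:
            continue
        input_role = progenitor_roles.get(gen["role"])
        if input_role is None:
            continue  # the description says nothing about this output role
        for u in used:
            if u["activity"] == gen["activity"] and u["role"] == input_role:
                result.append(u["entity"])
    return result


print(progenitors_from_roles("calibrated_image"))  # ['raw_image']
```

A client that does not know progenitor_roles sees only two inputs and one output, so it cannot tell the raw image from the dark frame.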
> 
>         But doesn't the
>         W3C have similar problems, too?  Why can't we do as they do?
> 
> 
>     If we do it as in the W3C model, then we need to keep wasDerivedFrom
>     and wasInformedBy.
> 
>             We introduced WasInformedBy (again borrowed from W3C) based
>             on use cases that describe pipelines, chains of activities,
>             where defining and recording the intermediate entities is
>             not needed. In that sense, WasInformedBy is a short-cut to
>             Used/WasGeneratedBy again, but in contrast to WasDerivedFrom
>             it does not provide any further insights. It's really just
>             meant to be used as a short-cut when intermediate entities
>             are unimportant.
> 
> 
>         But can't this be replaced by a single Activity then, taking the
>         inputs of the first pipeline element as inputs and producing the
>         output(s) of the last pipeline element?  How would that be more
>         complicated?
> 
> 
>     Yes, that can be done and that's what the "ActivityFlow" is used
>     for. Its individual steps (activities) may be important, however, in
>     order to know what has been done to the input dataset and in which
>     order. Here the hadStep relations are used to link member-activities
>     with their activityFlow and wasInformedBy is used to chain the
>     member-activities together in the correct order.
> 
>     E.g. if I want to retrieve all images from a database from which a
>     dark frame has been subtracted, then I could search for all entities
>     which at some point in their history had an activity of type 'dark
>     frame correction' or similar. But if there is just one activity
>     'calibration', then I am missing the finer details (which
>     calibration steps were done).
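
The history search described here could be sketched as follows. The dictionaries, the entity and activity names and the type strings are assumptions for illustration; a real implementation would query a provenance database instead.

```python
# Hypothetical records; activity types and entity names are illustrative.
activity_type = {
    "a1": "dark frame correction",
    "a2": "flat fielding",
    "a3": "calibration",
}
used = {"a1": ["raw1"], "a2": ["img1"], "a3": ["raw2"]}    # activity -> inputs
generated_by = {"img1": "a1", "img2": "a2", "img3": "a3"}  # entity -> activity


def history_types(entity):
    """Collect the types of all activities in the history of `entity`."""
    types = set()
    seen = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        activity = generated_by.get(current)
        if activity is None:
            continue  # no recorded origin, e.g. a raw input
        types.add(activity_type[activity])
        stack.extend(used.get(activity, []))
    return types


def entities_with_step(step):
    """All generated entities whose history contains an activity of type `step`."""
    return sorted(e for e in generated_by if step in history_types(e))


print(entities_with_step("dark frame correction"))  # ['img1', 'img2']
```

Here img3 is not found: its only recorded activity is the coarse 'calibration', so the finer step is invisible to the search.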
> 
>     Ok, maybe we could also come up with a standard way to put all
>     the attributes and parameters of individual steps into one big
>     activity (in the case that minor steps are unimportant) in order to
>     still make visible what happened to the data without explicitly
>     modelling these steps ... I think we need more examples from use
>     cases to decide this.
> 
>     Cheers,
>     Kristin
> 

-- 
-------------------------------------------------------
Dr. Kristin Riebe
Press and Public Outreach

Email: kriebe at aip.de, webmaster at aip.de
Phone: +49 331 7499-377
Room:  Bib/3
-------------------------------------------------------
Leibniz-Institut für Astrophysik Potsdam (AIP)
An der Sternwarte 16, D-14482 Potsdam
Vorstand: Prof. Dr. Matthias Steinmetz, Matthias Winker
Stiftung bürgerlichen Rechts
Stiftungsverzeichnis Brandenburg: 26 742-00/7026
-------------------------------------------------------