New ProvenanceDM working draft released, part I

Arnold Rots arots at cfa.harvard.edu
Mon Oct 16 23:33:37 CEST 2017


I have to apologize for not having followed the discussion, being too busy
with other subjects.
But it occurred to me, as we are starting to use DataCite DOIs for data
citation, that the DataCite metadata contains some provenance items, too.
Did anyone look to see whether the WD proposal is consistent with those
items?

Cheers,

  - Arnold

On Oct 16, 2017 4:50 PM, "Kristin Riebe" <kriebe at aip.de> wrote:

> Hi Markus, DM,
>
>> I think that would be a good thing altogether;  declaring proper
>> licenses might seem a fairly un-academic exercise, and pretty much is,
>> until you want to revive an orphaned dataset and get into trouble
>> with your legal department.  Or until you want part of the data to be
>> included with <popular software package of your choice>.
>>
>> ProvDM would be the natural place to have that.  VOResource 1.1
>> already has some language on this, and it'd be great if we could sync
>> this between Provenance and Registry.
>>
>
> Okay, we could give it a try. I've put it on the TODO list at the wiki
> page. (Just a reminder, it's
> http://wiki.ivoa.net/twiki/bin/view/IVOA/ObservationProvenanceDataModel.)
>
>>>> happen.  Are you absolutely sure you can't fix WasGeneratedBy/Used to
>>>> cover what WasDerivedFrom is designed to do and then drop
>>>> WasDerivedFrom?
>>>>
>>> [...]
>>> The main difference is: not every input of an activity that generated an
>>> entity will automatically have a "WasDerivedFrom"; it is semantically
>>> different. E.g. an image is usually "derived from" another image, but not
>>> from "auxiliary" input like a configuration file or a parameter (which
>>> were also used as input to the generating activity).
>>>
>>
>> Hm -- why not?  I don't really see a use case to treat them
>> differently: for dependency modeling, debugging, or giving credit,
>> isn't the configuration file pretty much the same as, say, the raw
>> instrument output?
>>
>
> Well, I think that 'wasDerivedFrom' is meant to give you just the main
> track, i.e. the main progenitors. So I expect a wasDerivedFrom
> relationship only for those input files of the generating activity that
> are the main inputs. E.g. if an image is corrected using a dark frame,
> then the image was derived from the raw image, not from the dark frame.
> But the raw image and the dark frame are both inputs.
>
> Here's another use case for wasDerivedFrom:
> Imagine that you have two input images i1, i2 and two result images o1 and
> o2 for an activity. The wasDerivedFrom relationship can then tell you that
> o1 was derived from i1 and o2 from i2 (and not, say, o1 from i2). So it
> adds information beyond the plain Used/WasGeneratedBy relations.
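>
> To illustrate (just a minimal sketch, assuming the W3C "prov" Python
> package and invented identifiers, not anything from the WD itself):
>
>   from prov.model import ProvDocument
>
>   doc = ProvDocument()
>   doc.add_namespace('ex', 'http://example.org/')
>
>   # two inputs, two outputs, one activity
>   i1 = doc.entity('ex:i1')
>   i2 = doc.entity('ex:i2')
>   o1 = doc.entity('ex:o1')
>   o2 = doc.entity('ex:o2')
>   act = doc.activity('ex:processing')
>
>   # used/wasGeneratedBy alone only say that the activity read i1 and i2
>   # and wrote o1 and o2 -- the pairing is lost
>   doc.used(act, i1)
>   doc.used(act, i2)
>   doc.wasGeneratedBy(o1, act)
>   doc.wasGeneratedBy(o2, act)
>
>   # wasDerivedFrom records which output came from which input
>   doc.wasDerivedFrom(o1, i1)
>   doc.wasDerivedFrom(o2, i2)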
>
>>> In principle (i.e. I think it should work, but haven't really tried it
>>> in an implementation with realistic data) the "role" attribute of Used
>>> and WasGeneratedBy, together with the corresponding links to the
>>> description classes, can be used to express which entity was derived
>>> from which progenitor entity, even without the explicit WasDerivedFrom
>>> link. But doing it this way would be a huge overhead for those use cases
>>> where description classes are not needed.
>>>
>>
>> I'm not sure I understand how that overhead comes about -- is it
>> because you can define dependencies in bulk in the description?
>>
>> Perhaps an example might help here?
>>
>
> I need more time to work out a good example. The idea is that you can
> predefine the expected input and output data types for each activity using
> ActivityDescription, EntityDescription and their relations.
> Thus you know which of the input data is auxiliary data/config
> file/dark frame/raw image/..., which is indicated by the role attribute of
> the corresponding used relation. This can help to find out which data
> entity is the progenitor of another, even without the wasDerivedFrom
> relationship.
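>
> Roughly like this (again only a sketch with the "prov" Python package;
> the role terms are invented, and the formal link to
> ActivityDescription/EntityDescription is omitted here):
>
>   from prov.model import ProvDocument
>
>   doc = ProvDocument()
>   doc.add_namespace('ex', 'http://example.org/')
>
>   raw = doc.entity('ex:rawImage')
>   dark = doc.entity('ex:darkFrame')
>   cal = doc.entity('ex:calibratedImage')
>   act = doc.activity('ex:darkCorrection')
>
>   # the role says *how* each input/output was used or produced, so a
>   # client that knows the predefined roles can tell the main progenitor
>   # from auxiliary inputs without an explicit wasDerivedFrom
>   doc.used(act, raw, other_attributes={'prov:role': 'rawImage'})
>   doc.used(act, dark, other_attributes={'prov:role': 'darkFrame'})
>   doc.wasGeneratedBy(cal, act,
>                      other_attributes={'prov:role': 'calibratedImage'})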
>
>>> Similarly: what if you are not interested in the actual processing
>>> step, but just want to record that one image was derived from another,
>>> without any further information (e.g. a copying process or a simple
>>> format conversion)? If we insist on always using the Used/WasGeneratedBy
>>> construct, then even for those simple cases one needs to define "empty"
>>> activities, which then blow up the serialisations.
>>>
>>
>> True.  But that may be a price worth paying if it streamlines client
>> code (which presumably would have to re-introduce the empty activities
>> when parsing such declarations, or else be rife with special cases)
>> and, in particular, query patterns in ProvTAP (where two different
>> ways to do the same thing usually require UNION, which ADQL 2.0
>> doesn't have and which is a long way from becoming mandatory in any
>> form).
>>
>
> Yeah, see, I always had only the serialization formats like PROV-JSON in
> mind, which could be put into the header of a file to keep the provenance
> information with the data. And I can imagine how ugly such a serialization
> looks with empty entities/activities all over the place and all the
> additional relation entries which are then required.
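>
> For a plain copy of an image, the two variants would look roughly like
> this (again only a sketch with the "prov" Python package and invented
> identifiers); serialize() then produces the PROV-JSON that could go into
> a file header:
>
>   from prov.model import ProvDocument
>
>   doc = ProvDocument()
>   doc.add_namespace('ex', 'http://example.org/')
>
>   orig = doc.entity('ex:image1')
>   copy = doc.entity('ex:image1_copy')
>
>   # variant 1: the shortcut -- a single record
>   doc.wasDerivedFrom(copy, orig)
>
>   # variant 2: without wasDerivedFrom we need an "empty" activity plus
>   # two relations just to say the same thing
>   cp = doc.activity('ex:copy')
>   doc.used(cp, orig)
>   doc.wasGeneratedBy(copy, cp)
>
>   print(doc.serialize(indent=2))   # PROV-JSON by default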
>
> But yes, multiple ways to do things are making TAP queries much harder
> (UNION, *sigh* ...).
>
> Hm - just another thought: imagine we drop wasDerivedFrom and
> wasInformedBy from the model. Could we then just re-introduce
> wasDerivedFrom/wasInformedBy only in the serializations (for those cases
> with empty in-between activities/entities), since wasDerivedFrom is a valid
> construct in W3C serializations? So we just use it in order to optimize
> serializations? Would that make any sense or confuse everyone completely?
>
>>> Also, W3C tools can interpret WasDerivedFrom relations (since the
>>> relation is borrowed from W3C), but they wouldn't be able to
>>> "understand" a derivation if it's hidden in the roles and description
>>> classes.
>>>
>>
>> That's a serious issue.  The problem with the description classes I
>> understand, but I think it is a minor one; whatever progenitor is
>> declared in a description class is probably largely formal in the
>> first place (as all instances depend on it).
>>
>> The problem with the roles I don't understand, but that's probably
>> because I don't really know much about the different models.
>>
>
> You could specify in the descriptions that the input data with
> used.role='r1' is the progenitor of the output data with
> wasGeneratedBy.role='r2'. A VO client could be made aware of this
> convention, but a W3C tool wouldn't know about the special meanings of
> these roles, and thus couldn't give any information about the direct
> progenitors of the output data.
>
>> But doesn't the W3C have similar problems, too?  Why can't we do as
>> they do?
>>
>
> If we do it as in the W3C model, then we need to keep wasDerivedFrom and
> wasInformedBy.
>
>>> We introduced WasInformedBy (again borrowed from W3C) based on use cases
>>> that describe pipelines, chains of activities, where defining and
>>> recording the intermediate entities is not needed. In that sense,
>>> WasInformedBy is a short-cut to Used/WasGeneratedBy again, but in
>>> contrast to WasDerivedFrom it does not provide any further insights.
>>> It's really just meant to be used as a short-cut when intermediate
>>> entities are unimportant.
>>>
>>
>> But can't this be replaced by a single Activity then, taking the
>> inputs of the first pipeline element as inputs and producing the
>> output(s) of the last pipeline element?  How would that be more
>> complicated?
>>
>
> Yes, that can be done, and that's what the "ActivityFlow" is used for. Its
> individual steps (activities) may be important, however, in order to know
> what has been done to the input dataset and in which order. Here the
> hadStep relations are used to link member activities with their
> ActivityFlow, and wasInformedBy is used to chain the member activities
> together in the correct order.
>
> E.g. if I want to retrieve all images from a database where the dark frame
> has been subtracted, then I could search for all entities which at some
> point in their history had an activity of type 'dark frame correction' or
> similar. But if there is just one activity 'calibration', then I am
> missing the finer details (which calibration steps were done).
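>
> The kind of search I have in mind, as a toy sketch in plain Python over
> made-up dictionaries (not a real service API):
>
>   # entity id -> id of the activity that generated it (wasGeneratedBy)
>   generated_by = {'calib1': 'act2', 'inter1': 'act1'}
>   # activity id -> ids of the entities it used
>   used = {'act1': ['raw1'], 'act2': ['inter1', 'dark1']}
>   # activity id -> activity type
>   activity_type = {'act1': 'dark frame correction', 'act2': 'flat fielding'}
>
>   def history_has(entity, wanted_type):
>       """True if an activity of the wanted type occurs upstream of entity."""
>       act = generated_by.get(entity)
>       if act is None:
>           return False
>       if activity_type.get(act) == wanted_type:
>           return True
>       return any(history_has(e, wanted_type) for e in used.get(act, []))
>
>   print(history_has('calib1', 'dark frame correction'))   # True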
>
> OK, maybe we could also come up with a standard way to put all the
> attributes and parameters of the individual steps into one big activity
> (in case the minor steps are unimportant), so that what happened to the
> data is still visible without explicitly modelling these steps ... I think
> we need more examples from use cases to decide this.
>
> Cheers,
> Kristin
>
> --
> -------------------------------------------------------
> Dr. Kristin Riebe
> Press and Public Outreach
>
> Email: kriebe at aip.de, webmaster at aip.de
> Phone: +49 331 7499-377
> Room:  Bib/3
> -------------------------------------------------------
> Leibniz-Institut für Astrophysik Potsdam (AIP)
> An der Sternwarte 16, D-14482 Potsdam
> Executive board: Prof. Dr. Matthias Steinmetz, Matthias Winker
> Foundation under civil law (Stiftung bürgerlichen Rechts)
> Brandenburg foundation register: 26 742-00/7026
> -------------------------------------------------------
>