New ProvenanceDM working draft released, part I
Kristin Riebe
kriebe at aip.de
Tue Oct 17 23:19:30 CEST 2017
Hi Arnold,
I haven't looked at that yet.
Cheers,
Kristin
On 16.10.2017 at 23:33, Arnold Rots wrote:
> I have to apologize for not having followed the discussion, being too
> busy with other subjects.
> But it occurred to me, as we are starting to use DataCite DOIs for data
> citation, that the DataCite metadata contains some provenance items, too.
> Did anyone look to see whether the WD proposal is consistent with those
> items?
>
> Cheers,
>
> - Arnold
>
> On Oct 16, 2017 4:50 PM, "Kristin Riebe" <kriebe at aip.de> wrote:
>
> Hi Markus, DM,
>
> I think that would be a good thing altogether; declaring proper
> licenses might seem a fairly un-academic exercise and pretty much is
> until you want to revive an orphaned dataset and get into trouble
> with your legal department. Or until you want part of the data to be
> included with <popular software package of your choice>.
>
> ProvDM would be the natural place to have that. VOResource 1.1
> already has some language on this, and it'd be great if we could sync
> this between Provenance and Registry.
>
>
> Okay, we could give it a try. I've put it on the TODO list at the
> wiki page. (Just a reminder, it's
> <http://wiki.ivoa.net/twiki/bin/view/IVOA/ObservationProvenanceDataModel>.)
>
> happen. Are you absolutely sure you can't fix WasGeneratedBy/Used to
> cover what WasDerivedFrom is designed to do and then drop
> WasDerivedFrom?
>
> [...]
> The main difference is: not every input of an activity that generated
> an entity will automatically have a "WasDerivedFrom"; it is
> semantically different. E.g. an image is usually "derived from"
> another image, but not from "auxiliary" input like a configuration
> file or a parameter (which were also used as input by the generating
> activity).
>
>
> Hm -- why not? I don't really see a use case to treat them
> differently: Dependency modeling, debugging, giving credit -- isn't
> for all of these the configuration file pretty much the same as the
> raw instrument output (say)?
>
>
> Well, I think that 'wasDerivedFrom' is meant to be used to just give
> you the main track, i.e. the main progenitors. So I expect a
> wasDerivedFrom relationship only to those input files of the
> generating activity that are the main inputs. E.g. if an image is
> corrected using a dark frame, then the image was derived from the
> raw image, not from the dark frame. But the raw image and dark frame
> are both inputs.
>
> Here's another use case for wasDerivedFrom:
> Imagine that you have two input images i1, i2 and two result images
> o1 and o2 for an activity. The wasDerivedFrom relationship can then
> tell you that o1 was derived from i1 and o2 from i2 (and not o1 from
> i2 or so). So it's adding more information.
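The two-image use case above can be sketched in a few lines of Python. This is only an illustration, not part of the draft: the identifiers a1, i1, i2, o1, o2 and the tuple layout are made up for the example; only the relation names come from the discussion.

```python
# Hypothetical identifiers; only the relation names come from the thread.
used = [("a1", "i1"), ("a1", "i2")]               # (activity, entity)
was_generated_by = [("o1", "a1"), ("o2", "a1")]   # (entity, activity)
was_derived_from = [("o1", "i1"), ("o2", "i2")]   # (derived, progenitor)


def candidate_progenitors(entity):
    """All inputs of the generating activity.

    This is the most that Used/WasGeneratedBy alone can tell us.
    """
    activities = {a for e, a in was_generated_by if e == entity}
    return sorted(e for a, e in used if a in activities)


def recorded_progenitors(entity):
    """Progenitors stated explicitly through wasDerivedFrom."""
    return sorted(p for d, p in was_derived_from if d == entity)


print(candidate_progenitors("o1"))  # both inputs of a1
print(recorded_progenitors("o1"))   # only the recorded progenitor
```

Without the third list, o1 can only be traced back to "i1 or i2"; with it, the pairing is unambiguous.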
>
> In principle (i.e. I think it should work, but haven't really tried it
> in an implementation with realistic data) the "role" attribute of Used
> and WasGeneratedBy, together with the corresponding links to
> description classes, can be used to express which entity was derived
> from which progenitor entity, even without the explicit WasDerivedFrom
> link. But doing it this way would be a huge overhead for those use
> cases where description classes are not needed.
>
>
> I'm not sure I understand how that overhead comes about -- is it
> because you can define dependencies in bulk in the description?
>
> Perhaps an example might help here?
>
>
> I need more time to work out a good example. The idea is that you
> can predefine the expected input and output datatypes for each
> activity using ActivityDescription, EntityDescription and their
> relations.
> And thus you know which of the input data is auxiliary data/config
> file/dark frame/raw image/..., which is indicated by the
> role-attribute of the corresponding used-relation. This can help to
> find out which data entity is progenitor of another, even without
> the wasDerivedFrom relationship.
>
> Similarly: what if you are not interested in the actual processing
> step, but just want to record that one image was derived from another,
> without any further information (e.g. a copying process or a simple
> format conversion)? If we insist on always using the
> Used/WasGeneratedBy construct, then even for those simple cases one
> needs to define "empty" activities, which then blow up the
> serialisations.
>
>
> True. But that may be a price worth paying if it streamlines client
> code (that presumably would have to re-introduce the empty activities
> when parsing such declarations, or their code will be rife with
> special cases) and, in particular, query patterns in ProvTAP (where
> two different ways to do the same thing usually require UNION, which
> ADQL 2.0 doesn't have and that's a long way from becoming mandatory
> in any form).
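To make the UNION point concrete, here is a small sketch using plain SQL in an in-memory SQLite database. It is not ADQL and does not follow any ProvTAP schema: the table and column names are assumptions invented for the example. One dataset is documented through an explicit activity, the other only through the shortcut relation, and a query for "all progenitors" has to combine both patterns.

```python
import sqlite3

# Hypothetical table layout, NOT the ProvTAP schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE used (activity TEXT, entity TEXT);
CREATE TABLE was_generated_by (entity TEXT, activity TEXT);
CREATE TABLE was_derived_from (derived TEXT, progenitor TEXT);
-- o1 is documented through an explicit activity a1 ...
INSERT INTO used VALUES ('a1', 'raw1');
INSERT INTO was_generated_by VALUES ('o1', 'a1');
-- ... while o2 only carries the shortcut relation.
INSERT INTO was_derived_from VALUES ('o2', 'raw2');
""")

# Two ways of recording the same fact force two query branches.
query = """
SELECT g.entity AS derived, u.entity AS progenitor
  FROM was_generated_by AS g
  JOIN used AS u ON u.activity = g.activity
UNION
SELECT derived, progenitor FROM was_derived_from
ORDER BY derived
"""
rows = conn.execute(query).fetchall()
for derived, progenitor in rows:
    print(derived, "<-", progenitor)
```

If only one of the two patterns were allowed in the model, either branch of the query would be sufficient on its own.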
>
>
> Yeah, see, I always had only the serialization formats like
> PROV-JSON in mind, which could be put into the header of a file to
> keep the provenance information with the data. And I imagine how
> ugly that serialization looks with empty entities/activities all
> over the place and all the additional relation entries which are
> then required.
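As a rough illustration of the size difference, the sketch below builds the same "copy" relationship twice as PROV-JSON-style Python dictionaries: once with the wasDerivedFrom shortcut and once with an explicit, otherwise empty activity. The "ex" prefix and all identifiers are invented for the example, and the key names follow my reading of the W3C PROV-JSON member submission, so please check them against that document before relying on them.

```python
import json

PREFIX = {"ex": "http://example.org/"}  # hypothetical namespace

# Shortcut form: a single relation entry links the two entities.
shortcut = {
    "prefix": PREFIX,
    "entity": {"ex:original": {}, "ex:copy": {}},
    "wasDerivedFrom": {
        "_:d1": {"prov:generatedEntity": "ex:copy",
                 "prov:usedEntity": "ex:original"},
    },
}

# Explicit form: an "empty" activity plus two relation entries.
explicit = {
    "prefix": PREFIX,
    "entity": {"ex:original": {}, "ex:copy": {}},
    "activity": {"ex:copying": {}},
    "used": {
        "_:u1": {"prov:activity": "ex:copying",
                 "prov:entity": "ex:original"},
    },
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:copy",
                 "prov:activity": "ex:copying"},
    },
}


def record_count(doc):
    """Number of entity, activity and relation records in a document."""
    return sum(len(records) for key, records in doc.items()
               if key != "prefix")


print(json.dumps(shortcut, indent=2))
print(record_count(shortcut), "records vs.", record_count(explicit))
```

Here the explicit form needs five records where the shortcut needs three, and the gap grows with every additional trivial step.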
>
> But yes, multiple ways to do things are making TAP queries much
> harder (UNION, *sigh* ...).
>
> Hm - just another thought: imagine we drop wasDerivedFrom and
> wasInformedBy from the model. Could we then just re-introduce
> wasDerivedFrom/wasInformedBy only in the serializations (for those
> cases with empty in-between activities/entities), since
> wasDerivedFrom is a valid construct in W3C serializations? So we
> just use it in order to optimize serializations? Would that make any
> sense or confuse everyone completely?
>
> Also, W3C tools can interpret WasDerivedFrom relations (since it's
> borrowed from W3C), but wouldn't be able to "understand" it if it's
> hidden in the roles and description classes.
>
>
> That's a serious issue. The problem with the description classes I
> understand, but that I think is a minor issue; whatever progenitor is
> declared in a description class is probably largely formal in the
> first place (as all instances depend on it).
>
> The problem with the roles I don't understand, but probably because I
> don't really know much about the different models.
>
>
> You could specify in the descriptions that the input data with
> used.role='r1' is the progenitor of the output data with
> wasGeneratedBy.role='r2'. A VO client may gain knowledge about this,
> but a W3C tool wouldn't know about the special meanings of these
> roles, and thus couldn't give any information about direct
> progenitors of the output data.
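A minimal sketch of that role-based inference, assuming a made-up rule table that stands in for the description classes: the names progenitor_rules, "dark_correction" and the file names are hypothetical, and only the roles 'r1' and 'r2' are taken from the example above.

```python
# Hypothetical stand-in for the description classes: for activities of
# this kind, the input with role 'r1' is the progenitor of the output
# with role 'r2'.
progenitor_rules = {"dark_correction": [("r1", "r2")]}

activities = {"a1": "dark_correction"}            # activity -> description
used = [("a1", "raw.fits", "r1"),                 # (activity, entity, role)
        ("a1", "dark.fits", "aux")]
was_generated_by = [("calibrated.fits", "a1", "r2")]  # (entity, activity, role)


def infer_derivations():
    """Return (derived, progenitor) pairs implied by roles and rules."""
    derived = []
    for activity, description in activities.items():
        for in_role, out_role in progenitor_rules.get(description, []):
            inputs = [e for a, e, r in used
                      if a == activity and r == in_role]
            outputs = [e for e, a, r in was_generated_by
                       if a == activity and r == out_role]
            derived.extend((o, i) for o in outputs for i in inputs)
    return derived


print(infer_derivations())
```

A client that knows the rule table can recover the progenitor; a generic W3C tool sees only three relations with opaque role strings, which is the limitation described above.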
>
> But doesn't the W3C have similar problems, too? Why can't we do as
> they do?
>
>
> If we do it as in the W3C model, then we need to keep wasDerivedFrom
> and wasInformedBy.
>
> We introduced WasInformedBy (again borrowed from W3C) based on use
> cases that describe pipelines, chains of activities, where defining
> and recording the intermediate entities is not needed. In that sense,
> WasInformedBy is a short-cut to Used/WasGeneratedBy again, but in
> contrast to WasDerivedFrom it does not provide any further insights.
> It's really just meant to be used as a short-cut when intermediate
> entities are unimportant.
>
>
> But can't this be replaced by a single Activity then, taking the
> inputs of the first pipeline element as inputs and producing the
> output(s) of the last pipeline element? How would that be more
> complicated?
>
>
> Yes, that can be done and that's what the "ActivityFlow" is used
> for. Its individual steps (activities) may be important, however, in
> order to know what has been done to the input dataset and in which
> order. Here the hadStep relations are used to link member-activities
> with their activityFlow and wasInformedBy is used to chain the
> member-activities together in the correct order.
>
> E.g. if I want to retrieve all images from a database from which a
> dark frame has been subtracted, then I could search for all entities
> which at some point in their history had an activity of type 'dark
> frame correction' or similar. But if there is just one activity
> 'calibration', then I am missing the finer details (which calibration
> steps were done).
>
> OK, maybe we could also come up with a standard way to put all the
> attributes and parameters of individual steps into one big activity
> (in the case that minor steps are unimportant) in order to still make
> visible what happened to the data without explicitly modelling these
> steps ... I think we need more examples from use cases to decide this.
>
> Cheers,
> Kristin
>
> --
> -------------------------------------------------------
> Dr. Kristin Riebe
> Press and Public Outreach
>
> Email: kriebe at aip.de, webmaster at aip.de
> Phone: +49 331 7499-377
> Room: Bib/3
> -------------------------------------------------------
> Leibniz-Institut für Astrophysik Potsdam (AIP)
> An der Sternwarte 16, D-14482 Potsdam
> Vorstand: Prof. Dr. Matthias Steinmetz, Matthias Winker
> Stiftung bürgerlichen Rechts
> Stiftungsverzeichnis Brandenburg: 26 742-00/7026
> -------------------------------------------------------
>