New ProvenanceDM working draft released, part I

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Mon Oct 16 11:41:28 CEST 2017


Hi DM,

I reviewed Kristin's first reaction (which I had initially
postponed), and I think there are a few aspects in it that have not
been addressed in the other discussions.  So:

On Tue, Oct 10, 2017 at 11:22:39PM +0200, Kristin Riebe wrote:
> > p. 15 rights with values "public, restricted, internal" -- I realise
> > that provenance has very different use cases from Registry, but a
> > similar enumeration in VOResource turned out to be largely useless.  I
> > *think* what in practice would make much more sense is license URIs
> > (e.g., CC-0, CC-BY, etc, with perhaps some custom IVOA URIs for
> > proprietary and proprietary-unavailable data).
> 
> Hm, I'm not sure that proper licensing of datasets will happen any time
> soon. But in order to be more compatible with other VO models, we should use
> DatasetDM's RightsType (section 6.2.3 of the WorkingDraft), which defines
> "public, secure and proprietary". Or come up with a different scheme
> together with the DatasetDM authors.

I think that would be a good thing altogether; declaring proper
licenses might seem a fairly un-academic exercise, and it pretty much
is, until you want to revive an orphaned dataset and get into trouble
with your legal department.  Or until you want part of the data to be
included with <popular software package of your choice>.

ProvDM would be the natural place to have that.  VOResource 1.1
already has some language on this, and it'd be great if we could sync
this between Provenance and Registry.


> > p. 17 "The information this [WasDerivedFrom] relation provides is
> > somewhat redundant..." -- this scares me.  I've not properly thought
> > through the relationship between WasDerivedFrom on the one side and
[...]
> > happen.  Are you absolutely sure you can't fix WasGeneratedBy/Used to
> > cover what WasDerivedFrom is designed to do and then drop
> > WasDerivedFrom?
> 
> I understand your worries.
> The main difference is: not every input of an activity that generated an
> entity will automatically have a "WasDerivedFrom", it's semantically
> different. E.g. an image is usually "derived from" another image, but not
> from "auxiliary" input like a configuration file or a parameter (which were
> also used as input from the generating activity).

Hm -- why not?  I don't really see a use case for treating them
differently: for dependency modeling, debugging, and giving credit,
isn't the configuration file pretty much the same as the raw
instrument output (say)?
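
To make the question concrete, here is a minimal Python sketch of
deriving progenitors purely from Used/WasGeneratedBy, with every input
treated alike.  The record layout (tuples of activity id, entity id,
role) and the names used, generated, and derive_progenitors are
made up for illustration; none of this is taken from the draft.

```python
from collections import defaultdict

# Hypothetical records: (activity_id, entity_id, role).
used = [
    ("act1", "raw_image", "science_frame"),
    ("act1", "config_file", "configuration"),
]
generated = [
    ("act1", "calibrated_image", "result"),
]


def derive_progenitors(used, generated):
    """Return the (derived_entity, progenitor) pairs implied by
    Used/WasGeneratedBy alone, without any explicit WasDerivedFrom."""
    inputs_by_activity = defaultdict(list)
    for activity, entity, _role in used:
        inputs_by_activity[activity].append(entity)

    pairs = set()
    for activity, entity, _role in generated:
        # Every input of the generating activity counts as a progenitor.
        for progenitor in inputs_by_activity[activity]:
            pairs.add((entity, progenitor))
    return pairs
```

In this sketch the configuration file shows up as a progenitor just
like the raw image; a client that cares about the distinction could
presumably still filter on the role.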

> In principle (i.e. I think it should work, but haven't really tried it in an
> implementation with realistic data) the "role" attribute to Used and
> WasGeneratedBy, together with the corresponding links to description classes
> can be used to express which entity was derived from which progenitor
> entity, even without the explicit WasDerivedFrom link. But doing it this way
> would be a huge overhead for those use cases where description classes are
> not needed.

I'm not sure I understand how that overhead comes about -- is it
because you can define dependencies in bulk in the description?

Perhaps an example might help here?

> Similarly: what if you are not interested in the actual processing step, but
> just want to record that one image was derived from another, without any
> further information? (e.g. copying process, simple format conversion). If we
> insist on using the Used/WasGeneratedBy construct always, then even for
> those simple cases one needs to define "empty" activities, which then
> blow-up the serialisations.

True.  But that may be a price worth paying if it streamlines client
code (which presumably would have to re-introduce the empty
activities when parsing such declarations, or else be rife with
special cases) and, in particular, query patterns in ProvTAP (where
two different ways to do the same thing usually require UNION, which
ADQL 2.0 doesn't have and which is a long way from becoming mandatory
in any form).
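
As a rough illustration of the client-side cost, here is a minimal
Python sketch of the normalisation step I would expect such a client
to need.  The tuple layouts, the function name normalise, and the
"_implicit_N" activity ids are my own inventions for this example,
not anything defined in the draft.

```python
import itertools


def normalise(used, generated, derived_from):
    """Rewrite bare WasDerivedFrom pairs as Used/WasGeneratedBy records
    that go through a synthetic, otherwise empty activity.

    used, generated: lists of (activity_id, entity_id) pairs
    derived_from:    list of (derived_entity, progenitor) pairs
    """
    used = list(used)
    generated = list(generated)
    counter = itertools.count(1)
    for derived, progenitor in derived_from:
        # One placeholder activity per shortcut relation.
        activity = f"_implicit_{next(counter)}"
        used.append((activity, progenitor))
        generated.append((activity, derived))
    return used, generated
```

After this step the rest of the client only has to deal with one
representation.  The sketch does not check whether a shortcut
duplicates an activity that is already recorded; a real client would
probably have to.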

There is, of course, the additional philosophical aspect that if you
allow sloppiness, people will take more advantage of that than you'd
like.

But granted, this has to be weighed against the fact that lousy
provenance is probably better than none at all, and none is what we
get if we ask too much.  So -- I'm just pleading that it's worth
trying really hard to ensure there's "one obvious way to do it" for as
many "it"s as possible (and that that obvious way is what people will
typically try first).

> Also, W3C tools can interpret WasDerivedFrom-relations (since it's borrowed
> from W3C), but wouldn't be able to "understand" it, if it's hidden in the
> roles and description classes.

That's a serious issue.  The problem with the description classes I
understand, but that I think is a minor issue; whatever progenitor is
declared in a description class is probably largely formal in the
first place (as all instances depend on it).

The problem with the roles I don't understand, but probably because I
don't really know much about the different models.  But doesn't the
W3C have similar problems, too?  Why can't we do as they do?


> > p. 19f WasInformedBy vs. ActivityFlow -- Again, I'm a bit alarmed that
> > there are two "features" here that apparently serve the same purpose:
> > Hide intermediate entities.  We're not doing anyone a favour by enabling
> > a "choose what you like" approach.  I'd say we should pick one, and
> > since it seems to me the less ugly alternative, I'd go for ActivityFlow.
> 
> We introduced WasInformedBy (again borrowed from W3C) based on use cases
> that describe pipelines, chains of activities, where defining and recording
> the intermediate entities is not needed. In that sense, WasInformedBy is a
> short-cut to Used/WasGeneratedBy again, but in contrast to WasDerivedFrom it
> does not provide any further insights. It's really just meant to be used as
> a short-cut when intermediate entities are unimportant.

But can't this be replaced by a single Activity then, taking the
inputs of the first pipeline element as inputs and producing the
output(s) of the last pipeline element?  How would that be more
complicated?
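
A minimal Python sketch of what I mean, under the assumption that each
pipeline step is known with its inputs and outputs.  The dict layout
and the name collapse_pipeline are hypothetical.  It generalises
slightly: anything consumed but not produced inside the chain becomes
an input, anything produced but not consumed becomes an output.  For a
strictly linear chain that reduces to the inputs of the first element
and the outputs of the last.

```python
def collapse_pipeline(steps):
    """Collapse an ordered chain of pipeline steps into one activity.

    Each step is a dict with "name", "inputs", and "outputs" keys.
    Entities both produced and consumed inside the chain are hidden.
    """
    produced = set()
    consumed = set()
    for step in steps:
        produced.update(step["outputs"])
        consumed.update(step["inputs"])
    return {
        "name": "+".join(step["name"] for step in steps),
        "inputs": sorted(consumed - produced),
        "outputs": sorted(produced - consumed),
    }


pipeline = [
    {"name": "debias", "inputs": ["raw"], "outputs": ["tmp1"]},
    {"name": "flatfield", "inputs": ["tmp1", "flat"], "outputs": ["tmp2"]},
    {"name": "stack", "inputs": ["tmp2"], "outputs": ["final"]},
]
combined = collapse_pipeline(pipeline)
```

The combined activity then uses "flat" and "raw" and generates
"final", with tmp1 and tmp2 never appearing in the record.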

         -- Markus

