New ProvenanceDM working draft released, part I

Tue Oct 10 23:22:39 CEST 2017

Hi Markus,

thanks for getting discussions started! :-)
Some of your suggestions would certainly cut the draft to a more 
digestible number of pages. ;-)

So, first short answers to your TL;DR:

> TL;DR: I think that...
> 
> * ...there should be no custom PROV-VOTable
- agreed; it should come from the VO-DML mapping eventually, but since we 
don't have that yet, the tables show in principle what we expect from it

> * ...there should be no ProvDAL, just ProvTAP
- nope, disagree; they serve different use cases (ProvDAL for retrieving 
serialized provenance metadata from a service, e.g. for use with other 
VO/W3C tools; ProvTAP for searching for specific information based on 
provenance metadata)

> * ...if we can have W3C-compatible serialisations, we shouldn't have
>    slightly modified, incompatible ones.
- The benefit of the incompatible ones is that they express the 
structure of our data model more directly (which, I admit, they don't 
have to do). Once we have more experience with our implementations (I 
still have to try W3C serializations of parameters and description 
classes), we can reconsider that decision.

> * ...we really, *really*, REALLY should try to make sure that IVOA DMs
>    use common types.  If ProvDM and DatasetDM model talk about the same
>    domain, they shouldn't use different terms and different classes
- Well, yeah, *sigh*. Dataset-authors - do you see any chance of 
reopening discussions on your data model classes and attributes?

------------------------------------------------------------------------
Here's a start on the long-version discussion. I'll split it up and send 
more answers over the next few days.

> p. 6, Use case A: "image from catalog xxx"  -- why "from catalog?"  Why
> not just say "this image"?  Oh, and PSA: Don't use the string "xxx" in
> your documents unless you don't mind that stupid software (in use, e.g.,
> in schools) will block your document.

Okay, thanks.

> p. 10, "auto-generated documentation of all classes...
> https://volute..." -- while a WD might get away with this, at the latest
> the PR needs to have all ancillary files in the document repository.
> I'm happy to help working out some aids to make that easier in ivoatex
> (e.g., by allowing versioned URLs) -- a VCS, anyway, is far too volatile
> for links in standards to point there (unless the IVOA sanctioned first
> the VCS and then some practices to ensure a certain amount of
> stability).

And here I thought I could be proud to have this auto-generation stuff 
working ... :-)
Sure, we can put it at a more permanent place in the document repo when 
everything is stable, or refer to the volute-repo with a revision number.

Right now, this documentation is rather an add-on; the attributes for 
each class are also listed in tables in the document. I still have some 
issues with the UML and auto-generated classes, but I'll open another 
thread for this.

> p. 14, "For entities, we suggest..." -- stylistically, I'd much prefer
> language that makes clear what's the normative content.  Here, that
> might be "Entities have the attributes listed in..." or something like
> that.  While I'm discussing style issues: you're writing "NOT" on p. 15,
> and I'd much prefer the use of \emph (or, if you insist, \textbf) over
> capitalisation to express emphasis.

Okay, thanks.

> p. 15 rights with values "public, restricted, internal" -- I realise
> that provenance has very different use cases from Registry, but a
> similar enumeration in VOResource turned out to be largely useless.  I
> *think* what in practice would make much more sense is license URIs
> (e.g., CC-0, CC-BY, etc, with perhaps some custom IVOA URIs for
> proprietary and proprietary-unavailable data).

Hm, I'm not sure that proper licensing of datasets will happen any time 
soon. But in order to be more compatible with other VO models, we should 
use DatasetDM's RightsType (section 6.2.3 of the WorkingDraft), which 
defines "public, secure and proprietary". Or come up with a different 
scheme together with the DatasetDM authors.

> p. 17 "The information this [WasDerivedFrom] relation provides is
> somewhat redundant..." -- this scares me.  I've not properly thought
> through the relationship between WasDerivedFrom on the one side and
> WasGeneratedBy and Used on the other.  But I'm worried if whatever
> deficiencies the latter may have are fixed with something reeking of
> optional.  Optional features very typically are close to useless to
> clients simply because if they can't rely on it, basing code on it is
> tricky; if whatever is optional provides some essential value (and
> certainly, a make-style "flow graph" of the dependencies between the
> entities is essential), it's an interoperability disaster waiting to
> happen.  Are you absolutely sure you can't fix WasGeneratedBy/Used to
> cover what WasDerivedFrom is designed to do and then drop
> WasDerivedFrom?

I understand your worries.
The main difference is that not every input of an activity that 
generated an entity automatically has a "WasDerivedFrom" relation to 
that entity; the two are semantically different. E.g. an image is 
usually "derived from" another image, but not from "auxiliary" input 
like a configuration file or a parameter (which were also used as input 
by the generating activity).

There are simple use cases with no auxiliary input, so the 
WasDerivedFrom relations can indeed be auto-generated from the 
Used/WasGeneratedBy relations (Mireille and Francois are using it in 
their ProvTAP implementation) and thus really are redundant, but that's 
not possible for all use cases.
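
A minimal sketch of that auto-generation for the simple case, assuming 
the relations are kept as plain tuples. The identifiers and the helper 
`derive_all` are made up for illustration and are not taken from the 
ProvTAP implementation:

```python
# Hypothetical toy records; identifiers are illustrative only.
used = [("act1", "raw_image")]              # (activity, entity used as input)
was_generated_by = [("cal_image", "act1")]  # (entity, generating activity)


def derive_all(used, was_generated_by):
    """Link every generated entity to every input of its generating activity."""
    inputs = {}
    for activity, entity in used:
        inputs.setdefault(activity, []).append(entity)
    return [
        (generated, source)
        for generated, activity in was_generated_by
        for source in inputs.get(activity, [])
    ]


derived = derive_all(used, was_generated_by)
```

If `used` also contained an auxiliary input such as a configuration 
file, this helper would report the image as derived from that file as 
well, which is why the shortcut only holds for the simple cases.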

In principle (i.e. I think it should work, but haven't really tried it 
in an implementation with realistic data), the "role" attribute of Used 
and WasGeneratedBy, together with the corresponding links to description 
classes, can be used to express which entity was derived from which 
progenitor entity, even without the explicit WasDerivedFrom link. But 
doing it this way would be a huge overhead for those use cases where 
description classes are not needed.
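
To illustrate the role-based alternative, here is a rough sketch. It 
simplifies things by looking only at a role string on Used and leaves 
out the description classes entirely. The role labels and the helper 
`derive_by_role` are hypothetical, not values from the draft:

```python
# Hypothetical role labels marking auxiliary input; not defined in the draft.
AUXILIARY_ROLES = {"configuration", "parameter"}

# (activity, entity, role) for Used; (entity, activity) for WasGeneratedBy.
used = [
    ("act1", "raw_image", "science_input"),
    ("act1", "config_file", "configuration"),
]
was_generated_by = [("cal_image", "act1")]


def derive_by_role(used, was_generated_by, auxiliary_roles=AUXILIARY_ROLES):
    """Derive only from inputs whose role is not marked as auxiliary."""
    progenitors = {}
    for activity, entity, role in used:
        if role not in auxiliary_roles:
            progenitors.setdefault(activity, []).append(entity)
    return [
        (generated, source)
        for generated, activity in was_generated_by
        for source in progenitors.get(activity, [])
    ]


derived = derive_by_role(used, was_generated_by)
```

A client would need an agreed set of role values for this to work, 
which is part of the overhead mentioned above.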

Similarly: what if you are not interested in the actual processing step, 
but just want to record that one image was derived from another, without 
any further information? (e.g. copying process, simple format 
conversion). If we insist on using the Used/WasGeneratedBy construct 
always, then even for those simple cases one needs to define "empty" 
activities, which then blow up the serialisations.
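
As a small illustration of that overhead, here are the same facts 
written once with a direct relation and once via an "empty" activity, 
as PROV-N-style statements held in Python lists. The `ex:` identifiers 
are invented for the example:

```python
# PROV-N-style statements; the ex: identifiers are illustrative only.
direct = ["wasDerivedFrom(ex:copy, ex:original)"]

via_activity = [
    "activity(ex:copy_step)",  # "empty" activity carrying no information
    "used(ex:copy_step, ex:original, -)",
    "wasGeneratedBy(ex:copy, ex:copy_step, -)",
]

# Extra statements needed per derivation when the activity is mandatory.
overhead = len(via_activity) - len(direct)
```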

Also, W3C tools can interpret WasDerivedFrom relations (since they are 
borrowed from W3C), but wouldn't be able to "understand" the same 
information if it's hidden in the roles and description classes.

> p. 19f WasInformedBy vs. ActivityFlow -- Again, I'm a bit alarmed that
> there are two "features" here that apparently serve the same purpose:
> Hide intermediate entities.  We're not doing anyone a favour by enabling
> a "choose what you like" approach.  I'd say we should pick one, and
> since it seems to me the less ugly alternative, I'd go for ActivityFlow.

We introduced WasInformedBy (again borrowed from W3C) based on use cases 
that describe pipelines, i.e. chains of activities where defining and 
recording the intermediate entities is not needed. In that sense, 
WasInformedBy is again a shortcut for Used/WasGeneratedBy, but in 
contrast to WasDerivedFrom it does not provide any further insights. 
It's really just meant to be used as a shortcut when intermediate 
entities are unimportant.
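
A minimal sketch of the shortcut, assuming tuple-based relations again. 
The helper `informed_by` and the identifiers are made up for 
illustration. It shows which WasInformedBy links follow from a fully 
recorded chain, i.e. what remains once the intermediate entity is 
dropped:

```python
# Hypothetical toy pipeline: act1 -> intermediate -> act2.
used = [("act2", "intermediate")]              # (activity, entity)
was_generated_by = [("intermediate", "act1")]  # (entity, activity)


def informed_by(used, was_generated_by):
    """Return (informed, informant) activity pairs linked via an entity."""
    generator = {entity: activity for entity, activity in was_generated_by}
    pairs = set()
    for activity, entity in used:
        producer = generator.get(entity)
        if producer is not None and producer != activity:
            pairs.add((activity, producer))
    return sorted(pairs)


links = informed_by(used, was_generated_by)
```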

The ActivityFlow is a different thing: it's a collection of activities 
that represents a workflow, with or without intermediate entities being 
defined. So it's the Activity equivalent of Entity's Collection.

> p. 26 value's type "(value dependent)"; then, ParameterDescription has a
> datatype attribute not discussed in greater detail.  I'm not sure what
> the type system envisioned here is.  In the interest of keeping the
> total number of type systems and associated serialisation rules in the
> VO low, I'd suggest adding arraysize and xtype to ParameterDescription
> and say that parameter.value's value follows the serialisation rules of
> VOTable TABLEDATA and DALI.

Okay, sounds reasonable.

> Table 16, Agent Roles: I like the all-lower-case approach to the role
> labels, and therefore I'm dismayed about "coordinator/PI".  Not only
> upper case suddenly, but also a slash, the one character anathema to
> unix file names!  Since I guess these would end up in fragment
> identifiers in URIs eventually, that might not be as bad as it looks at
> first, but still: Please just strike the "/PI" and save us a lot of
> potential later headache.

Sure, fine with me.

More on the other issues will follow next time.

Cheers,
Kristin

-- 
Dr. Kristin Riebe
Leibniz-Institut für Astrophysik Potsdam (AIP)