IVOA Provenance DM - request for comments

Mon Oct 29 17:26:51 CET 2018

Dear DM,

On Wed, Oct 17, 2018 at 05:41:43PM +0200, Mathieu Servillat wrote:
> We are requesting comments for the IVOA Provenance Data Model. The proposed
> recommendation is available through its dedicated RFC web page and attached
> to this email:
> 
> http://wiki.ivoa.net/twiki/bin/view/IVOA/ProvenanceRFC

I've posted the following to the Wiki, but I thought having it on the
list might be more conducive to discussion, so here are my thoughts
from reviewing this.

TL;DR: let's only have the core model in 1.0.  We can always add
extensions in 1.1.

And now, with apologies for the longish mail:

(a) Can I ask you to remove "IVOA Data Model Working Group" from the
list of authors?  I don't think it helps anyone, but things like these
are painful for computers trying to do something sensible with author
lists and have stung me far too often.

(b) Introduction: "In this document, we discuss a draft of an IVOA
standard data model for describing...".  This obviously shouldn't make
it into a REC.  I'd drop the sentence right now and start with:
"According to \citet{std:W3CProvDM}, provenance is ...  For this
document, we adopt that definition.".

(c) Minimum requirements: "We derived from our goals and use cases"
doesn't seem to be quite true to me -- e.g., I don't see a use case for
exchange of provenance information with non-IVOA software ("standard
model") or even the links to other IVOA DMs.  I don't dispute these are
sane requirements, of course.  Can't you just write: "We adopt the
following requirements for the Prov DM"?

(d) In the requirements, I'm not terribly happy about "if applicable"
and friends.  Can't you, for instance, say just *which* activities are
exempt from having to have input entities?  Sure, if that gets too
verbose, it's counter-productive, but perhaps a few words can already go
a long way towards making the requirements a bit more precise?

(e) "Activities may point to output entities." -- why just "may"?  What
purpose could an output-less activity serve?

(f) "Entities, Activities and Agents [...] should have persistent
identifiers." I wouldn't do this -- many entities are fairly ephemeral,
and even recommending to obtain a DOI for, say, a flatfield is, I think,
going much too far.  Similarly, not everyone may want to have an ORCID
or spread it in a provenance database (and I'm not getting started on
the GDPR here).  And no, "it's optional" doesn't invalidate that
point: if it's a SHOULD a tool would still drop warnings if your
flatfield doesn't have a DOI, and that can very well hide actual
problems.  Can't we just strike any language on PIDs here?

(g) Fig. 3, "main core classes".  I'm still unconvinced the
wasDerivedFrom and wasInformedBy relations are a good idea in our
context.  I realise they are shortcuts and thus might seem convenient
for people *generating* provenance instances.  However, many more people
will consume them.  To them, every feature you add is extra work, and
they'd probably have to de-serialise your shortcuts into null activities
or null entities.  Which they won't appreciate.  

Also, since you could just as well generate these null entities or
activities yourself (i.e., in your provenance instances), these two
additional relationships introduce multiple ways to represent the
same thing.  That's always an indication of a feature that will lead
to headaches later.
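
To make the consumer-side cost concrete, here is a minimal sketch in
plain Python of what expanding the shortcut amounts to.  It assumes a
toy representation (wasDerivedFrom as (generated, source) tuples) and
made-up placeholder names like "null_activity_1"; none of this is
taken from the draft.

```python
def expand_derivations(derivations):
    """Rewrite wasDerivedFrom(generated, source) pairs as explicit
    placeholder activities with used / wasGeneratedBy relations.

    The tuple layout and the "null_activity_N" names are assumptions
    made for this illustration only.
    """
    activities, used, generated_by = [], [], []
    for n, (generated, source) in enumerate(derivations, start=1):
        act = "null_activity_{}".format(n)
        activities.append(act)
        used.append((act, source))            # used(activity, entity)
        generated_by.append((generated, act))  # wasGeneratedBy(entity, activity)
    return activities, used, generated_by


if __name__ == "__main__":
    print(expand_derivations([("ex:reduced", "ex:raw")]))
```

Every consumer would need some variant of this, whereas a producer
writing the explicit activity in the first place needs nothing extra.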

So, let me plead again: Are the shortcuts *really* so valuable to you
that it's worth burdening our implementors with them?

I also don't find too convincing the rationale for wasDerivedFrom on
p. 14, "If there is more than one input and more than one output to
an activity, it is not clear which entity was derived from which".
If you have an activity with multiple inputs and outputs, it stands
to reason that all inputs influence all outputs, so there's nothing
for wasDerivedFrom to annotate.  If there are distinct, unrelated
groups of inputs and outputs then you really have two activities and
you should describe them as such rather than hack around the
deficient description.
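
As a rough sketch of that reading (plain Python; the dict-of-lists
layout and the example identifiers are assumptions for illustration),
the derivations implied by used and wasGeneratedBy alone are just the
product of each activity's outputs and inputs:

```python
from itertools import product


def implied_derivations(used, generated):
    """Return the set of (output, input) pairs implied per activity.

    used:      {activity: [input entities]}   (hypothetical layout)
    generated: {activity: [output entities]}  (hypothetical layout)
    """
    pairs = set()
    for activity, outputs in generated.items():
        inputs = used.get(activity, [])
        # Every output of an activity is taken to depend on every input.
        pairs.update(product(outputs, inputs))
    return pairs


if __name__ == "__main__":
    used = {"ex:stack": ["ex:raw1", "ex:raw2"]}
    generated = {"ex:stack": ["ex:image", "ex:weights"]}
    for out, inp in sorted(implied_derivations(used, generated)):
        print(out, "<-", inp)
```

If that product contains pairs that are not true, the fix under this
view is to split the activity, not to annotate around it.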

Similarly for the "deemed to be not important enough to be recorded in a
pipeline" on wasInformedBy.  The overhead of introducing an Entity is
really not high (unless of course you require persistent identifiers for
them...).  And nothing is so insignificant that a few words of
description couldn't come in handy when someone reads a provenance
graph.

And then "state that an activity communicates with another"... hm --
that's not provenance, that's activity description ("workflow"), no?

(i) Table 1, "attributes of the Entity class": From my Registry
experience, "rights" as specified here has been profoundly useless (in
10 years of having it in the Registry nobody has used it as designed);
in VOResource 1.1 we therefore moved to DataCite's model of copyright
and licensing information, which I'd recommend here as well if I
didn't recommend removing rights here in the first place.  You see, I
don't think this is provenance's turf -- it's not in W3C PROV either.
What use case did you have in mind for that?

(j) Table 1, "attributes of the Entity class": if W3C PROV calls the
description "description", and most everything else in the VO has
"description", is there any deep reason you're using "annotation"?
What would break if you used "description", too?

(k) Table 1, "attributes of the Entity class": in the caption you offer
"url" as a "project-specific attribute" -- how would that be
different from the standard "location" attribute?  What should a
client do if there is both url and location?

(l) Sect. 2.1.2 cites the "Dataset Metadata Model" -- since DatasetDM has
a large overlap with ProvDM, and DatasetDM hasn't seen activity since
March 2016, I'd rather not reference it here (as it says in opening
material of WDs: 'It is inappropriate to use IVOA Working Drafts as
reference materials or to cite them as other than "work in progress".').
My hope is still that once ProvDM is there we can perhaps create a
version of DatasetDM with a clear separation of concerns with ProvDM.
If that happens, we'll be happy if we've kept recursive
dependencies at a minimum here.  A similar argument applies to 2.1.6.

(m) Sect 2.1.4 Activity -- what's the rationale for making startTime and
endTime mandatory?  Is there actually software that would become more
complex if it couldn't rely on these?  As an occasional user of provenance
information, I have to say time was one of the processing attributes
I've used less often (compared to, say, a description or the parameters
of the processing step).

(n) Sect 2.1.4 Activity -- I'm very skeptical of the "status" attribute.
Do you really want to record failed activities?  If so, at least
precisely define what you can have in status and define what it's for
(a use case in section 1.1 would also be helpful).  As a cautionary
tale, the Registry lets people say that Resources can be active,
inactive or deleted (in addition to the sensible deleted flag on the
OAI-PMH level).  Few VOResource features have wreaked more havoc,
while really giving one nothing over what OAI-PMH already has.  It's
really much safer if you say "if it broke, don't advertise it".

(o) Sect. 2.1.5 Used/@time, WasGeneratedBy/@time -- are there really
important use cases in which these couldn't be replaced by the
activity's startTime and endTime (operationally, not conceptually)?
Again, each extra feature puts a burden on the implementors, and I
have a hard time imagining use cases in which this granularity would
be necessary (if there are, you should really put them into Sect.
1.1).

(p) 2.1.6 Agent, WasAttributedTo/@role.  Rather than provide an "e.g."
table of terms in the document, why don't you create a vocabulary
right away?  There's nice tooling for this -- just ask me if
interested.  But I'll admit right away I'm not terribly happy with
the list of terms as it stands now -- if you look at the DataCite
metadata kernel (https://support.datacite.org/docs/schema-40),
contributorType, there are many overlaps with your list -- can't you
re-use/reference what DataCite has?  You see, it would suck if I had
to introduce some static mapping between my DataCite metadata and
provenance metadata I may need to write somewhere.

(q) Extended Model.  I admit to not having reviewed it.  I'd strongly
vote for having the core model put into REC first and only then going
for the extended model.

My impression is it's difficult enough to get core right.

As ProvDM-Core is taken up, we can figure out what else we need and
what might already be covered sufficiently well by core.  Which of
your use cases would you have to drop until something like the
extended metadata were standardised?

(r) Serialisation, Introduction: 'For FITS files, a provenance extension
called "PROVENANCE" could be added which contains provenance information
of the activities that generated the FITS file.'  Please let's avoid
subjunctive language in specs -- it helps nobody ("should I implement
this, now, or shouldn't I?").  

Either say "To include provenance information into a FITS file,
generate a PROV-N string and write it as the array of a PROVENANCE
extension (BITPIX=8)" (or whatever) or don't say anything at all.
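
For concreteness, the first option could look something like the
sketch below: plain Python, standard library only, building the bytes
of an extension named PROVENANCE with BITPIX=8 whose data array is the
PROV-N string.  The choice of an IMAGE extension, the function names,
and the restriction to ASCII are assumptions of this sketch, not
something the draft prescribes.

```python
BLOCK = 2880  # FITS logical record length in bytes


def card(keyword, value):
    """Format one 80-character FITS header card (fixed format)."""
    if isinstance(value, str):
        # Strings are quoted and padded to at least 8 characters.
        field = "{:<20}".format("'{:<8}'".format(value))
    else:
        # Integers are right-justified so that they end in column 30.
        field = "{:>20}".format(value)
    return "{:<8}= {}".format(keyword, field).ljust(80)


def provenance_hdu(prov_n):
    """Return the bytes of an IMAGE extension holding prov_n as bytes.

    Assumes prov_n is pure ASCII; anything else raises UnicodeEncodeError.
    """
    data = prov_n.encode("ascii")
    cards = [
        card("XTENSION", "IMAGE"),
        card("BITPIX", 8),
        card("NAXIS", 1),
        card("NAXIS1", len(data)),
        card("PCOUNT", 0),
        card("GCOUNT", 1),
        card("EXTNAME", "PROVENANCE"),
        "END".ljust(80),
    ]
    header = "".join(cards).encode("ascii")
    header += b" " * (-len(header) % BLOCK)  # header is padded with blanks
    data += b"\0" * (-len(data) % BLOCK)     # data is padded with zeros
    return header + data


if __name__ == "__main__":
    hdu = provenance_hdu("document\nendDocument\n")
    print(len(hdu), "bytes")
```

The returned bytes would be appended to a FITS file that already has a
primary HDU; the point is only that the spec can state this in one
sentence an implementor can act on.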

(s) Sect 3.3, "VOTable Format" -- as I said in my last review, I don't
see what purpose this VOTable serialisation serves.  At least "emphasize
the compatibility" is far too weak a reason for putting something into a
standard in my book.  Remember, people have to implement this stuff.

I'm not arguing against a relational mapping of your model, but that
needs to be defined much more carefully (presumably in ProvTAP, then).
I strongly vote to remove the entire section 3.3; but if you don't
remove it, it needs to explain *much* better what to do where (and why).

(t) Sect 3.4 "Description classes for web services" -- it's a cute idea,
but it's so far from provenance that it really doesn't belong in this
specification.  If you think there's a use case for this, please
transport this material into the DataLink specification (there's going
to be an update for it anyway fairly soon).  Nobody will look for
material like this in the documentation for the provenance DM.

(u) Sect 4 "Accessing provenance information" -- my advice is to strike
this section and integrate what little material there still is in it into
the introduction.

(v) Appendix A "Examples" -- wouldn't it be enough to just show (perhaps
an abbreviated rendition of) the PROV-N example and tell people how to
use standard software to get to the PROV-JSON one?  As to VOTable, see
above.  

Note that ivoatex also has an auxiliaryurl macro that you can use to
deliver example files without having to include them verbatim in the
document (see ivoatexDoc).  

Apart from reducing the scariness of the document (shaving off 10
pages downgrades it from OMG-60-pages! to Oh-dang-50-pages, which
probably helps adoption a lot), shortening the examples section to
what humans actually want or need to see probably saves a few trees,
too -- IVOA documents are printed occasionally...

(w) Appendix B "Links to other DMs" -- oh my.  "When delivering the data on
request, the serialized versions can be adjusted to the corresponding
notation." -- excuse me, but that won't work.  If I get a request, how am
I to know if I should include ProvDM or DatasetDM metadata?  What
technical reason should there be to distinguish between the two?

No, I'm sorry, but we simply have to clean up our act.  There needs
to be at most one model per piece of reality.  DatasetDM fortunately
isn't REC yet -- it still can be re-written to use your classes where
appropriate.

With SimDM I'd not be worried too much -- people doing it probably won't
bother with ProvDM or much else anyway.  I suppose it'd be all right to
just say in the introduction something like "For historical reasons,
SimDM has its own rendition of provenance; we make no effort to
reconcile the current and W3C's efforts with it".

I'm also not convinced that the appendix on UWS (B.3) couldn't be shortened
into a paragraph in the introduction; that, of course, is even easier if
we keep to the core model and thus won't have to explain about
Parameter.

B.4, finally, is too much of a "could be further developed" thing.  I'm
always advocating keeping promises for the future out of standard
documents unless they actually help to keep implementors from making
false assumptions.  This doesn't seem to be the case here.  Can we
remove it?

I've also done some very minor changes that I hope are uncontroversial
in rev. 5209.

           -- Markus