New ProvenanceDM working draft released

Tue Oct 10 11:56:42 CEST 2017

Hi DM,

On Fri, Sep 22, 2017 at 10:14:17AM +0200, Kristin Riebe wrote:
> hooray! The provenance work group has finished a new version of the
> ProvenanceDM working draft, see
> http://wiki.ivoa.net/internal/IVOA/ObservationProvenanceDataModel/WD-ProvenanceDM-1.0-20170921.pdf

Here is some feedback from a first reading of the Provenance WD.  It's a
bit much for one go, but it's a long document, too.  So, to start off, a

TL;DR: I think that...

* ...there should be no custom PROV-VOTable
* ...there should be no ProvDAL, just ProvTAP
* ...if we can have W3C-compatible serialisations, we shouldn't have
  slightly modified, incompatible ones.
* ...we really, *really*, REALLY should try to make sure that IVOA DMs
  use common types.  If ProvDM and DatasetDM talk about the same
  domain, they shouldn't use different terms and different classes.

So, on to the long version:

p. 6, Use case A: "image from catalog xxx"  -- why "from catalog?"  Why
not just say "this image"?  Oh, and PSA: Don't use the string "xxx" in
your documents unless you don't mind that stupid software (in use, e.g.,
in schools) will block your document.

p. 10, "auto-generated documentation of all classes...
https://volute..." -- while a WD might get away with this, at the latest
the PR needs to have all ancillary files in the document repository.
I'm happy to help working out some aids to make that easier in ivoatex
(e.g., by allowing versioned URLs) -- a VCS, anyway, is far too volatile
for links in standards to point there (unless the IVOA sanctioned first
the VCS and then some practices to ensure a certain amount of
stability).

p. 14, "For entities, we suggest..." -- stylistically, I'd much prefer
language that makes clear what's the normative content.  Here, that
might be "Entities have the attributes listed in..." or something like
that.  While I'm discussing style issues: you're writing "NOT" on p. 15,
and I'd much prefer the use of \emph (or, if you insist, \textbf) over
capitalisation to express emphasis.

p. 15 rights with values "public, restricted, internal" -- I realise
that provenance has very different use cases from Registry, but a
similar enumeration in VOResource turned out to be largely useless.  I
*think* what in practice would make much more sense is license URIs
(e.g., CC-0, CC-BY, etc, with perhaps some custom IVOA URIs for
proprietary and proprietary-unavailable data).

p. 17 "The information this [WasDerivedFrom] relation provides is
somewhat redundant..." -- this scares me.  I've not properly thought
through the relationship between WasDerivedFrom on the one side and
WasGeneratedBy and Used on the other.  But I'm worried if whatever
deficiencies the latter may have are fixed with something reeking of
optional.  Optional features very typically are close to useless to
clients simply because if they can't rely on it, basing code on it is
tricky; if whatever is optional provides some essential value (and
certainly, a make-style "flow graph" of the dependencies between the
entities is essential), it's an interoperability disaster waiting to
happen.  Are you absolutely sure you can't fix WasGeneratedBy/Used to
cover what WasDerivedFrom is designed to do and then drop
WasDerivedFrom?  If, on the other hand, it's just an optimisation in the
current spec already, I'd argue it shouldn't be part of the DM but
rather of an implementation.
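
To make the question concrete, here is a minimal Python sketch of how a
client could compute a make-style dependency graph from WasGeneratedBy
and Used alone.  The relation names are the draft's; the tuple layout,
the function name derive_dependencies and the example identifiers are
assumptions for illustration only.  Note that this over-approximates: it
treats every input of an activity as a possible source of every output
of that activity.

```python
from collections import defaultdict


def derive_dependencies(was_generated_by, used):
    """Return {entity: set of entities it possibly derives from}.

    was_generated_by: iterable of (entity_id, activity_id) pairs
    used:             iterable of (activity_id, entity_id) pairs
    (Both layouts are hypothetical, not taken from the draft.)
    """
    # Collect the inputs of each activity.
    inputs = defaultdict(set)
    for activity, entity in used:
        inputs[activity].add(entity)

    # Every output of an activity is assumed to depend on all its inputs.
    derived_from = defaultdict(set)
    for entity, activity in was_generated_by:
        derived_from[entity] |= inputs[activity]
    return dict(derived_from)


# Hypothetical example: two raw frames are stacked, then calibrated.
was_generated_by = [("stacked", "stacking"), ("calibrated", "calibration")]
used = [("stacking", "raw1"), ("stacking", "raw2"),
        ("calibration", "stacked"), ("calibration", "flat")]

deps = derive_dependencies(was_generated_by, used)
```

If an explicit WasDerivedFrom is needed to narrow down such a graph,
that would be an argument for it being more than an optimisation.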

p. 19f WasInformedBy vs. ActivityFlow -- Again, I'm a bit alarmed that
there are two "features" here that apparently serve the same purpose:
Hide intermediate entities.  We're not doing anyone a favour by enabling
a "choose what you like" approach.  I'd say we should pick one, and
since it seems to me the less ugly alternative, I'd go for ActivityFlow.

p. 26 value's type "(value dependent)"; then, ParameterDescription has a
datatype attribute not discussed in greater detail.  I'm not sure what
the type system envisioned here is.  In the interest of keeping the
total number of type systems and associated serialisation rules in the
VO low, I'd suggest adding arraysize and xtype to ParameterDescription
and say that parameter.value's value follows the serialisation rules of
VOTable TABLEDATA and DALI.
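
As a rough illustration of that suggestion, a Python sketch of a
ParameterDescription carrying datatype, arraysize and xtype, plus a
parser for a small subset of TABLEDATA-style literals (numeric arrays as
whitespace-separated tokens, char arrays as plain strings).  The class
layout, the parse_value helper and the example parameters are
assumptions, not something the draft defines, and xtype is carried along
without being interpreted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParameterDescription:
    # Attribute names mirror VOTable FIELD/PARAM; the class is hypothetical.
    name: str
    datatype: str
    arraysize: Optional[str] = None
    xtype: Optional[str] = None


# Only a handful of VOTable datatypes are handled in this sketch.
_SCALAR_PARSERS = {"short": int, "int": int, "long": int,
                   "float": float, "double": float}


def parse_value(desc, literal):
    """Parse a serialised parameter value according to its description."""
    if desc.datatype == "char":
        # Character arrays are kept as plain strings.
        return literal
    parser = _SCALAR_PARSERS[desc.datatype]
    if desc.arraysize is None:
        return parser(literal)
    # Numeric arrays: whitespace-separated tokens.
    return [parser(token) for token in literal.split()]


exptime = ParameterDescription("exptime", "double")
crpix = ParameterDescription("crpix", "double", arraysize="2")
obsdate = ParameterDescription("obsdate", "char", arraysize="*",
                               xtype="timestamp")

exptime_value = parse_value(exptime, "30.5")
crpix_value = parse_value(crpix, "512.0 256.5")
obsdate_value = parse_value(obsdate, "2017-09-21T00:00:00")
```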

Table 16, Agent Roles: I like the all-lower-case approach to the role
labels, and therefore I'm dismayed about "coordinator/PI".  Not only
upper case suddenly, but also a slash, the one character anathema to
unix file names!  Since I guess these would end up in fragment
identifiers in URIs eventually, that might not be as bad as it looks at
first, but still: Please just strike the "/PI" and save us a lot of
potential later headache.

Table 17, the mapping between ProvenanceDM and DatasetDM labels.
Frankly, this makes me weep.  Is it *really* not possible to use a
common nomenclature, or rather common types, even between IVOA DMs?  I
have a hard time imagining how I'd implement this, in particular with a
view to "This list is not complete".  And no, "the serialised versions
can be adjusted to the corresponding notation" is not reassuring at all.
How on earth is a piece of software supposed to know what "corresponds"
to a given request?  My impression: For interoperability with the wider
world (W3C), DatasetDM should budge and just use ProvDM classes
wherever possible.

Table 19 -- oh bother.  Any chance for a SimDM 2.0 that avoids the
duplication of classes?  As far as I can see, there's not terribly much
SimDM 1.0 content out there yet, so perhaps a version 2.0 wouldn't break
too much at this point, would it?

Sect. 4.3, PROV-VOTable -- I'm *really* unhappy that a VO-DML-defined
data model defines a "custom" serialisation while the authors of the
"mapping" standard (that's supposed to define how such DMs are to be
represented in VOTable in general) work on something that looks entirely
different.  Everyone will eventually regret that.  So, please, *please*
don't have sect. 4.3; instead, help out the VO-DML mapping folks and
perhaps fill in any parameters left open in there in a version 1.1 of
ProvDM.

Sect. 4.2 vs. sect. 4.5 -- I don't quite understand what purpose the
"almost-W3C" serialisations in 4.2 serve -- why not go for the 4.5
solution right away?  Ok, the need for a bit of mapping is a slight
uglification, but the total pollution of the IT environment is a lot
less with this little blemish than if you have a similar but
incompatible format *and* the 4.5 thing *on top*.

Sect 5 -- I'm not sure if I'm terribly happy to have an access protocol
folded into this already fairly long document (on the other hand: the
fewer standards the better).  But one thing I'm sure about is that you
don't want two access protocols.  It'll be hard enough to gain uptake
for one of them.  If you let people choose which one to implement, in
all likelihood half will use one protocol, and the other half the other,
thus at least doubling the implementation effort for client authors.  My
take: There are enough TAP engines available that there is no reason not
to just use TAP plus a canonical relational DM representation,
preferably built according to standard, VO-DML rules.

p. 43 -- "note that the relations wasDerivedFrom..." -- I'll mention
that, to me, the mere fact that these "optimisations" bite you in
protocol design is another indication that they shouldn't be put into
the standard in the first place.

Sect 5.3 -- While, as I said, I think having a relational mapping of the
model is an excellent idea, sect. 5.3 is not enough for a REC.  To make
this implementable, you'd at least have to say

* what tables, what columns, what column types make up the model -- I'd 
  hence prefer if appendix B got a bit more comments and went here 
  (which is not a problem if you dump PROV-VOTable)
* perhaps how to discover the ProvTAP service for a given PubDID or
  other identifier?
* do these guys live in a schema?  Any schema, perhaps, so a single TAP
  service can keep multiple, independent provenance stores?
* giving a data model identifier for TAPRegExt indicates there can only
  be one provenance store per TAP service -- is that really what you
  want?  My recommendation for the future is to use URIs in utype
  attributes of schema or table elements.

And again, I think as much as possible of this should come out of a
defined process valid for all VO-DML DMs.
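
To show the kind of thing a relational mapping would have to pin down,
here is a small Python sketch using an in-memory SQLite database as a
stand-in for a TAP service.  All table names, column names and
identifiers are made up for illustration; the actual ones would have to
come from the standard (or, better, from general VO-DML mapping rules),
and a real ProvTAP query would be ADQL against a schema-qualified table.

```python
import sqlite3

# In-memory database standing in for a provenance store behind TAP.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE activity (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE used (activity_id TEXT, entity_id TEXT);
    CREATE TABLE was_generated_by (entity_id TEXT, activity_id TEXT);

    INSERT INTO entity VALUES
        ('ivo://example/raw1', 'raw frame'),
        ('ivo://example/flat1', 'flat field'),
        ('ivo://example/img1', 'reduced image');
    INSERT INTO activity VALUES ('ivo://example/act1', 'reduction');
    INSERT INTO used VALUES
        ('ivo://example/act1', 'ivo://example/raw1'),
        ('ivo://example/act1', 'ivo://example/flat1');
    INSERT INTO was_generated_by VALUES
        ('ivo://example/img1', 'ivo://example/act1');
""")


def progenitors(conn, entity_id):
    """Return ids of entities used by the activity that generated entity_id."""
    rows = conn.execute(
        """
        SELECT u.entity_id
        FROM was_generated_by AS g
        JOIN used AS u ON u.activity_id = g.activity_id
        WHERE g.entity_id = ?
        ORDER BY u.entity_id
        """,
        (entity_id,),
    )
    return [row[0] for row in rows]


inputs_of_img1 = progenitors(conn, "ivo://example/img1")
```

A client can only write such a join once if the table and column names
are fixed by the standard, which is the point of the list above.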

Sect. 6 -- That content doesn't seem to be REC-level material to me; if
you write another, note-type, document anyway, why not push it there?

General point I:  Currently, my main use case would be: 

  In a VOTable containing a photometric time series, declare the column
  in which the source images are linked

This would be immediately actionable by clients and, I think, obviously
useful (people configure such functionality by hand in TOPCAT right now,
but of course that only scales so far when you start to automate things
or venture into less familiar data).  From the current document I can't
really see how to even start with this.  Granted, I've not put this
use case into your wiki in time.  But I think if you just employed
standard VO-DML mapping (and helped bring that out of the door instead
of inventing a custom solution), that use case would, more or less, fall
into your laps.

General point II: An important motivation for modeling provenance in the
first place was a standardisation of the plethora of FITS keywords
describing ambient conditions and perhaps even instrument telemetry,
also with a view to go beyond FITS at some point.  While I can see why
the current document cannot provide the necessary vocabularies, I think
it would be great if it provided clear guidance as to what would be
required to make it happen (i.e., presumably discuss one or more
vocabularies).

          -- Markus