IVOA Provenance DM -RFC- answers to comments

Sun Nov 4 20:26:09 CET 2018

Hi Markus, hi DMers,

Thank you for your comments.

I agree that the DM list is best suited for discussing the specification,
since it reaches a wider audience.

I have tried to answer your various points inline below.

Cheers, Mireille

On 29/10/2018 at 17:26, Markus Demleitner wrote:
> Dear DM,
>
> On Wed, Oct 17, 2018 at 05:41:43PM +0200, Mathieu Servillat wrote:
>> We are requesting comments for the IVOA Provenance Data Model. The proposed
>> recommendation is available through its dedicated RFC web page and attached
>> to this email:
>>
>> http://wiki.ivoa.net/twiki/bin/view/IVOA/ProvenanceRFC
> I've posted the following to the Wiki, but I thought having it on the
> list might be more conducive to discussions, so here's what my
> thoughts were while reviewing this.
>
> TL;DR: let's only have the core model in 1.0.  We can always add
> extensions in 1.1.
We need the ActivityDescription and Parameter classes to be able to
search for a specific type of processing applied to the data.
Activity is only the process launched for the computation.
It does not hold the details of the methods, because those details are
factored out into the ActivityDescription class.
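To illustrate the factorisation, here is a minimal Python sketch. It is not taken from the specification: only the class names Activity and ActivityDescription come from the model, while the attributes (`name`, `type`), the helper `activities_of_type` and the sample values are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ActivityDescription:
    # Hypothetical attributes: what kind of processing this is.
    name: str
    type: str


@dataclass
class Activity:
    # One execution of a process; method details live in the description.
    id: str
    description: ActivityDescription


def activities_of_type(activities: List[Activity], wanted: str) -> List[Activity]:
    """Return the executions whose shared description has the wanted type."""
    return [a for a in activities if a.description.type == wanted]


# Two descriptions, shared by three executions (illustrative values only).
flat = ActivityDescription(name="flatfield_correction", type="Calibration")
stack = ActivityDescription(name="image_stacking", type="Reduction")
runs = [Activity("act1", flat), Activity("act2", stack), Activity("act3", flat)]

calibration_ids = [a.id for a in activities_of_type(runs, "Calibration")]
print(calibration_ids)
```

The search criterion is written once, on the description, and applies to every Activity that refers to it.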
>
> And now, with apologies for the longish mail:
>
> (a) Can I ask you to remove "IVOA Data Model Working Group" from the
> list of authors?  I don't think it helps anyone, but things like these
> are painful for computers trying to do something sensible with author
> lists and have stung me far too often.
Fine with me if this helps with the citation of standards documents.
>
> (b) Introduction: "In this document, we discuss a draft of an IVOA
> standard data model for describing...".  This obviously shouldn't make
> it into a REC.  I'd drop the sentence right now and start with:
> "According to \citet{std:W3CProvDM}, provenance is ...  For this
> document, we adopt that definition.".
Ok, agreed.
> (c) Minimum requirements: "We derived from our goals and use cases"
> doesn't seem to be quite true to me -- e.g., I don't see a use case for
> exchange of provenance information with non-IVOA software ("standard
> model") or even the links to other IVOA DMs.  I don't dispute these are
> sane requirements, of course.  Can't you just write: "We adopt the
> following requirements for the Prov DM"?
Good point.
What is not shown in the current document is the connection from a TAP
service (SSAP or ObsTAP)
to a Provenance service (Prov-SAP/Prov-TAP).
This was planned to be described in a DAL document.
>
> (d) In the requirements, I'm not terribly happy about "if applicable"
> and friends.  Can't you, for instance, say just *which* activities are
> exempt from having to have input entities?  Sure, if that gets too
> verbose, it's counter-productive, but perhaps a few words can already go
> a long way towards making the requirements a bit more precise?
>
> (e) "Activities may point to output entities." -- why just "may"?  What
> purpose could an output-less activity serve?
>
> (f) "Entities, Activities and Agents [...] should have persistent
> identifiers." I wouldn't do this -- many entities are fairly ephemeral,
> and even recommending to obtain a DOI for, say, a flatfield is, I think,
> going much too far.  Similarly, not everyone may want to have an ORCID
> or spread it in a provenance database (and I'm not getting started on
> the GDPR here).  And no, "it's optional" doesn't invalidate that
> point: if it's a SHOULD a tool would still drop warnings if your
> flatfield doesn't have a DOI, and that can very well hide actual
> problems.  Can't we just strike any language on PIDs here?
OK, unique identifiers would be more appropriate.
>
> (g) Fig. 3, "main core classes".  I'm still unconvinced the
> wasDerivedFrom and wasInformedBy relations are a good idea in our
> context.  I realise they are shortcuts and thus might seem convenient
> for people *generating* provenance instances.  However, many more people
> will consume them.  To them, every feature you add is extra work, and
> they'd probably have to de-serialise your shortcuts into null activities
> or null entities.  Which they won't appreciate.
>
> Also, since you could just as well generate these null entities or
> activities yourself (i.e., in your provenance instances), these two
> additional relationships introduce multiple ways to represent the
> same thing.  That's always an indication for a feature that will lead
> to headache later.
>
> So, let me plead again: Are the shortcuts *really* so valuable to you
> that it's worth burdening our implementors with them?
The wasDerivedFrom relation is a straightforward link when you want to
list the progenitor entities of one or several datasets.
In the triplestore implementation, for instance, it really speeds up the
search.
In the relational DB it avoids table joins.
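As a rough illustration of the relational case, the sketch below uses Python's sqlite3 module with an in-memory database. The table and column names and the sample rows are assumptions chosen for this example; they are not the ProvTAP schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical relation tables, for illustration only.
    CREATE TABLE used (activity_id TEXT, entity_id TEXT);
    CREATE TABLE was_generated_by (entity_id TEXT, activity_id TEXT);
    CREATE TABLE was_derived_from (generated_entity TEXT, used_entity TEXT);

    INSERT INTO used VALUES ('a1', 'raw'), ('a1', 'flat');
    INSERT INTO was_generated_by VALUES ('calibrated', 'a1');
    INSERT INTO was_derived_from VALUES ('calibrated', 'raw'),
                                        ('calibrated', 'flat');
""")

# Without the shortcut: go through the generating activity (one join).
via_join = con.execute(
    """SELECT u.entity_id
       FROM was_generated_by AS g
       JOIN used AS u ON u.activity_id = g.activity_id
       WHERE g.entity_id = ?
       ORDER BY u.entity_id""",
    ("calibrated",),
).fetchall()

# With the shortcut: a single-table lookup.
via_shortcut = con.execute(
    """SELECT used_entity
       FROM was_derived_from
       WHERE generated_entity = ?
       ORDER BY used_entity""",
    ("calibrated",),
).fetchall()

print(via_join, via_shortcut)
```

Both queries list the same progenitors here; the second one reads a single table.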

>
> I also don't find too convincing the rationale for wasDerivedFrom on
> p. 14, "If there is more than one input and more than one output to
> an activity, it is not clear which entity was derived from which".
> If you have an activity with multiple inputs and outputs, it stands
> to reason that all inputs influence all outputs, so there's nothing
> for wasDerivedFrom to annotate.  If there's distinct, unrelated
> groups of inputs and outputs then you really have two activities and
> you should describe them as such rather than hack around the
> deficient description.
>
> Similarly for the "deemed to be not important enough to be recorded in a
> pipeline" on wasInformedBy.  The overhead of introducing an Entity is
> really not high (unless of course you require persistent identifiers for
> them...).  And nothing is so insignificant that a few words of
> description couldn't come in handy when someone reads a provenance
> graph.
>
> And then "state that an activity communicates with another"... hm --
> that's not provenance, that's activity description ("workflow"), no?
>
> (i) Table 1, "attributes of the Entity class": From my Registry
> experience, "rights" as specified here has been profoundly useless (in
> 10 years of having it in the Registry nobody has used it as designed);
> in VOResource 1.1 we therefore moved to DataCite's model of copyright
> and licensing information, which I'd recommend here as well if I
> didn't recommend removing rights here in the first place.  You see, I
> don't think this is provenance's turf -- it's not in W3C PROV either.
> What use case did you have in mind for that?
Agreed: rights belong to the dataset involved when we manipulate it. It
is a property we can attach to the Entity represented in Provenance.
However, if an Entity is only a value, then there is no dataset attached,
and then rights can apply.
> (j) Table 1, "attributes of the Entity class": if W3C PROV calls the
> description "description", and most everything else in the VO has
> "description", is there any deep reason you're using "annotation"?
> What would break if you used "description", too?
I agree, if this makes things easier to recognise between the W3C and IVOA
views on Provenance.
> (k) Table 1, "attributes of the Entity class": in the caption you offer
> "url" as a "project-specific attribute" -- how would that be
> different from the standard "location" attribute?  What should a
> client do if there is both url and location?
>
> (l) Sect. 2.1.2 cites the "Dataset Metadata Model" -- since DatasetDM has
> a large overlap with ProvDM, and DatasetDM hasn't seen activity since
> March 2016, I'd rather not reference it here (as it says in opening
> material of WDs: 'It is inappropriate to use IVOA Working Drafts as
> reference materials or to cite them as other than "work in progress".').
> My hope is still that once ProvDM is there we can perhaps create a
> version of DatasetDM with a clear separation of concerns with ProvDM.
> If that happens, we'll be happy if we've kept recursive
> dependencies at a minimum here.  A similar argument applies to 2.1.6.
OK, then for the current version, Prov 1.0, we can cite and relate to the
ObsCore DM standard.
>
> (m) Sect 2.1.4 Activity -- what's the rationale for making startTime and
> endTime mandatory?  Is there actually software that would become more
> complex if it couldn't rely on these?  As an occasional user of provenance
> information, I have to say time was one of the processing attributes
> I've used less often (compared to, say a description or the parameters
> of the processing step).
The time stamps are a way to check the order of Activities and chain
them. 'wasInformedBy' is only an optional relation and is not
required.
The time stamps can be null if they have not been recorded.
They also make it possible to search for long or short activities and,
for instance, to reorganize some of the re-computing steps.
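A small Python sketch of both uses follows. The record layout, the sample values and the one-hour threshold are assumptions for illustration; only the attribute names startTime and endTime come from the model.

```python
from datetime import datetime

# Illustrative Activity records; None stands for a time that was not recorded.
activities = [
    {"id": "stack", "startTime": "2018-11-04T10:30:00",
     "endTime": "2018-11-04T12:45:00"},
    {"id": "calibrate", "startTime": "2018-11-04T10:00:00",
     "endTime": "2018-11-04T10:20:00"},
    {"id": "unrecorded", "startTime": None, "endTime": None},
]


def duration_seconds(act):
    """Duration of an activity in seconds, or None if a time is missing."""
    if act["startTime"] is None or act["endTime"] is None:
        return None
    start = datetime.fromisoformat(act["startTime"])
    end = datetime.fromisoformat(act["endTime"])
    return (end - start).total_seconds()


# Chain the activities that carry a start time, in chronological order.
timed = [a for a in activities if a["startTime"] is not None]
ordered_ids = [
    a["id"]
    for a in sorted(timed, key=lambda a: datetime.fromisoformat(a["startTime"]))
]

# Select "long" activities; the one-hour threshold is arbitrary.
long_ids = [a["id"] for a in activities if (duration_seconds(a) or 0) > 3600]

print(ordered_ids, long_ids)
```

Activities without recorded times are simply left out of the ordering and of the duration search.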
>
> (n) Sect 2.1.4 Activity -- I'm very skeptical of the "status" attribute.
> Do you really want to record failed activities?  If so, at least
> precisely define what you can have in status and define what it's for
> (a use case in section 1.1 would also be helpful).  As a cautionary
> tale, the Registry lets people say that Resources can be active,
> inactive or deleted (in addition to the sensible deleted flag on the
> OAI-PMH level).  Few VOResource features have wreaked more havoc,
> while really giving one nothing over what OAI-PMH already has.  It's
> really much safer if you say "if it broke, don't advertise it".
>
> (o) Sect. 2.1.5 Used/@time, WasGeneratedBy/@time -- are there really
> important use cases in which these couldn't be replaced by the
> activity's startTime and endTime (operationally, not conceptually)?
> Again, each extra feature puts a burden on the implementors, and I
> have a hard time imagining use cases in which this granularity would
> be necessary (if there are, you should really put them into Sect.
> 1.1).
Agreed: the main time information is stored in the Activity
startTime/endTime.
When we are interested in an Entity/Data item, we can check the
corresponding time details on the associated dataset, given for instance
by the ObsCore view, which already contains the date of creation, etc.
>
> (p) 2.1.6 Agent, WasAttributedTo/@role.  Rather than provide an "e.g."
> table of terms in the document, why don't you create a vocabulary
> right away?  There's nice tooling for this -- just ask me if
> interested.  But I'll admit right away I'm not terribly happy with
> the list of terms as it stands now -- if you look at the DataCite
> metadata kernel (https://support.datacite.org/docs/schema-40),
> contributorType, there are many overlaps with your list -- can't you
> re-use/reference what DataCite has?  You see, it would suck if I had
> to introduce some static mapping between my DataCite metadata and
> provenance metadata I may need to write somewhere.
Good point. We need to homogenize this and see how to reuse/extend the
DataCite vocabulary in our implementations.
The IVOA vocabulary page can publish this list of terms, as for the
DataLink vocabulary.
>
> (q) Extended Model.  I admit to not having reviewed it.  I'd strongly
> vote for having the core model put into REC first and only then going
> for the extended model.
>
> My impression is it's difficult enough to get core right.
>
> As ProvDM-Core is taken up, we can figure out what else we need and
> what might already be covered sufficiently well by core.  Which of
> your use cases would you have to drop until something like the
> extended metadata were standardised?
>
> (r) Serialisation, Introduction: 'For FITS files, a provenance extension
> called "PROVENANCE" could be added which contains provenance information
> of the activities that generated the FITS file.'  Please let's avoid
> subjunctive language in specs -- it helps nobody ("should I implement
> this, now, or shouldn't I?").
>
> Either say "To include provenance information into a FITS file,
> generate a PROV-N string and write it as the array of a PROVENANCE
> extension (BITPIX=8)" (or whatever) or don't say anything at all.
This FITS serialisation flavor is currently designed for the SVOM pipeline.
It will be detailed in the Note for Implementation of IVOA PROV-DM and
in SVOM-related documents currently in preparation.
>
> (s) Sect 3.3, "VOTable Format" -- as I said in my last review, I don't
> see what purpose this VOTable serialisation serves.  At least "emphasize
> the compatibility" is far too weak a reason for putting something into a
> standard in my book.  Remember, people have to implement this stuff.
>
> I'm not arguing against a relational mapping of your model, but that
> needs to be defined much more carefully (presumably in ProvTAP, then).
> I strongly vote to remove the entire section 3.3; but if you don't
> remove it, it needs to explain *much* better what to do where (and why).
>
> (t) Sect 3.4 "Description classes for web services" -- it's a cute idea,
> but it's so far from provenance that it really doesn't belong in this
> specification.  If you think there's a use case for this, please
> transport this material into the DataLink specification (there's going
> to be an update for it anyway fairly soon).  Nobody will look for
> material like this in the documentation for the provenance DM.
>
> (u) Sect 4 "Accessing provenance information" -- my advice is to strike
> this section and integrate what little material there still is in it into
> the introduction.
>
> (v) Appendix A "Examples" -- wouldn't it be enough to just show (perhaps
> an abbreviated rendition of) the PROV-N example and tell people how to
> use standard software to get to the PROV-JSON one?  As to VOTable, see
> above.
>
> Note that ivoatex also has an auxiliaryurl macro that you can use to
> deliver example files without having to include them verbatim in the
> document (see ivoatexDoc).
>
> Apart from reducing the scariness of the document (shaving off 10
> pages downgrades it from OMG-60-pages! to Oh-dang-50-pages, which
> probably helps adoption a lot), shortening the examples section to
> what humans actually want or need to see probably saves a few trees,
> too -- IVOA documents are printed occasionally...
>
> (w) Appendix B "Links to other DMs" -- oh my.  "When delivering the data on
> request, the serialized versions can be adjusted to the corresponding
> notation." -- excuse me, but that won't work.  If I get a request, how am
> I to know if I should include ProvDM or DatasetDM metadata?  What
> technical reason should there be to distinguish between the two?
This is an informative section showing that concepts in the Provenance DM
correspond to some extent to concepts in already existing DMs.
It gives credit to existing work and guides readers who may already
be familiar with those existing data models.
> No, I'm sorry, but we simply have to clean up our act.  There needs
> to be at most one model per piece of reality.  DatasetDM fortunately
> isn't REC yet -- it still can be re-written to use your classes where
> appropriate.
>
> With SimDM I'd not be worried too much -- people doing it probably won't
> bother with ProvDM or much else anyway.  I suppose it'd be all right to
> just say in the introduction something like "For historical reasons,
> SimDM has its own rendition of provenance; we make no effort of
> reconciling the current and W3C's efforts with it".
>
> I'm also not convinced that appendix on UWS (B.3) couldn't be shortened
> into a paragraph in the introduction; that, of course, is even easier if
> we keep to the core model and thus won't have to explain about
> Parameter.
>
> B.4, finally, is too much of a "could be further developed" thing.  I'm
> always advocating keeping promises for the future out of standard
> documents unless they actually help to keep implementors from making
> false assumptions.  This doesn't seem to be the case here.  Can we
> remove it?
>
>
> I've also done some very minor changes that I hope are uncontroversial
> in rev. 5209.
Please refrain from editing the document.
We have a dedicated editor for it and the document is open for review 
and comments.
It will not speed up the process if everyone changes the document 
according to her/his own view.
Reaching consensus first and then having the editor apply the updates is
more efficient.

>
>             -- Markus

-- 
--
Mireille Louys,  MCF (Associate Professor)
CDS				IPSEO, Images, Laboratoire Icube
Observatoire de Strasbourg	Telecom Physique Strasbourg
11 rue de l'Université		300, Bd Sebastien Brandt CS 10413
F- 67000-STRASBOURG		F-67412 ILLKIRCH Cedex
Tel: +33 3 68 85 24 34