IVOA Provenance DM - request for comments

Anastasia Galkin agalkin at aip.de
Mon Nov 5 12:58:00 CET 2018


Dear Markus, dear all,

A short comment on the complexity of the extended model. Concerns have 
already been raised that the model has become too complex to handle 
reasonably. The provenance model must be simple enough for data providers 
to implement in the first place. But it should also be easy to modify, as 
changes occur a lot. It should also be easy and clear "what to ask for" 
from the client side. That is why I fully agree with Markus that the core 
model should be settled first, in version 1.0. The core model already 
covers all the provenance use cases (see the discussion on parameters and 
descriptions), is stable, and is easy to handle.

As I won't be there for the IVOA, I wish you all a safe trip and 
fruitful discussions.

Best,
Anastasia

On 10/29/2018 05:26 PM, Markus Demleitner wrote:
> Dear DM,
>
> On Wed, Oct 17, 2018 at 05:41:43PM +0200, Mathieu Servillat wrote:
>> We are requesting comments for the IVOA Provenance Data Model. The proposed
>> recommendation is available through its dedicated RFC web page and attached
>> to this email:
>>
>> http://wiki.ivoa.net/twiki/bin/view/IVOA/ProvenanceRFC
> I've posted the following to the Wiki, but I thought having it on the
> list might be more conducive to discussions, so here's what my
> thoughts were while reviewing this.
>
> TL;DR: let's only have the core model in 1.0.  We can always add
> extensions in 1.1.
>
> And now, with apologies for the longish mail:
>
> (a) Can I ask you to remove "IVOA Data Model Working Group" from the
> list of authors?  I don't think it helps anyone, but things like these
> are painful for computers trying to do something sensible with author
> lists and have stung me far too often.
>
> (b) Introduction: "In this document, we discuss a draft of an IVOA
> standard data model for describing...".  This obviously shouldn't make
> it into a REC.  I'd drop the sentence right now and start with:
> "According to \citet{std:W3CProvDM}, provenance is ...  For this
> document, we adopt that definition.".
>
> (c) Minimum requirements: "We derived from our goals and use cases"
> doesn't seem to be quite true to me -- e.g., I don't see a use case for
> exchange of provenance information with non-IVOA software ("standard
> model") or even the links to other IVOA DMs.  I don't dispute these are
> sane requirements, of course.  Can't you just write: "We adopt the
> following requirements for the Prov DM"?
>
> (d) In the requirements, I'm not terribly happy about "if applicable"
> and friends.  Can't you, for instance, say just *which* activities are
> exempt from having to have input entities?  Sure, if that gets too
> verbose, it's counter-productive, but perhaps a few words can already go
> a long way towards making the requirements a bit more precise?
>
> (e) "Activities may point to output entities." -- why just "may"?  What
> purpose could an output-less activity serve?
>
> (f) "Entities, Activities and Agents [...] should have persistent
> identifiers." I wouldn't do this -- many entities are fairly ephemeral,
> and even recommending to obtain a DOI for, say, a flatfield is, I think,
> going much too far.  Similarly, not everyone may want to have an ORCID
> or spread it in a provenance database (and I'm not getting started on
> the GDPR here).  And no, "it's optional" doesn't invalidate that
> point: if it's a SHOULD, a tool would still drop warnings if your
> flatfield doesn't have a DOI, and that can very well hide actual
> problems.  Can't we just strike any language on PIDs here?
>
> (g) Fig. 3, "main core classes".  I'm still unconvinced the
> wasDerivedFrom and wasInformedBy relations are a good idea in our
> context.  I realise they are shortcuts and thus might seem convenient
> for people *generating* provenance instances.  However, many more people
> will consume them.  To them, every feature you add is extra work, and
> they'd probably have to de-serialise your shortcuts into null activities
> or null entities.  Which they won't appreciate.
>
> Also, since you could just as well generate these null entities or
> activities yourself (i.e., in your provenance instances), these two
> additional relationships introduce multiple ways to represent the
> same thing.  That's always an indication of a feature that will lead
> to headaches later.
>
> So, let me plead again: Are the shortcuts *really* so valuable to you
> that it's worth burdening our implementors with them?
>
> I also don't find too convincing the rationale for wasDerivedFrom on
> p. 14, "If there is more than one input and more than one output to
> an activity, it is not clear which entity was derived from which".
> If you have an activity with multiple inputs and outputs, it stands
> to reason that all inputs influence all outputs, so there's nothing
> for wasDerivedFrom to annotate.  If there's distinct, unrelated
> groups of inputs and outputs then you really have two activities and
> you should describe them as such rather than hack around the
> deficient description.
>
> Similarly for the "deemed to be not important enough to be recorded in a
> pipeline" on wasInformedBy.  The overhead of introducing an Entity is
> really not high (unless of course you require persistent identifiers for
> them...).  And nothing is so insignificant that a few words of
> description couldn't come in handy when someone reads a provenance
> graph.
>
> And then "state that an activity communicates with another"... hm --
> that's not provenance, that's activity description ("workflow"), no?
>
> (i) Table 1, "attributes of the Entity class": From my Registry
> experience, "rights" as specified here has been profoundly useless (in
> 10 years of having it in the Registry nobody has used it as designed);
> in VOResource 1.1 we therefore moved to DataCite's model of copyright
> and licensing information, which I'd recommend here as well if I
> didn't recommend removing rights here in the first place.  You see, I
> don't think this is provenance's turf -- it's not in W3C PROV either.
> What use case did you have in mind for that?
>
> (j) Table 1, "attributes of the Entity class": if W3C PROV calls the
> description "description", and most everything else in the VO has
> "description", is there any deep reason you're using "annotation"?
> What would break if you used "description", too?
>
> (k) Table 1, "attributes of the Entity class": in the caption you offer
> "url" as a "project-specific attribute" -- how would that be
> different from the standard "location" attribute?  What should a
> client do if there is both url and location?
>
> (l) Sect. 2.1.2 cites the "Dataset Metadata Model" -- since DatasetDM has
> a large overlap with ProvDM, and DatasetDM hasn't seen activity since
> March 2016, I'd rather not reference it here (as it says in opening
> material of WDs: 'It is inappropriate to use IVOA Working Drafts as
> reference materials or to cite them as other than "work in progress".').
> My hope is still that once ProvDM is there we can perhaps create a
> version of DatasetDM with a clear separation of concerns with ProvDM.
> If that happens, we'll be happy if we've kept recursive
> dependencies at a minimum here.  A similar argument applies to 2.1.6.
>
> (m) Sect 2.1.4 Activity -- what's the rationale for making startTime and
> endTime mandatory?  Is there actually software that would become more
> complex if it couldn't rely on these?  As an occasional user of provenance
> information, I have to say time was one of the processing attributes
> I've used less often (compared to, say, a description or the parameters
> of the processing step).
>
> (n) Sect 2.1.4 Activity -- I'm very skeptical of the "status" attribute.
> Do you really want to record failed activities?  If so, at least
> precisely define what you can have in status and define what it's for
> (a use case in section 1.1 would also be helpful).  As a cautionary
> tale, the Registry lets people say that Resources can be active,
> inactive or deleted (in addition to the sensible deleted flag on the
> OAI-PMH level).  Few VOResource features have wreaked more havoc,
> while really giving one nothing over what OAI-PMH already has.  It's
> really much safer if you say "if it broke, don't advertise it".
>
> (o) Sect. 2.1.5 Used/@time, WasGeneratedBy/@time -- are there really
> important use cases in which these couldn't be replaced by the
> activity's startTime and endTime (operationally, not conceptually)?
> Again, each extra feature puts a burden on the implementors, and I
> have a hard time imagining use cases in which this granularity would
> be necessary (if there are, you should really put them into Sect.
> 1.1).
>
> (p) 2.1.6 Agent, WasAttributedTo/@role.  Rather than provide an "e.g."
> table of terms in the document, why don't you create a vocabulary
> right away?  There's nice tooling for this -- just ask me if
> interested.  But I'll admit right away I'm not terribly happy with
> the list of terms as it stands now -- if you look at the DataCite
> metadata kernel (https://support.datacite.org/docs/schema-40),
> contributorType, there are many overlaps with your list -- can't you
> re-use/reference what DataCite has?  You see, it would suck if I had
> to introduce some static mapping between my DataCite metadata and
> provenance metadata I may need to write somewhere.
>
> (q) Extended Model.  I admit to not having reviewed it.  I'd strongly
> vote for having the core model put into REC first and only then going
> for the extended model.
>
> My impression is it's difficult enough to get core right.
>
> As ProvDM-Core is taken up, we can figure out what else we need and
> what might already be covered sufficiently well by core.  Which of
> your use cases would you have to drop until something like the
> extended metadata were standardised?
>
> (r) Serialisation, Introduction: 'For FITS files, a provenance extension
> called "PROVENANCE" could be added which contains provenance information
> of the activities that generated the FITS file.'  Please let's avoid
> subjunctive language in specs -- it helps nobody ("should I implement
> this, now, or shouldn't I?").
>
> Either say "To include provenance information into a FITS file,
> generate a PROV-N string and write it as the array of a PROVENANCE
> extension (BITPIX=8)" (or whatever) or don't say anything at all.
>
> (s) Sect 3.3, "VOTable Format" -- as I said in my last review, I don't
> see what purpose this VOTable serialisation serves.  At least "emphasize
> the compatibility" is far too weak a reason for putting something into a
> standard in my book.  Remember, people have to implement this stuff.
>
> I'm not arguing against a relational mapping of your model, but that
> needs to be defined much more carefully (presumably in ProvTAP, then).
> I strongly vote to remove the entire section 3.3; but if you don't
> remove it, it needs to explain *much* better what to do where (and why).
>
> (t) Sect 3.4 "Description classes for web services" -- it's a cute idea,
> but it's so far from provenance that it really doesn't belong in this
> specification.  If you think there's a use case for this, please
> transport this material into the DataLink specification (there's going
> to be an update for it anyway fairly soon).  Nobody will look for
> material like this in the documentation for the provenance DM.
>
> (u) Sect 4 "Accessing provenance information" -- my advice is to strike
> this section and integrate what little material there still is in it into
> the introduction.
>
> (v) Appendix A "Examples" -- wouldn't it be enough to just show (perhaps
> an abbreviated rendition of) the PROV-N example and tell people how to
> use standard software to get to the PROV-JSON one?  As to VOTable, see
> above.
>
> Note that ivoatex also has an auxiliaryurl macro that you can use to
> deliver example files without having to include them verbatim in the
> document (see ivoatexDoc).
>
> Apart from reducing the scariness of the document (shaving off 10
> pages downgrades it from OMG-60-pages! to Oh-dang-50-pages, which
> probably helps adoption a lot), shortening the examples section to
> what humans actually want or need to see probably saves a few trees,
> too -- IVOA documents are printed occasionally...
>
> (w) Appendix B "Links to other DMs" -- oh my.  "When delivering the data on
> request, the serialized versions can be adjusted to the corresponding
> notation." -- excuse me, but that won't work.  If I get a request, how am
> I to know if I should include ProvDM or DatasetDM metadata?  What
> technical reason should there be to distinguish between the two?
>
> No, I'm sorry, but we simply have to clean up our act.  There needs
> to be at most one model per piece of reality.  DatasetDM fortunately
> isn't REC yet -- it still can be re-written to use your classes where
> appropriate.
>
> With SimDM I'd not be worried too much -- people doing it probably won't
> bother with ProvDM or much else anyway.  I suppose it'd be all right to
> just say in the introduction something like "For historical reasons,
> SimDM has its own rendition of provenance; we make no effort of
> reconciling the current and W3C's efforts with it".
>
> I'm also not convinced that the appendix on UWS (B.3) couldn't be shortened
> into a paragraph in the introduction; that, of course, is even easier if
> we keep to the core model and thus won't have to explain about
> Parameter.
>
> B.4, finally, is too much of a "could be further developed" thing.  I'm
> always advocating keeping promises for the future out of standard
> documents unless they actually help to keep implementors from making
> false assumptions.  This doesn't seem to be the case here.  Can we
> remove it?
>
>
> I've also done some very minor changes that I hope are uncontroversial
> in rev. 5209.
>
>             -- Markus
>

-- 
-------------------------------------------------------
Anastasia Galkin
Supercomputing and E-Science

Email: agalkin at aip.de
Phone: +49 331 7499-685
-------------------------------------------------------
Leibniz-Institut für Astrophysik Potsdam (AIP)
An der Sternwarte 16, D-14482 Potsdam
Vorstand: Prof. Dr. Matthias Steinmetz, Matthias Winker
Stiftung bürgerlichen Rechts
Stiftungsverzeichnis Brandenburg: 26 742-00/7026
-------------------------------------------------------


