IVOA Provenance DM -RFC- answers to comments

Mathieu Servillat mathieu.servillat at obspm.fr
Wed Nov 21 17:45:56 CET 2018


Hi Markus and all,

first, a note to all: the RFC Review Period has now ended. After
several meetings with the DM chairs during the College Park Interop, the
document will now evolve for another round to incorporate the comments,
based on a more generic model.

As for your points, I will certainly keep in mind that additional features
imply additional implementation work, so new features must be motivated
by use cases with good reasons. We did indeed often discuss where the line
between lean and too lean should be; however, a "simpler" model
sometimes just hides the complexity underneath, without enough guidance. The
risk is then that implementations are not coordinated enough and
interoperability is quickly lost.

Derivation: my first example was the usage of configuration parameters,
which are not progenitors but can be seen as input entities. I tried a
simpler example with the flatfield, but I now see that we have no definition
of what a progenitor is. I assumed that the "science progenitor" was the
obvious one, while the "calibration progenitor" was secondary (this
difference can be highlighted by a wasDerivedFrom relation, by the way). But
you are right, technically they are both progenitors. Now, for a more
precise example, let's define an activity that takes as inputs A, B and C
and returns as results D and E. The internal calculations are in fact D=A+B
and E=C+D. In that case D is *not* derived from C... You may say that the
activity is then not well defined, that D is an intermediate result, but
this is the real world, and one cannot be forced to do fine-grained
provenance. This was summarized in the PR by "If there is more than one
input and more than one output to an activity, it is not clear which entity
was derived from which." This statement perhaps hides the detailed
examples and could be reformulated; the point is that "used+generated" is
not equivalent to "derived".
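To make the A/B/C example concrete, here is a minimal Python sketch (the dictionaries and function names are mine, not part of the model) showing that inferring derivation from used + wasGeneratedBy alone over-approximates the real data flow:

```python
# Hypothetical sketch: an activity "act" uses A, B, C and generates D, E,
# where internally D = A + B and E = C + D.
used = {"act": {"A", "B", "C"}}              # activity -> entities it used
was_generated_by = {"D": "act", "E": "act"}  # entity -> generating activity

def naive_derivations(entity):
    """Infer derivation from used + wasGeneratedBy alone."""
    return used[was_generated_by[entity]]

# Naive inference claims D was derived from all three inputs...
assert naive_derivations("D") == {"A", "B", "C"}

# ...but the real data flow, recorded with explicit wasDerivedFrom
# relations, excludes C for D:
was_derived_from = {"D": {"A", "B"}, "E": {"C", "D"}}
assert "C" not in was_derived_from["D"]
```

The explicit wasDerivedFrom records carry exactly the information that cannot be reconstructed from the used/wasGeneratedBy pairs.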

When discussing this at the ProvenanceWeek (a more general provenance
meeting, not just astronomy), it was explained to me that the main
information we generally extract from provenance records is the data flow
(i.e. not the activity flow). Data (or entities) are connected to each
other by the wasDerivedFrom relation, which makes it the most basic
structure to carry provenance information.
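As an illustration of that basic structure, here is a small sketch (the entity names are invented) of extracting the data flow of an entity by following wasDerivedFrom edges recursively:

```python
# Invented example records: entity -> set of entities it was derived from.
was_derived_from = {
    "DL1": {"DL0"},                  # e.g. data-level entities
    "image": {"DL1", "flatfield"},
}

def progenitors(entity, seen=None):
    """All entities this entity was (transitively) derived from."""
    seen = set() if seen is None else seen
    for parent in was_derived_from.get(entity, ()):
        if parent not in seen:
            seen.add(parent)
            progenitors(parent, seen)
    return seen

# progenitors("image") contains DL1, flatfield and DL0
assert progenitors("image") == {"DL1", "flatfield", "DL0"}
```

Walking only wasDerivedFrom edges yields the data flow directly, without having to traverse the activities in between.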

We also have to make clear that the "core model", being in fact the W3C
core model, is an *informative* part of the document. The "extended
model" is the normative part, and brings meaning to the entities, hence
some guidance. I don't see the extended model as being too expensive for
clients, as the normative names will be carried by a "type" attribute for
entities, activities and relations, and connected to a restricted
vocabulary.
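As a rough illustration of how a client might use that mechanism (the vocabulary terms and function below are hypothetical placeholders, not the normative IVOA list):

```python
# Hypothetical restricted vocabulary -- placeholder terms, not normative.
ENTITY_TYPES = {"prov:entity", "voprov:dataset", "voprov:value"}

def check_entity(record):
    """Reject records whose "type" is not in the restricted vocabulary."""
    if record.get("type") not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {record.get('type')}")
    return record

check_entity({"id": "raw-0001", "type": "voprov:dataset"})  # accepted
```

The point is that a single "type" attribute checked against a fixed word list is cheap for a client, whatever the vocabulary ends up being.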

Thanks for your input; we will now discuss all the RFC answers and review
the document.

Best regards,
Mathieu



On Tue, 20 Nov 2018 at 15:22, Markus Demleitner <
msdemlei at ari.uni-heidelberg.de> wrote:

> Dear DM,
>
> Coming back from vacation, let me join the fray;  I have a few more
> replies to Mireille's original answer that I'll post later as
> replies there.  To keep the discussion reasonably focused, I'll
> keep this on the question of the "shortcuts" (wasDerivedFrom and
> wasInformedBy).
>
> <soapbox>
> A general topic of all these things, however, is WAWSFC: We are
> writing standards for clients, i.e., for software that interprets our
> annotations and let the users work with them.  Each feature we add
> makes their job harder.  If their job is harder, they may not do it,
> and our standard remains unimplemented.  And if it's a hard job, even
> if they decide to do it, it's much more likely they'll get it wrong,
> and then people will see tracebacks instead of interoperability.
>
> *That's* why I'm haggling here to keep things lean.  And yes, we
> shouldn't be *too* lean.  When figuring out what the line is, use
> cases and their derived requirements are about the only thing that
> helps us, and that's why I keep insisting on being able to trace
> features to use cases.
> </soapbox>
>
>
> On Tue, Nov 06, 2018 at 02:11:43PM +0100, Mathieu Servillat wrote:
> > https://www.w3.org/TR/prov-dm/#term-Derivation). A generic client should
> > answer the question: does this entity have a wasDerivedFrom relation?
>
> Hm -- what exactly is the use case for this question?  In the
> debugging use case, isn't the question rather "Is Entity A in the
> progenitors of Entity B?"  Is that what you mean here?
>
> If so, then I'd say a client has an easier time without
> wasDerivedFrom.  You see, all it has to do then is search the
> provenance graph for this structure recursively:
>
> Entity_A ---- wasGeneratedBy ---> Activity_X ---- used ---> Entity_B (a)
>
> Not overly pretty, yes, but at least it's just one thing, and you
> have Activity you can use to reliably detect cycles (entities aren't
> nearly as good for that because they might legitimately be used
> multiple times in the provenance of even a single thing).
>
> Now, with wasDerivedFrom, a client *additionally* has to go through
> all structures of the type
>
> Entity_A ---- wasDerivedFrom -----> Entity_B                         (b)
>
> Yes, that, in itself, is simpler, but just because it's there doesn't
> mean you'd not have to check for pattern (a), too.  So, you now need
> to consider all kinds of mixtures of paths, do cycle detection even
> for mixes of patterns (a) and (b), and so on.  I'd say roughly
> quadrupled complexity on the client side.
>
>
> This is an argument to make the point:  "wasDerivedFrom comes as a
> cost".
>
> Granted, if that cost actually is *too* high to make wasDerivedFrom
> worthwhile for the VO DM I can't say -- as provenance goes, I don't
> have experience worth mentioning.  But I'm asking everyone involved
> to make sure they've carefully worked out that question for
> themselves.
>
>
> > Here is the paragraph on derivation in the PR: "Note that the
> > \class{WasDerivedFrom} relation cannot always automatically be
> > inferred from following existing \class{WasGeneratedBy} and
> > \class{Used} relations alone.  If there is more than one input and
> > more than one output to an activity, it is not clear which entity
> > was derived from which. Only by specifying the descriptions and
> > roles accordingly, or by adding a \class{WasDerivedFrom} relation,
> > this direct derivation becomes known."
>
> Hmyes, but in my original response I wondered:
>
>   If you have an activity with multiple inputs and outputs, it stands
>   to reason that all inputs influence all outputs, so there's nothing
>   for wasDerivedFrom to annotate.  If there's distinct, unrelated
>   groups of inputs and outputs then you really have two activities and
>   you should describe them as such rather than hack around the
>   deficient description.
>
> -- and I still don't see that point sufficiently addressed.  Put less
> abstractly, following Ole's example: How would I *not* want to know
> about a flatfield used in the production of an optical image?
>
> > Here is a preliminary diagram of what the calibration data flow for
> > CTA could look like:
> > https://banshee.obspm.fr/index.php/s/BRuf26L1sdX085u
> > Please let me have derivations, at least between data levels (e.g.
> > DL0 to DL1), so I don't have to dig in all the complex relations to
> > find the main progenitors. Also, I don't want the parameters, the
> > descriptions, the context or other side entities of my activities to
> > be exposed automatically as progenitors. Used+wasGeneratedBy does not
> > mean wasDerivedFrom all the time. The precise derivations can be
> > explained textually in the descriptions, but the derivation relation
> > helps to automatically find the relevant provenance information in
> > the mass of provenance data.
>
> So, yes, provenance graphs, modelled to sufficient detail, will end
> up being rather complex.  But that is exactly why I so yearn for
> keeping the model itself as simple as we possibly can: Throwing more
> complexity at a problem to make it more manageable has rarely worked
> (I'm not saying it never worked, though).
>
> Isn't your point here rather an argument for a hierarchical
> representation, where you'd have "top-level" entities and "top-level"
> activities in a "top-level" provenance graph, where you could then
> "drill down" into finer-grained provenance, HiPS-style?
>
> If so, I'd suggest that's not really a modelling issue.  As long as
> there is a defined way in which clients can do the "drilling down"
> (essentially links "go here for a finer-grained provenance" and "go
> here for coarser-grained provenance"), the model can remain as it is.
>
> An alternative might be a "saliency" annotation to entities and/or
> activities.  So, are we actually the first to struggle with
> provenance graphs of that complexity? If not, do we know what others
> have done to cope?  And were they happy with what they went for?
>
> > Here is the page of the working group with discussions; probably not
> > everything is contained in the minutes, but this gives a good idea of
> > the topics discussed, e.g. sometimes on derivation. Sorry if the
> > draft does not contain all those discussions, for obvious reasons,
> > but the paragraph in the PR does not come from nothing.
> > http://wiki.ivoa.net/twiki/bin/view/IVOA/ObservationProvenanceDataModel
>
> Ouch.  That's a bit much to comb through for me as a half-casual
> reviewer that just wants to humbly annotate time series points with
> their originating images.  If there's anything I should be aware of
> in particular, could you perhaps provide a more direct link?
>
> And while I'm dwelling on this point, let me put in a brief piece of
> Mireille's original reply:
>
> On Sun, 4 Nov 2018 20:26:09 +0100, Mireille wrote:
> Mireille > In the Triplestore implementation for instance it really
> Mireille > speeds up the search.  In the relational DB it avoids
> Mireille > table joins.
>
> I don't dispute these -- in a concrete implementation, it might be
> advantageous to abstract away the actual activities.  But that
> doesn't mean the interoperable model has to be burdened by that
> optimisation.  If this turns out to be a good idea, you could even
> require a (progenitor, successor) table in a relational mapping of
> the model without having to have that in the model itself.
>
>
> > I would also like to discuss a more important topic: what is relevant
> > provenance information? The W3C structure allows anyone to store a
> > huge mass of provenance *data*; however, only part of it is relevant
> > provenance *information*. The proposed extended model for the
> > astronomy domain aims at guiding projects to store the information
> > that is relevant in astronomy. But that is not sufficient: a project
> > should then select precisely the relevant provenance information for
> > their application, i.e. maybe not everything should be recorded, just
> > the minimum relevant information.
>
> That, I think, captures my sentiment as to why we should keep
> the first version of this model really small -- at least until we
> feel we understand sufficiently well what people need and what
> clients (want to) do with our annotations.
>
> It's easy to add things to standards, but very hard to take things
> away once the standard is passed.
>
> With apologies for another long mail,
>
>         Markus
>


-- 
Dr. Mathieu Servillat
Laboratoire Univers et Théories, Bât 18, Bur. 221
Observatoire de Paris-Meudon
5 place Jules Janssen
92195 Meudon, France
Tél. +33 1 45 07 78 62