IVOA Provenance DM -RFC- answers to comments

Tue Nov 20 15:17:43 CET 2018

Dear DM,

Coming back from vacation, let me join the fray;  I have a few more
replies to Mireille's original answer that I'll post later as
replies there.  To keep the discussion reasonably focused, I'll
keep this on the question of the "shortcuts" (wasDerivedFrom and
wasInformedBy).

<soapbox>
A general topic of all these things, however, is WAWSFC: We are
writing standards for clients, i.e., for software that interprets our
annotations and lets the users work with them.  Each feature we add
makes their job harder.  If their job is harder, they may not do it,
and our standard remains unimplemented.  And if it's a hard job, even
if they decide to do it, it's much more likely they'll get it wrong,
and then people will see tracebacks instead of interoperability.

*That's* why I'm haggling here to keep things lean.  And yes, we
shouldn't be *too* lean.  When figuring out what the line is, use
cases and their derived requirements are about the only thing that
helps us, and that's why I keep insisting on being able to trace
features to use cases.
</soapbox>

On Tue, Nov 06, 2018 at 02:11:43PM +0100, Mathieu Servillat wrote:
> https://www.w3.org/TR/prov-dm/#term-Derivation). A generic client should
> answer the question: does this entity has a wasDerivedFrom relation ?

Hm -- what exactly is the use case for this question?  In the
debugging use case, isn't the question rather "Is Entity A in the
progenitors of Entity B?"  Is that what you mean here?

If so, then I'd say a client has an easier time without
wasDerivedFrom.  You see, all it has to do then is search the
provenance graph for this structure recursively:

Entity_A ---- wasGeneratedBy ---> Activity_X ---- used ---> Entity_B (a)

Not overly pretty, yes, but at least it's just one thing, and you
have the Activity, which you can use to reliably detect cycles
(entities aren't nearly as good for that because they might
legitimately be used multiple times in the provenance of even a
single thing).
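
To make that concrete, a minimal sketch of such a search might look
like the following.  The dictionary layout, the function name, and
the assumption that each entity has at most one generating activity
are illustrative only; none of this is taken from the Provenance DM
itself:

```python
def progenitors(entity, was_generated_by, used):
    """Collect all entities reachable through pattern (a).

    was_generated_by: dict mapping an entity to its generating activity
    used:             dict mapping an activity to the entities it used
    (Both structures are hypothetical stand-ins for a provenance graph.)
    """
    found = set()
    seen_activities = set()  # cycle detection is keyed on activities
    stack = [entity]
    while stack:
        current = stack.pop()
        activity = was_generated_by.get(current)
        if activity is None or activity in seen_activities:
            # No recorded origin, or this activity was already expanded.
            continue
        seen_activities.add(activity)
        for progenitor in used.get(activity, []):
            found.add(progenitor)
            stack.append(progenitor)
    return found


# Toy example: a light curve made from an image that was itself reduced
# from a raw frame and a flatfield.
was_generated_by = {"lightcurve": "photometry", "image": "reduction"}
used = {"photometry": ["image"], "reduction": ["raw", "flat"]}
result = progenitors("lightcurve", was_generated_by, used)
```

In this toy graph, `result` contains `image`, `raw` and `flat`.  An
entity may turn up several times during the walk, but each activity
is expanded only once, which is what keeps the loop finite.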

Now, with wasDerivedFrom, a client *additionally* has to go through
all structures of the type

Entity_A ---- wasDerivedFrom -----> Entity_B                         (b)

Yes, that, in itself, is simpler, but just because it's there doesn't
mean you'd not have to check for pattern (a), too.  So, you now need
to consider all kinds of mixtures of paths, do cycle detection even
for mixes of patterns (a) and (b), and so on.  I'd say roughly
quadrupled complexity on the client side.
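
As a rough illustration of the mixed case, here is one way a client
could walk both patterns at once.  Again, the dictionaries and the
function name are made up for this sketch, and it deliberately does
nothing clever when the two kinds of edges disagree:

```python
def progenitors_mixed(entity, was_generated_by, used, was_derived_from):
    """Collect progenitors via pattern (a) and pattern (b) together.

    was_generated_by: dict entity -> generating activity    (pattern a)
    used:             dict activity -> list of used entities (pattern a)
    was_derived_from: dict entity -> list of source entities (pattern b)
    (All three are hypothetical stand-ins for a provenance graph.)
    """
    found = set()
    visited = set()  # pattern (b) has no activity, so track entities
    stack = [entity]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        # Pattern (b): direct derivation shortcuts.
        parents = list(was_derived_from.get(current, []))
        # Pattern (a): still has to be checked for every entity.
        activity = was_generated_by.get(current)
        if activity is not None:
            parents.extend(used.get(activity, []))
        for parent in parents:
            found.add(parent)
            stack.append(parent)
    return found


# Toy example: dl2 is only linked by a shortcut, dl1 by both patterns.
was_generated_by = {"dl1": "calibration"}
used = {"calibration": ["dl0", "params"]}
was_derived_from = {"dl2": ["dl1"], "dl1": ["dl0"]}
result = progenitors_mixed("dl2", was_generated_by, used, was_derived_from)
```

Even in this small form the client has to merge two edge types at
every step, and cycle detection can no longer rely on activities
alone.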

This is an argument to make the point:  "wasDerivedFrom comes at a
cost".

Granted, whether that cost actually is *too* high to make
wasDerivedFrom worthwhile for the VO DM I can't say -- as provenance
goes, I don't have experience worth mentioning.  But I'm asking
everyone involved to make sure they've carefully worked out that
question for themselves.

> Here is the paragraph on derivation in the PR: "Note that the
> \class{WasDerivedFrom} relation cannot always automatically be
> inferred from following existing \class{WasGeneratedBy} and
> \class{Used} relations alone.  If there is more than one input and
> more than one output to an activity, it is not clear which entity
> was derived from which. Only by specifying the descriptions and
> roles accordingly, or by adding a \class{WasDerivedFrom} relation,
> this direct derivation becomes known."

Hmyes, but in my original response I wondered:

  If you have an activity with multiple inputs and outputs, it stands
  to reason that all inputs influence all outputs, so there's nothing
  for wasDerivedFrom to annotate.  If there's distinct, unrelated
  groups of inputs and outputs then you really have two activities and
  you should describe them as such rather than hack around the
  deficient description.

-- and I still don't see that point sufficiently addressed.  Put less
abstractly, following Ole's example: How would I *not* want to know
about a flatfield used in the production of an optical image?

> Here is a preliminary diagram of what can be the calibration data flow for
> CTA:
> https://banshee.obspm.fr/index.php/s/BRuf26L1sdX085u
> Please let me have derivations, at least between data levels (e.g. DL0 to
> DL1, so I don't have to dig in all the complex relations to find the main
> progenitors. Also, I don't want the parameters, the descriptions, the
> context or other side entities of my activities to be exposed automatically
> as progenitors. Used+wasGeneratedBy does not mean wasDerivedFrom all the
> time. The precise derivations can be explained textually in the
> descriptions, but the derivation relation helps to find automatically
> relevant provenance information in the mass of provenance data.

So, yes, provenance graphs, modelled to sufficient detail, will end
up being rather complex.  But that is exactly why I so yearn for
keeping the model itself as simple as we possibly can: Throwing more
complexity at a problem to make it more manageable has rarely worked
(I'm not saying it never worked, though).

Isn't your point here rather an argument for a hierarchical
representation, where you'd have "top-level" entities and "top-level"
activities in a "top-level" provenance graph, where you could then
"drill down" into finer-grained provenance, HiPS-style?

If so, I'd suggest that's not really a modelling issue.  As long as
there is a defined way in which clients can do the "drilling down"
(essentially links "go here for a finer-grained provenance" and "go
here for coarser-grained provenance"), the model can remain as it is.

An alternative might be a "saliency" annotation to entities and/or
activities.  So, are we actually the first to struggle with
provenance graphs of that complexity? If not, do we know what others
have done to cope?  And were they happy with what they went for?

> Here is the page of the working group with discussions, probably not
> everything is contained in the minutes of the discussion, but this gives a
> good idea of the topics discussed, e.g. on derivation sometimes. Sorry if
> the draft does not contain all those discussions, for obvious reasons, but
> the paragraph in the PR does not come from nothing.
> http://wiki.ivoa.net/twiki/bin/view/IVOA/ObservationProvenanceDataModel

Ouch.  That's a bit much to comb through for me as a half-casual
reviewer who just wants to humbly annotate time series points with
their originating images.  If there's anything I should be aware of
in particular, could you perhaps provide a more direct link?

And while I'm dwelling on this point, let me put in a brief piece of
Mireille's original reply:

On Sun, 4 Nov 2018 20:26:09 +0100, Mireille wrote:
Mireille > In the Triplestore implementation for instance it really
Mireille > speeds up the search.  In the relational DB it avoids table joins.

I don't dispute these -- in a concrete implementation, it might be
advantageous to abstract away the actual activities.  But that
doesn't mean the interoperable model has to be burdened by that
optimisation.  If this turns out to be a good idea, you could even
require a (progenitor, successor) table in a relational mapping of
the model without having to have that in the model itself.
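
For illustration, such a table could be derived from the basic
relations rather than modelled.  The following sketch uses SQLite;
the table and column names are invented here and do not come from any
existing relational mapping of the model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical minimal relational mapping of the two core relations.
CREATE TABLE was_generated_by (entity TEXT, activity TEXT);
CREATE TABLE used (activity TEXT, entity TEXT);

INSERT INTO was_generated_by VALUES ('image', 'reduction');
INSERT INTO used VALUES ('reduction', 'raw'), ('reduction', 'flat');

-- The shortcut table is computed once from the join, so later
-- queries for direct progenitors need no join at all.
CREATE TABLE progenitor_successor AS
  SELECT u.entity AS progenitor, g.entity AS successor
  FROM was_generated_by AS g
  JOIN used AS u ON u.activity = g.activity;
""")

rows = conn.execute(
    "SELECT progenitor, successor FROM progenitor_successor "
    "ORDER BY progenitor"
).fetchall()
```

Here `rows` holds the two pairs (flat, image) and (raw, image).  The
shortcut lives entirely in the database layer; nothing about it needs
to show up in the interoperable model.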

> I would also like to discuss a more important topic: what is relevant
> provenance information? The W3C structure allows anyone to store a huge
> mass of provenance *data*, however, only part of it is relevant provenance
> *information*. The proposed extended model for the astronomy domain aims at
> guiding projects to store the information that is relevant in astronomy.
> But it is not sufficient, a project should then select precisely the
> relevant provenance information for their application, i.e. maybe not
> everything should be recorded, just the minimum relevant information.

That, I think, just about captures my sentiment as to why we should
keep the first version of this model really small -- at least until
we feel we understand sufficiently well what people need and what
clients (want to) do with our annotations.

It's easy to add things to standards, but very hard to take things
away once the standard is passed.

With apologies for another long mail,

        Markus