New ProvenanceDM working draft released, part II

Thu Oct 12 23:33:54 CEST 2017

Hi Markus, DM,

Before answering the points below: I've put the main remaining
discussion points on the wiki page
http://wiki.ivoa.net/twiki/bin/edit/IVOA/ObservationProvenanceDataModel.

We'll discuss them at our next provenance telecons and meetings, unless
they are resolved on the mailing list first.
Some of the minor changes you suggested are already included in the
current version of the draft on volute (revision 4518).

--------------------------------------------
> Table 17, the mapping between ProvenanceDM and DatasetDM labels.
> Frankly, this makes me weep.  Is it *really* not possible to use a
> common nomenclature, or rather common types, even between IVOA DMs?  I
> have a hard time imagining how I'd implement this, in particular with a
> view to "This list is not complete".  And no, "the serialised versions
> can be adjusted to the corresponding notation" is not reassuring at all.
> How on earth is a piece of software supposed to know what "corresponds"
> to a given request?  My impression: For interoperability with the wider
> world (W3C), DatasetDM should budge and just use ProvDM classes
> wherever possible.

DatasetDM is structured differently from ProvenanceDM. For example, we
tried to find a good way to integrate Entity/EntityDescription with the
Dataset class, but we couldn't: Dataset combines attributes from Entity
and its Description with curation details (and thus links to Agents).
If we want to "marry" them (and other similar-yet-not-the-same
classes), this would certainly mean major changes in both models ...

> Table 19 -- oh bother.  Any chance for a SimDM 2.0 that avoids the
> duplication of classes?  As far as I can see, there's not terribly much
> SimDM 1.0 content out there yet, so perhaps a version 2.0 wouldn't break
> too much at this point, would it?

There are reasons to have this duplication of classes - e.g. if
thousands of experiments are run with the same protocol, but with
slightly different parameters. One could argue that this is just an
optimisation in the implementation, but when serialising those thousands
of experiments, referencing a common protocol instead of replicating all
its properties can still be very handy. We make use of that benefit in
our model as well.
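
To illustrate the point, here is a minimal sketch (not taken from the
draft; all class and field names are hypothetical) of a thousand
experiments that share one protocol description. The serialisation
writes the description once and lets each experiment refer to it by id:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtocolDescription:
    """Shared, static description (hypothetical stand-in for a description class)."""
    id: str
    name: str
    code_version: str


@dataclass
class Experiment:
    """One run: stores only what differs, plus a reference to the description."""
    id: str
    protocol: ProtocolDescription
    parameters: dict


def serialise(experiments):
    """Write each description once; experiments point to it by id."""
    descriptions = {e.protocol.id: e.protocol for e in experiments}
    return {
        "descriptions": {
            d.id: {"name": d.name, "code_version": d.code_version}
            for d in descriptions.values()
        },
        "experiments": [
            {"id": e.id, "description": e.protocol.id, "parameters": e.parameters}
            for e in experiments
        ],
    }


protocol = ProtocolDescription("proto:nbody", "N-body run", "2.1")
runs = [Experiment(f"exp:{i}", protocol, {"seed": i}) for i in range(1000)]
doc = serialise(runs)
```

Here `doc` holds a single description entry next to the 1000 experiment
records, instead of 1000 copies of the protocol properties.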

> Sect. 4.3, PROV-VOTable -- I'm *really* unhappy that a VO-DML-defined
> data model defines a "custom" serialisation while the authors of the
> "mapping" standard (that's supposed to define how such DMs are to be
> represented in VOTable in general) work on something that looks entirely
> different.  Everyone will eventually regret that.  So, please, *please*
> don't have sect. 4.3; instead, help out the VO-DML mapping folks and
> perhaps fill in any parameters left open in there in a version 1.1 of
> ProvDM.

I admit that I haven't looked into the mapping standard for a while, and
I will most likely not find the time to do so in the future either. So I
hope that other people can join in and lend a hand with that.
Will the VOTable generated using the VO-DML mapping standard really look
"entirely different" from what we suggest here?

> Sect. 4.2 vs. sect. 4.5 -- I don't quite understand what purpose the
> "almost-W3C" serialisations in 4.2 serve -- why not go for the 4.5
> solution right away?  Ok, the need for a bit of mapping is a slight
> uglification, but the total pollution of the IT environment is a lot
> less with this little blemish than if you have a similar but
> incompatible format *and* the 4.5 thing *on top*.

I'd like to do some more tests on how to map e.g. the description
classes into W3C. The idea was that the VO serialisations can be used to
exchange the provenance metadata while fully preserving the VO
provenance structure. So one could retrieve a provenance serialisation
from a ProvDAL service and upload it to a (Prov-)TAP service to
investigate it further. I'm willing to drop the VO serialisations in
favour of fully W3C-compatible serialisations, if we can be sure that
the original structure can be preserved/recovered. I need to see some
more examples and try implementations to verify whether this can work.

> Sect 5 -- I'm not sure if I'm terribly happy to have an access protocol
> folded into this already fairly long document (on the other hand: the
> less standards the better).  

Yeah, it all just grows with time ... at its birth ProvDAL was just half
a page. If it gets too big, it can move to a separate document.

> But one thing I'm sure about is that you
> don't want two access protocols.  It'll be hard enough to gain uptake
> for one of them.  If you let people choose which one to implement, in
> all likelihood half will use one protocol, and the other half the other,
> thus at least doubling the implementation effort for client authors.  My
> take: There are enough TAP engines available that there is no reason not
> to just use TAP plus a canonical relational DM representation,
> preferably built according to standard, VO-DML rules.

I like ProvDAL - I think it's fairly straightforward to implement on
the server side. I implemented it in my django-prototype application for
RAVE provenance (it's described in the implementation note, see
https://volute.g-vo.org/svn/trunk/projects/dm/provenance/implementation-note/).
I had 2 reasons for implementing it:
1) Francois was asking for an example serialisation of my use case
2) I wanted W3C-compatible serialisations that I can upload to
Southampton's ProvStore and share/visualise.

So I needed something that could produce proper serialisations in
different formats, let me extract just 1 single step backwards (in
time) or more, and return a reasonable amount of information
(e.g. not all the members of a collection, unless I explicitly ask for
them). Thus, even if at some point we decide to drop ProvDAL, I would
still implement some kind of interface like this.

As for ProvTAP: Mireille and Francois are making good progress
implementing it (I hope :-)).
ProvTAP shall enable users to search in provenance metadata or select 
datasets based on their provenance.
However, I fear that it's going to be a nightmare for an ADQL user to 
write queries for a relational database - one has to do many joins for 
each step, and it's a recursive process to extract the progenitors of 
progenitors of an entity, with usually no way to know beforehand how 
deep one can go.
All this can be hidden in the client, of course (and I'm doing these
join queries server-side in my ProvDAL implementation), but then it's
no longer just a (simple) TAP client with a text field for the ADQL
query.
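
As a rough sketch of what such hidden logic looks like (this is not the
code of my ProvDAL implementation; the two-table layout, the table and
column names and the sample data are assumptions made up for the
example, and a real relational mapping would need more joins per step):

```python
import sqlite3

# Hypothetical, strongly simplified relational layout: two relation tables only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE was_generated_by (entity TEXT, activity TEXT);
    CREATE TABLE used (activity TEXT, entity TEXT);
    INSERT INTO was_generated_by VALUES ('catalogue', 'extract'), ('image', 'reduce');
    INSERT INTO used VALUES ('extract', 'image'), ('reduce', 'raw_frame');
""")

# One step backwards: entity -> generating activity -> entities it used.
STEP_QUERY = """
    SELECT u.entity
    FROM was_generated_by AS g
    JOIN used AS u ON u.activity = g.activity
    WHERE g.entity = ?
"""


def progenitors(conn, entity, max_depth=None):
    """Return {progenitor: depth}, repeating the join step until nothing new turns up."""
    found = {}
    frontier = [entity]
    depth = 0
    while frontier and (max_depth is None or depth < max_depth):
        depth += 1
        next_frontier = []
        for current in frontier:
            for (parent,) in conn.execute(STEP_QUERY, (current,)):
                if parent not in found and parent != entity:
                    found[parent] = depth
                    next_frontier.append(parent)
        frontier = next_frontier
    return found


one_step = progenitors(conn, "catalogue", max_depth=1)
all_steps = progenitors(conn, "catalogue")
```

With `max_depth=1` only the direct progenitor is returned; without a
limit the loop keeps issuing the join query until no further progenitors
are found, since the depth is not known beforehand.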

Having said that, I still think ProvTAP can be very useful for power 
users or powerful clients.
I just think that ProvDAL and ProvTAP are meant for different purposes.
We have to check whether ProvTAP could really cover everything ProvDAL
is supposed to give us. E.g. the results of a ProvTAP query won't be
directly digestible by W3C provenance tools; one would always need an
additional conversion step to get the structure of a valid W3C
serialisation format (even just converting the TAP response to JSON
would not produce valid PROV-JSON).
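
A small sketch of that conversion step, under some assumptions: the
flat rows and their column names are made up for the example, and the
target follows the PROV-JSON layout as I understand it (top-level
"entity", "activity" and "used" maps, with relations keyed by an
identifier). It is not a complete or validated converter:

```python
import json

# Hypothetical flat result of a table query: one row per "used" relation.
rows = [
    {"activity_id": "ex:reduce", "entity_id": "ex:raw_frame"},
    {"activity_id": "ex:extract", "entity_id": "ex:image"},
]


def rows_to_prov_json(rows, prefixes):
    """Regroup flat relation rows into a PROV-JSON-like nested structure."""
    doc = {"prefix": dict(prefixes), "entity": {}, "activity": {}, "used": {}}
    for n, row in enumerate(rows, start=1):
        # Every node mentioned in a relation gets its own (here empty) record.
        doc["entity"].setdefault(row["entity_id"], {})
        doc["activity"].setdefault(row["activity_id"], {})
        # The relation itself is keyed by a generated blank-node identifier.
        doc["used"][f"_:u{n}"] = {
            "prov:activity": row["activity_id"],
            "prov:entity": row["entity_id"],
        }
    return doc


naive = json.dumps(rows)  # valid JSON, but just a list of rows
prov = rows_to_prov_json(rows, {"ex": "http://example.org/"})
```

The naive dump is just a list of table rows, whereas `prov` groups the
same information by node and relation type, which is the kind of
restructuring a client or service would have to add on top of ProvTAP.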

> p. 43 -- "note that the relations wasDerivedFrom..." -- I'll mention
> that the mere fact that these "optimisations" bite you in protocol
> design for me is another indication that they shouldn't be put into the
> standard in the first place.

We'll think about it ...

> Sect 5.3 -- While, as I said, I think having a relational mapping of the
> model is an excellent idea, sect. 5.3 is not enough for a REC.  To make
> this implementable, you'd at least have to say
> 
> * what tables, what columns, what column types make up the model -- I'd
>    hence prefer if appendix B got a bit more comments and went here
>    (which is not a problem if you dump PROV-VOTable)
> * perhaps how to discover the ProvTAP service for a given PubDID or
>    other identifier?
> * do these guys live in a schema?  Any schema, perhaps, so a single TAP
>    service can keep multiple, independent provenance stores?
> * giving a data model identifier for TAPRegExt indicates there can only
>    be one provenance store per TAP service -- is that really what you
>    want?  My recommendation for the future is to use URIs in utype
>    attributes of schema or table elements.
> 
> And again, I think as much as possible of this should come out of a
> defined process valid for all VO-DML DMs.

Okay, we need to discuss this in our group as well. I haven't used any
schema so far. But yes, I think a TAP service should be able to keep
more than one provenance store. I don't get the last point; you lost me
there. Could you please explain it in more detail - maybe with an
example?

> Sect. 6 -- That content doesn't seem to be REC-level material to me; if
> you write another, note-type, document anyway, why not push it there?

Yes, we could do that. We thought it would be nice to have some real use 
cases in the document already, but it could be moved to the 
implementation note.

> General point I:  Currently, my main use case would be:
> 
>    In a VOTable containing a photometric time series, declare the column
>    in which the source images are linked
> 
> This would be immediately actionable by clients and, I think, obviously
> useful (people configure such functionality by hand in TOPCAT right now,
> but of course that only scales so far when you start to automate things
> or venture into less familiar data).  From the current document I can't
> really see how to even start with this.  Granted, I've not put in this
> use case into your wiki in time.  But I think if you just employed
> standard VO-DML mapping (and helped bring that out of the door instead
> of inventing a custom solution), that use case would, more or less, fall
> into your laps.
> 
> General point II: An important motivation for modeling provenance in the
> first place was a standardisation of the plethora of FITS keywords
> describing ambient conditions and perhaps even instrument telemetry,
> also with a view to go beyond FITS at some point.  While I can see why
> the current document cannot provide the necessary vocabularies, I think
> it would be great if it provided clear guidance as to what would be
> required to make it happen (i.e., presumably discuss one or more
> vocabularies).

Ok, we'll look into these 2 points.

Cheers,
Kristin

-- 
-------------------------------------------------------
Dr. Kristin Riebe
Press and Public Outreach

Email: kriebe at aip.de, webmaster at aip.de
Phone: +49 331 7499-377
Room:  Bib/3
-------------------------------------------------------
Leibniz-Institut für Astrophysik Potsdam (AIP)
An der Sternwarte 16, D-14482 Potsdam
Vorstand: Prof. Dr. Matthias Steinmetz, Matthias Winker
Stiftung bürgerlichen Rechts
Stiftungsverzeichnis Brandenburg: 26 742-00/7026
-------------------------------------------------------