New ProvenanceDM working draft released, part II

Fri Oct 13 07:09:05 CEST 2017

Hi Kristin, all

My answer to Markus on PROV-VOTable, PROV-DAL and PROV-TAP was 
written exactly in parallel with yours.

I agree with what you say on that here (specifically on ProvTAP versus 
ProvDAL).

Cheers

François

PS: more on VODML mapping in my answer to Omar

On 12/10/2017 at 23:33, Kristin Riebe wrote:
> Hi Markus, DM,
>
> before answering the points below: I've put the main remaining 
> discussion points at the wiki page
> http://wiki.ivoa.net/twiki/bin/edit/IVOA/ObservationProvenanceDataModel.
>
> We'll discuss them at our next provenance telecons and meetings, if 
> they are not solved on the mailing list already.
> Some of the minor changes suggested by you are already included in the 
> current version of the draft on volute (revision 4518).
>
> --------------------------------------------
>> Table 17, the mapping between ProvenanceDM and DatasetDM labels.
>> Frankly, this makes me weep.  Is it *really* not possible to use a
>> common nomenclature, or rather common types, even between IVOA DMs?  I
>> have a hard time imagining how I'd implement this, in particular with a
>> view to "This list is not complete".  And no, "the serialised versions
>> can be adjusted to the corresponding notation" is not reassuring at all.
>> How on earth is a piece of software supposed to know what "corresponds"
>> to a given request?  My impression: For interoperability with the wider
>> world (W3C), DatasetDM should budge and just use ProvDM classes
>> wherever possible.
>
> In DatasetDM things are structured differently than in ProvenanceDM. 
> For example, we tried to find a good way to integrate 
> Entity/EntityDescription with the Dataset-class, but since attributes 
> from Entity and its Description as well as curation details (and thus 
> links to Agents) are combined in Dataset, we couldn't.
> If we want to "marry" them (and also other 
> similar-yet-not-the-same-classes), this for sure would mean major 
> changes in both models ...
>
>> Table 19 -- oh bother.  Any chance for a SimDM 2.0 that avoids the
>> duplication of classes?  As far as I can see, there's not terribly much
>> SimDM 1.0 content out there yet, so perhaps a version 2.0 wouldn't break
>> too much at this point, would it?
>
> There are reasons to have this duplication of classes, e.g. if 
> thousands of experiments are run with the same protocol but with 
> slightly different parameters. One could argue that this can be 
> just an optimisation in the implementation, but when serialising 
> thousands of experiments, references to a common protocol instead of 
> replicating all its properties can still be very handy. We also make 
> use of that in our model.
>
>> Sect. 4.3, PROV-VOTable -- I'm *really* unhappy that a VO-DML-defined
>> data model defines a "custom" serialisation while the authors of the
>> "mapping" standard (that's supposed to define how such DMs are to be
>> represented in VOTable in general) work on something that looks entirely
>> different.  Everyone will eventually regret that.  So, please, *please*
>> don't have sect. 4.3; instead, help out the VO-DML mapping folks and
>> perhaps fill in any parameters left open in there in a version 1.1 of
>> ProvDM.
>
> I admit that I haven't looked into the mapping standard for a while, 
> and will most likely have no time in the future to do so. So I hope 
> that other people can join in and give a hand with that.
> Will the VOTable generated using the VO-DML mapping standard really 
> look "entirely different" from what we suggest here?
>
>> Sect. 4.2 vs. sect. 4.5 -- I don't quite understand what purpose the
>> "almost-W3C" serialisations in 4.2 serve -- why not go for the 4.5
>> solution right away?  Ok, the need for a bit of mapping is a slight
>> uglification, but the total pollution of the IT environment is a lot
>> less with this little blemish than if you have a similar but
>> incompatible format *and* the 4.5 thing *on top*.
>
> I'd like to do some more tests of how to map e.g. the description 
> classes into W3C. The idea was that the VO-serialisations can be used to 
> exchange the provenance metadata, fully preserving the VO provenance 
> structure. So one could retrieve a provenance serialisation from a 
> ProvDAL service and upload it to a (Prov-)TAP service to investigate 
> it further. I'm willing to drop the VO-serialisations in favour of 
> fully W3C-compatible serialisations, if we can be sure that the 
> original structure can be preserved/recovered. I need to see some more 
> examples/try implementations to verify if this can work - or not.
>
>> Sect 5 -- I'm not sure if I'm terribly happy to have an access protocol
>> folded into this already fairly long document (on the other hand: the
>> fewer standards the better).
>
> Yeah, it all just grows with time ... at its birth ProvDAL was just 
> half a page. If it gets too big, it can move to a separate document.
>
>> But one thing I'm sure about is that you
>> don't want two access protocols.  It'll be hard enough to gain uptake
>> for one of them.  If you let people choose which one to implement, in
>> all likelihood half will use one protocol, and the other half the other,
>> thus at least doubling the implementation effort for client authors.  My
>> take: There are enough TAP engines available that there is no reason not
>> to just use TAP plus a canonical relational DM representation,
>> preferably built according to standard, VO-DML rules.
>
> I like ProvDAL - I think it's fairly straightforward to implement on 
> the server side. I implemented it in my Django prototype application 
> for RAVE provenance (it's described in the implementation note, see 
> https://volute.g-vo.org/svn/trunk/projects/dm/provenance/implementation-note/). 
> I had 2 reasons for implementing it:
> 1) Francois was asking for an example serialisation of my use case
> 2) I wanted W3C compatible serialisations that I can upload to 
> Southampton's ProvStore and share/visualise.
>
> So I needed something that could produce proper serialisations in 
> different formats and let me extract just a single step backwards 
> (in time) or more; it should also return a reasonable amount of 
> information (e.g. not all the members of a collection, unless I 
> explicitly ask for them). Thus, even if at some point we decide to 
> drop ProvDAL, I would still implement some kind of interface like this.
>
> As for ProvTAP: Mireille and Francois are making good progress 
> implementing it (I hope :-)).
> ProvTAP shall enable users to search in provenance metadata or select 
> datasets based on their provenance.
> However, I fear that it's going to be a nightmare for an ADQL user to 
> write queries for a relational database - one has to do many joins for 
> each step, and it's a recursive process to extract the progenitors of 
> progenitors of an entity, with usually no way to know beforehand how 
> deep one can go.
> All this can be hidden in the client, of course (and I'm doing these 
> join queries server-side in my ProvDAL implementation), but then it's 
> no longer just a (simple) TAP client with a text field for the ADQL query.
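
To make the recursion concrete, here is a minimal sketch of what a
client would have to do if the query language offers no recursive
construct. Everything in it is an assumption for illustration: the
table names was_generated_by and used, their columns, the sample rows,
and the use of an in-memory SQLite database standing in for a TAP
service. It is not the ProvTAP schema. Each loop iteration is one join
query, and the loop stops only when a step returns nothing new.

```python
import sqlite3


def progenitors(conn, entity_id):
    """Collect all progenitor entities of entity_id, one join query per step."""
    found = set()
    frontier = {entity_id}
    while frontier:
        placeholders = ",".join("?" * len(frontier))
        rows = conn.execute(
            "SELECT DISTINCT u.entity FROM was_generated_by AS g "
            "JOIN used AS u ON u.activity = g.activity "
            f"WHERE g.entity IN ({placeholders})",
            sorted(frontier),
        ).fetchall()
        # Keep only entities not seen before, so cycles cannot loop forever.
        frontier = {row[0] for row in rows} - found - {entity_id}
        found |= frontier
    return found


# Hypothetical two-step chain: raw_frame -> image -> catalogue.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE was_generated_by (entity TEXT, activity TEXT);
CREATE TABLE used (activity TEXT, entity TEXT);
INSERT INTO was_generated_by VALUES
    ('catalogue', 'extraction'), ('image', 'calibration');
INSERT INTO used VALUES
    ('extraction', 'image'), ('calibration', 'raw_frame');
""")
print(sorted(progenitors(conn, "catalogue")))
```

The number of round trips equals the depth of the provenance chain plus
one, which is not known before the last query comes back empty.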
>
> Having said that, I still think ProvTAP can be very useful for power 
> users or powerful clients.
> I just think that ProvDAL and ProvTAP are meant for different purposes.
> We have to check if one could really cover everything ProvDAL is 
> supposed to give us using ProvTAP. E.g. the results from a ProvTAP 
> query won't be directly digestible by W3C Provenance tools; one would 
> always need an additional conversion step to get the structure of a 
> valid W3C serialisation format (even just converting the TAP response 
> to JSON would not produce a valid PROV-JSON).
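
The conversion step mentioned above could look roughly like the sketch
below. It assumes a hypothetical flat result with one row per relation,
given as (relation, activity, entity) tuples, and regroups it into the
nested layout that PROV-JSON uses (records keyed by type, then by
identifier). The row layout, the "ex" prefix and the namespace URL are
made up for illustration, and the output has not been validated against
the PROV-JSON schema.

```python
import json


def rows_to_prov_json(rows, prefix="ex", namespace="http://example.org/"):
    """Regroup flat (relation, activity, entity) rows into a PROV-JSON-like dict."""
    doc = {"prefix": {prefix: namespace}, "entity": {}, "activity": {}}
    for n, (relation, activity, entity) in enumerate(rows, start=1):
        act_id = f"{prefix}:{activity}"
        ent_id = f"{prefix}:{entity}"
        # Declare each node once; attributes are left empty in this sketch.
        doc["entity"].setdefault(ent_id, {})
        doc["activity"].setdefault(act_id, {})
        # Relations are stored under their type, keyed by a blank-node id.
        doc.setdefault(relation, {})[f"_:r{n}"] = {
            "prov:activity": act_id,
            "prov:entity": ent_id,
        }
    return doc


# Hypothetical rows, as a flat query result might deliver them.
rows = [
    ("wasGeneratedBy", "extraction", "catalogue"),
    ("used", "extraction", "image"),
]
prov = rows_to_prov_json(rows)
print(json.dumps(prov, indent=2))
```

A plain dump of the rows to JSON would give a list of records, not this
type-keyed structure, which is why the extra step is needed.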
>
>> p. 43 -- "note that the relations wasDerivedFrom..." -- I'll mention
>> that the mere fact that these "optimisations" bite you in protocol
>> design for me is another indication that they shouldn't be put into the
>> standard in the first place.
>
> We'll think about it ...
>
>> Sect 5.3 -- While, as I said, I think having a relational mapping of the
>> model is an excellent idea, sect. 5.3 is not enough for a REC. To make
>> this implementable, you'd at least have to say
>>
>> * what tables, what columns, what column types make up the model -- I'd
>>    hence prefer if appendix B got a bit more comments and went here
>>    (which is not a problem if you dump PROV-VOTable)
>> * perhaps how to discover the ProvTAP service for a given PubDID or
>>    other identifier?
>> * do these guys live in a schema?  Any schema, perhaps, so a single TAP
>>    service can keep multiple, independent provenance stores?
>> * giving a data model identifier for TAPRegExt indicates there can only
>>    be one provenance store per TAP service -- is that really what you
>>    want?  My recommendation for the future is to use URIs in utype
>>    attributes of schema or table elements.
>>
>> And again, I think as much as possible of this should come out of a
>> defined process valid for all VO-DML DMs.
>
> Okay, we need to discuss this in our group as well. I haven't used any 
> schema so far. But yes, I think a TAP service should be able to keep 
> more than one provenance store. I don't get the last point, you lost 
> me there: can you please explain it in more detail - maybe with an 
> example?
>
>> Sect. 6 -- That content doesn't seem to be REC-level material to me; if
>> you write another, note-type, document anyway, why not push it there?
>
> Yes, we could do that. We thought it would be nice to have some real 
> use cases in the document already, but it could be moved to the 
> implementation note.
>
>> General point I:  Currently, my main use case would be:
>>
>>    In a VOTable containing a photometric time series, declare the column
>>    in which the source images are linked
>>
>> This would be immediately actionable by clients and, I think, obviously
>> useful (people configure such functionality by hand in TOPCAT right now,
>> but of course that only scales so far when you start to automate things
>> or venture into less familiar data).  From the current document I can't
>> really see how to even start with this.  Granted, I've not put in this
>> use case into your wiki in time.  But I think if you just employed
>> standard VO-DML mapping (and helped bring that out of the door instead
>> of inventing a custom solution), that use case would, more or less, fall
>> into your laps.
>>
>> General point II: An important motivation for modeling provenance in the
>> first place was a standardisation of the plethora of FITS keywords
>> describing ambient conditions and perhaps even instrument telemetry,
>> also with a view to go beyond FITS at some point.  While I can see why
>> the current document cannot provide the necessary vocabularies, I think
>> it would be great if it provided clear guidance as to what would be
>> required to make it happen (i.e., presumably discuss one or more
>> vocabularies).
>
> Ok, we'll look into these 2 points.
>
> Cheers,
> Kristin
>