New ProvenanceDM working draft released, part II

Fri Oct 13 10:07:55 CEST 2017

Hi DM,

On Thu, Oct 12, 2017 at 11:33:54PM +0200, Kristin Riebe wrote:
> > Table 17, the mapping between ProvenanceDM and DatasetDM labels.
> > Frankly, this makes me weep.  Is it *really* not possible to use a
> > common nomenclature, or rather common types, even between IVOA DMs?  I
> > have a hard time imagining how I'd implement this, in particular with a
> > view to "This list is not complete".  And no, "the serialised versions
> > can be adjusted to the corresponding notation" is not reassuring at all.
> > How on earth is a piece of software supposed to know what "corresponds"
> > to a given request?  My impression: For interoperability with the wider
> > world (W3C), DatasetDM should budge and just use ProvDM classes
> > wherever possible.
> 
> In DatasetDM things are structured differently than in ProvenanceDM. For
> example, we tried to find a good way to integrate Entity/EntityDescription

Well, DatasetDM is still in WD stage, so perhaps it could be
re-structured.  Or, even better, perhaps wherever DatasetDM and
ProvDM have a common domain, the respective classes could be taken
out of DatasetDM, and DatasetDM would just import what ProvDM has?

> with the Dataset-class, but since attributes from Entity and its Description
> as well as curation details (and thus links to Agents) are combined in
> Dataset, we couldn't.
> If we want to "marry" them (and also other
> similar-yet-not-the-same-classes), this for sure would mean major changes in
> both models ...

I'm not so sure about "both".  Since ProvDM largely builds on top of
an established, external standard, I'd always prefer its solutions
over custom, VO-only ones in DatasetDM.

> > Table 19 -- oh bother.  Any chance for a SimDM 2.0 that avoids the
> > duplication of classes?  As far as I can see, there's not terribly much
> > SimDM 1.0 content out there yet, so perhaps a version 2.0 wouldn't break
> > too much at this point, would it?
> 
> There are reasons to have this duplication of classes - e.g. if there are
> many (thousands of) experiments run that have the same protocol, but with

Oh, no, I wasn't talking about the *Description classes -- I still
find them fishy, but if you've firmly established you can't live
without them, I'm willing to trust that judgement.

What makes me cringe is that, again, two IVOA DMs model pretty much
the same domain and use different classes for them.  What I'd like to
see is that SimDM imports ProvDM's modeling for their modeling of
Provenance-related information.  That would be beneficial all around.
On top of that, that'd be a useful test for ProvDM's scope.

> slightly different parameters. One could argue that this can be just an
> optimisation in the implementation, but when serialising thousands of
> experiments, references to a common protocol instead of replicating all its
> properties can still become very handy. We use the benefit of that also in
> our model.
> 
> > Sect. 4.3, PROV-VOTable -- I'm *really* unhappy that a VO-DML-defined
> > data model defines a "custom" serialisation while the authors of the
> Will the VOTable generated using the VO-DML mapping standard really look
> "entirely different" than what we suggest here?

Well, it depends on what the mapping document will look like in the
end; it's important that actual users contribute to the effort.  I
won't hide that I hope the result of wide community participation
will make this a lot better (in the sense of: easily
comprehensible, reasonably compact annotation) than it currently is.

> > Sect 5 -- I'm not sure if I'm terribly happy to have an access protocol
> > folded into this already fairly long document (on the other hand: the
> > fewer standards the better).
> 
> Yeah, it all just grows with time ... at its birth ProvDAL was just half a
> page. If it gets too big, it can move to a separate document, if needed.
> 
> > But one thing I'm sure about is that you
> > don't want two access protocols.  It'll be hard enough to gain uptake
> > for one of them.  If you let people choose which one to implement, in
> > all likelihood half will use one protocol, and the other half the other,
> > thus at least doubling the implementation effort for client authors.  My
> > take: There are enough TAP engines available that there is no reason not
> > to just use TAP plus a canonical relational DM representation,
> > preferably built according to standard, VO-DML rules.
> 
> I like ProvDAL - I think it's fairly straight-forward to implement on the
> server-side. I implemented it in my django-prototype application for RAVE
> provenance (it's described in the implementation note, see https://volute.g-vo.org/svn/trunk/projects/dm/provenance/implementation-note/).

I'm sure there are good and valid reasons for wanting some "simple"
(though in the end these things tend to become more complex as they
evolve) access protocol.

But these have to be weighed against the usability of the whole
system.  And for that, two protocols providing access to the same
data is painful.

Let's take spectra; we currently have SSAP and ObsTAP; an all-VO
search for spectra currently has to use both, as many spectra are
only in SSAP, and some are only in ObsTAP.  So, as a client you'll
have to speak two protocols, with different parameter and response
formats.  Worse: Quite a few spectra are available through both SSAP
and ObsTAP, so a good user interface will try to filter duplicates.
One can hope that our PubDIDs will let you do that, but frankly I'd
rather not have to rely on that.
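
To make the client-side burden concrete, here is a minimal sketch of
the duplicate filtering such a client would need.  It assumes both
result sets have already been normalised to dictionaries with a
"pubdid" key; that key name, the sample identifiers, and the function
name are all made up for illustration.

```python
def merge_by_pubdid(ssap_rows, obstap_rows):
    """Merge two result lists, keeping one record per publisher DID.

    Records without a publisher DID cannot be matched and are all kept.
    """
    merged = {}
    unmatched = []
    for row in list(ssap_rows) + list(obstap_rows):
        pubdid = row.get("pubdid")  # hypothetical normalised key
        if not pubdid:
            unmatched.append(row)
        elif pubdid not in merged:
            merged[pubdid] = row  # the first protocol to report it wins
    return list(merged.values()) + unmatched


# Made-up sample data: one spectrum is visible through both protocols.
ssap = [
    {"pubdid": "ivo://example.org/spectra?a", "via": "SSAP"},
    {"pubdid": "ivo://example.org/spectra?b", "via": "SSAP"},
]
obstap = [
    {"pubdid": "ivo://example.org/spectra?b", "via": "ObsTAP"},
    {"pubdid": "ivo://example.org/spectra?c", "via": "ObsTAP"},
    {"pubdid": None, "via": "ObsTAP"},
]
result = merge_by_pubdid(ssap, obstap)
```

Note that this only works as well as the PubDIDs are assigned
consistently across both services, which is exactly the dependency
one would rather avoid.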

I propose that if we asked our users, they'd rather not have this
situation, and I submit that if we could start the VO with TAP
available already, we'd not have SSAP, or perhaps SSAP only as a
standardised, thin layer on top of ObsTAP with a guarantee that all
data you get through SSAP is available through ObsTAP, too, and vice
versa.

With provenance, we still have the chance to avoid this
user-unfriendly situation.

> However, I fear that it's going to be a nightmare for an ADQL user to write
> queries for a relational database - one has to do many joins for each step,
> and it's a recursive process to extract the progenitors of progenitors of an
> entity, with usually no way to know beforehand how deep one can go.

Well, I give you that relational queries in tree-like structures tend
to be a bit ugly.  I suppose the implementation of your ProvDAL
prototype could provide hints on just how ugly.
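
For a rough idea of what the recursive part looks like in plain SQL,
here is a sketch using SQLite from Python.  The single table
was_derived_from(entity_id, progenitor_id) and its contents are
assumptions made up for this example, not the relational mapping of
the draft, and the recursive WITH clause used here is standard SQL
that, as far as I am aware, ADQL 2.0 does not offer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical edge table: one row per "entity was derived from progenitor".
conn.execute("CREATE TABLE was_derived_from (entity_id TEXT, progenitor_id TEXT)")
conn.executemany(
    "INSERT INTO was_derived_from VALUES (?, ?)",
    [
        ("spectrum", "calibrated_frame"),
        ("calibrated_frame", "raw_frame"),
        ("calibrated_frame", "flat_field"),
        ("other_product", "raw_frame"),
    ],
)

# Recursive common table expression: start with the direct progenitors,
# then keep following the edges until no new rows turn up.
# This assumes the provenance graph has no cycles.
ANCESTORS = """
WITH RECURSIVE ancestors(id, depth) AS (
    SELECT progenitor_id, 1 FROM was_derived_from WHERE entity_id = ?
    UNION
    SELECT w.progenitor_id, a.depth + 1
    FROM was_derived_from AS w JOIN ancestors AS a ON w.entity_id = a.id
)
SELECT id, depth FROM ancestors ORDER BY depth, id
"""


def progenitors(entity_id):
    """Return (progenitor id, depth) pairs for all ancestors of entity_id."""
    return conn.execute(ANCESTORS, (entity_id,)).fetchall()


rows = progenitors("spectrum")
```

The depth need not be known beforehand here; without such a construct
one join per step has to be written out by hand.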

If ADQL really proves severely inadequate, perhaps this is a use case
for introducing a new standard language into TAP -- isn't there,
perhaps, practice outside the VO?  There are tree databases and
related languages out there, and this also feels a bit as if SPARQL
might help a lot?

> > Sect 5.3 -- While, as I said, I think having a relational mapping of the
> > model is an excellent idea, sect. 5.3 is not enough for a REC.  To make
> > this implementable, you'd at least have to say
> > 
[...]
> > * giving a data model identifier for TAPRegExt indicates there can only
> >    be one provenance store per TAP service -- is that really what you
> >    want?  My recommendation for the future is to use URIs in utype
> >    attributes of schema or table elements.
> > 
> > And again, I think as much as possible of this should come out of a
> > defined process valid for all VO-DML DMs.
> 
> Okay, we need to discuss this in our group as well. I haven't used any
> schema so far. But yes, I think a TAP service should be able to keep more
> than one provenance store. I don't get the last point, you lost me there:
> can you please explain it in more detail - maybe with an example?

Well, it's one of these registry things.  I suppose I should write a
brief note on this, because it's largely I who's messed this up.

Here's the gist: The current standard pattern to discover TAP
services offering data conforming to a data model (in this context: a
certain set of pre-defined tables) is based on TAPRegExt's dataModel
element, which is a child of capability.  If you check the GAVO DC's
TAP resource record,

  http://dc.zah.uni-heidelberg.de/getRR/__system__/tap/run

you'll see something like:

  <capability 
      standardID="ivo://ivoa.net/std/TAP" 
      xsi:type="tr:TableAccess">
    [...]
    <dataModel 
      ivo-id="ivo://ivoa.net/std/ObsCore#table-1.1">Obscore-1.1</dataModel>
    <dataModel 
      ivo-id="ivo://ivoa.net/std/RegTAP#1.0">Registry 1.0</dataModel>
    <dataModel 
      ivo-id="ivo://org.gavo.dc/std/glots#tables-1.0">GloTS 1.0</dataModel>

-- from which one can work out that this service has an ivoa.obscore
table, the thirteen rr.<bla> tables, and the tables making up GloTS.
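
As a sketch of how a client might read these declarations, the
following parses a capability element and collects the dataModel
children.  The XML is the excerpt above, closed and given namespace
declarations so it is well-formed on its own; the function name is
made up for this example.

```python
import xml.etree.ElementTree as ET

# The excerpt from above, with the elided parts left out and the
# namespace prefixes declared so that it parses standalone.
CAPABILITY = """
<capability xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:tr="http://www.ivoa.net/xml/TAPRegExt/v1.0"
    standardID="ivo://ivoa.net/std/TAP"
    xsi:type="tr:TableAccess">
  <dataModel
    ivo-id="ivo://ivoa.net/std/ObsCore#table-1.1">Obscore-1.1</dataModel>
  <dataModel
    ivo-id="ivo://ivoa.net/std/RegTAP#1.0">Registry 1.0</dataModel>
  <dataModel
    ivo-id="ivo://org.gavo.dc/std/glots#tables-1.0">GloTS 1.0</dataModel>
</capability>
"""


def declared_data_models(capability_xml):
    """Map each declared data model identifier to its human-readable label."""
    root = ET.fromstring(capability_xml)
    return {
        el.get("ivo-id"): (el.text or "").strip()
        for el in root.iter("dataModel")
    }


models = declared_data_models(CAPABILITY)
```

All this tells the client is that the models are present somewhere in
the service; it does not say in which tables.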

That's all nice and shiny as long as these are singletons, i.e.,
there's only one "instance" of the top-level element (in this case a
table or schema).  When I wrote TAPRegExt, the use case was obscore,
which is a singleton, and even with RegTAP the thinko didn't become
apparent.

It was EPNTAP that exposed the modeling error: conformance to a data
model is *not* a property of the service (and hence the capability).
It is a property of a table or schema.  EPNTAP exposed that because
it allowed multiple EPNTAP tables in a single service and thus it's
not enough any more to just say "there's an instance of the DM in
this service" to reliably discover where these tables are.

The solution is of course to put the metadata where it belongs: on
the tables or schemas.  Fortunately, these already are declared in
the registry in the tableset element, and that even has a DM-related
attribute for them.

Hence, we're proposing for EPNTAP (there's no WD yet, but
already quite a few services doing this) to put the ivoid of the
model into the table's utype; in the above RR, you can see this in
the mpc.epn_core table:

  <table>
    <name>mpc.epn_core</name>
    <title> EPN-TAP table for MPC Asteroid Orbital Data</title>
    <description> [...] </description>
    <utype>ivo://vopdc.obspm/std/epncore#schema-2.0</utype>
    [...]

It is this pattern that I propose to use for DM declaration in the
future (it also yields nice RegTAP queries).
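
A sketch of such a query, built in Python so the model identifier can
be swapped in: it looks for tables carrying a given utype and returns
the TAP access URLs of the services holding them.  The function name
is made up; the rr.* tables and columns are those of RegTAP as I
understand them, so please check them against the standard before
relying on this.

```python
def tables_for_model_query(model_utype):
    """Build a RegTAP query finding tables that declare model_utype."""
    # As far as I recall, RegTAP stores utypes lowercased; single quotes
    # in ADQL string literals are escaped by doubling them.
    literal = model_utype.lower().replace("'", "''")
    return (
        "SELECT ivoid, access_url, table_name\n"
        "FROM rr.res_table\n"
        "NATURAL JOIN rr.capability\n"
        "NATURAL JOIN rr.interface\n"
        f"WHERE table_utype = '{literal}'\n"
        "AND standard_id = 'ivo://ivoa.net/std/tap'\n"
        "AND intf_role = 'std'"
    )


query = tables_for_model_query("ivo://vopdc.obspm/std/epncore#schema-2.0")
print(query)
```

Each result row names one table, so several tables following the same
model within one service simply show up as several rows.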

Yeah, I should write a note about this.

        -- Markus