VOResource 1.1 rights

Wed Jan 3 11:49:08 CET 2018

Dear Registry Community,

Excuse me for starting the year with an outsized mail.

Here's the short version: Even if you don't read all of this post,
please speak up if you have opinions on how "rights information" --
which may include licensing information or perhaps requirements on
citing a specific paper -- should be represented in the VO.

And if you have use cases in that area, please reply NOW.

What happened so far
--------------------

VOResource 1.0 has a (repeatable) rights element that can contain one
of "public", "secure" (which is supposed to mean that both
unauthenticated and authenticated access is supported), or
"proprietary" (which indicates only authenticated access is
allowed).

This actually steps on the toes of
capability/interface/securityMethod (which is the sanctioned way for
clients to figure out whether and which authentication is needed on a
service; note that it's unused in the VO, too, so far).

It also didn't seem terribly useful, and when reviewing the standard
for VOResource 1.0, I couldn't find anyone actually doing anything
with rights.  To investigate practical usage on the resource
providers' side, run

  select rights, count(*) as ct
  from rr.resource
  group by rights

on a 1.0 RegTAP endpoint (e.g., ivo://esavo/registry; GAVO's RegTAP
is on a draft 1.1 already, see below).  The result currently is:

# rights                  ct
  ""                      2509
  "licensed under cc-by"  3
  "licensed under cc-0"   4
  public                  16899

-- so, apart from seven VOResource-1.0 invalid records (which are
mine and already use PR-VOResource-1.1 privileges), there is,
essentially, no information at all.  

Note that, in particular, no current resource repeats the rights
element (which would show up as something like "public#secure" here).

Based on the 2015 version of this finding, I proposed for VOResource
1.1 to simply align VOResource rights with DataCite Rights: the
content is a free text string (so you can put in things like "Cite
1915SPAW.......778E if you use this data"), and there is a rightsURI
attribute to machine-readably convey licensing information.  This is
also nicely separated from the question of authentication, which, as
said above, is the matter of securityMethod.

As in DataCite and VOResource 1.0, rights can be repeated in current
PR-VOResource 1.1.

This was briefly discussed in Cape Town
http://wiki.ivoa.net/internal/IVOA/InterOpMay2016-Reg/draft-notes.pdf,
and wasn't found controversial.

The RegTAP saga
---------------

The opening up of rights content has a dire consequence for RegTAP:
The trick to resolve the 1:n relationship between resource and rights
by just concatenating multiple entries with hash marks doesn't work
any more.  While with a controlled vocabulary,

  public#proprietary

can be uniquely parsed, something like

  See http://foo.bar/baz#quux for citation requirements.

would be split up, which (for humans) obviously is wrong in this
case.  In Shanghai, I proposed some ways to resolve this
(http://wiki.ivoa.net/internal/IVOA/InterOpMay2017-Reg/regtap.pdf,
section 7).  My expectation that nobody would care because VOResource
1.0 rights wasn't used at all seemed correct, so I went ahead and
dared remove rights even in a point release (based on the reasoning
that removing a feature that's not used in practice can't break
anything) when coming up with the RegTAP 1.1 WD.
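
The hash-mark ambiguity described above can be sketched in a few
lines of Python.  This is only an illustration of naive client-side
splitting; the function name is made up for this example and is not
part of any RegTAP client library.

```python
def split_hash_joined(concatenated):
    """Split a RegTAP 1.0 style hash-joined rights value (naive approach)."""
    return concatenated.split("#")


# With the VOResource 1.0 controlled vocabulary, splitting is unambiguous.
controlled = split_hash_joined("public#proprietary")

# With free text, a URL fragment is mistaken for an entry boundary.
free_text = split_hash_joined(
    "See http://foo.bar/baz#quux for citation requirements.")

print(controlled)
print(free_text)
```

The second call returns two pieces although the source contained a
single rights element.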

To avoid having to have another database table, rights and rightsURI
went to rr.res_detail.  This seemed reasonable since it seemed
unlikely to me that someone would do large-scale discovery based on
rights and rightsURI.

But: When I rolled out a RegTAP 1.1 WD-compliant registry on
reg.g-vo.org and thus dropped the rights columns, it broke machinery
that actually pulled all rights information.  While this probably
wasn't too useful with VOResource 1.0 rights and its limited
vocabulary, there is a point to it with future usage, where clients
might want to display strings like the "please cite..." or "published
under the XY license" more or less prominently.

The problem
-----------

With the current RegTAP 1.1 draft, routinely pulling in rights
information is painful, as it needs to be assembled from
rr.res_detail, more or less like this:

  ...
  left outer join (
    select ivoid, ivo_string_agg(detail_value, '__separator__') as rights_seq
    from rr.res_detail
    where detail_xpath='/rights'
    group by ivoid
  ) as q
  ...

-- and you'd have to then split along __separator__ on the client
side again.  *If* we expect this thing to be common (and perhaps we
should), this needs to be made simpler.
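
To make the round trip concrete, here is a minimal sketch of the
aggregate-then-split pattern, run against an in-memory SQLite
database from Python.  Everything here is an assumption for
illustration: the two toy tables only mimic the relevant columns of
rr.resource and rr.res_detail, the ivoids are invented, and SQLite's
group_concat stands in for ivo_string_agg.

```python
import sqlite3

SEPARATOR = "__separator__"

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database so tables can be addressed as rr.*
conn.execute("ATTACH DATABASE ':memory:' AS rr")
conn.execute("CREATE TABLE rr.resource (ivoid TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE rr.res_detail "
    "(ivoid TEXT, detail_xpath TEXT, detail_value TEXT)")

# Hypothetical records: one resource with two rights elements, one with none.
conn.executemany(
    "INSERT INTO rr.resource VALUES (?)",
    [("ivo://example/a",), ("ivo://example/b",)])
conn.executemany(
    "INSERT INTO rr.res_detail VALUES (?, ?, ?)",
    [("ivo://example/a", "/rights", "licensed under cc-by"),
     ("ivo://example/a", "/rights",
      "See http://foo.bar/baz#quux for citation requirements.")])

# group_concat is SQLite's stand-in for ivo_string_agg; the order in
# which it concatenates values is not guaranteed.
rows = conn.execute("""
    SELECT r.ivoid, q.rights_seq
    FROM rr.resource AS r
    LEFT OUTER JOIN (
        SELECT ivoid, group_concat(detail_value, ?) AS rights_seq
        FROM rr.res_detail
        WHERE detail_xpath = '/rights'
        GROUP BY ivoid
    ) AS q ON r.ivoid = q.ivoid
    ORDER BY r.ivoid
""", (SEPARATOR,)).fetchall()

# Client side: undo the aggregation; resources without rights yield [].
rights_by_ivoid = {
    ivoid: seq.split(SEPARATOR) if seq is not None else []
    for ivoid, seq in rows}
print(rights_by_ivoid)
```

Note that this only works as long as the separator string never
occurs inside a rights value.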

One thing that would make things more efficient at least on the
server side would be to have a dedicated rr.rights table (with rights
and rightsURI columns).  This wouldn't make the fragment much
prettier, though, since you'd still have to have a subquery and an
aggregation:

  ...
  left outer join (
    select ivoid, ivo_string_agg(rights, '__separator__') as rights_seq
    from rr.rights
    group by ivoid
  ) as q
  ...

I suppose what everyone wants is to just write:

  ... rights, rightsURI...  from rr.resource

in the select clause.

The thing that stands against this is the 1:n relationship between
resource and rights, in the absence of a strict content model.

Possible Solutions
------------------

(1) One easy way out would be to only allow a single rights element
per resource element.  As said above, right now nobody uses more than
one.  We could sneak in a last-minute incompatible change to the
VOResource 1.1 schema or at least recommend to only have one rights
element there and then have rights and rightsURI directly in the
rr.resource table.

But then: DataCite allows more than one rights element.  I'm not sure
why they did it and how much it is actually used, but I'd say the
less we deviate from DataCite the better.  Also, it's a bit late for
any normative change in VOResource 1.1.  And perhaps people actually
need multiple rights elements even in the VO.  Do you?

(2) We could, of course, leave everything as it is now -- it nicely
maps almost all information contained in rights (what's missing is
the link between a specific rightsURI and a specific rights text, but
I can't believe there's a discovery case that would need that).  How
much do people resent the query fragment above?  I suppose that in
actual implementations, queries using this pattern will still run
fast enough.

(3) We could introduce a rights table.  That would make such queries
a bit more transparent and make me worry less about being able to
scale.

But then introducing a table for a feature we're essentially lacking
credible use cases for doesn't feel right at all.

(4) We could allow multiple rights elements in VOResource but in
RegTAP say only the first one can be used for discovery (i.e., will
be in rr.resource).  This may sound a bit odd, but my suspicion is
that this would actually be the best compromise between the desire of
service operators for maximum flexibility in their declarations and
the desire of client writers to have something simple and
well-defined.  

But we'd need to better understand why people might put in multiple
rights elements before we go there, I suppose.  Would we discard
discovery-relevant information if we only kept the first element in a
sequence of rights elements?
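
What option (4) would keep and what it would drop can be sketched as
follows.  The records, the ivoids, and the helper name are
hypothetical; the point is only to show which rights elements would
never reach rr.resource.

```python
# Hypothetical harvested records: ivoid -> list of (rights, rightsURI)
# pairs in document order.
harvested = {
    "ivo://example/a": [
        ("licensed under cc-by",
         "https://creativecommons.org/licenses/by/4.0/"),
        ("Cite 1915SPAW.......778E if you use this data", None),
    ],
    "ivo://example/b": [],
}


def first_rights(rights_elements):
    """Return the first (rights, rightsURI) pair, or NULLs if there is none."""
    if rights_elements:
        return rights_elements[0]
    return (None, None)


# What would end up in rr.resource under option (4) ...
resource_rows = {
    ivoid: first_rights(elems) for ivoid, elems in harvested.items()}

# ... and what would be invisible to discovery queries.
discarded = {
    ivoid: elems[1:] for ivoid, elems in harvested.items() if len(elems) > 1}

print(resource_rows)
print(discarded)
```

In this made-up example, the citation request is exactly the part
that gets lost, which is the kind of case the question above is
about.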

(5) We could have some convention to join multiple rights and
rightsURI elements into a single string and thus keep queries simple
(e.g., any # in a value needs to be URL-escaped; this will, of
course, need further rules; or we just dump the raw XML literal into
the database).

That would be ok if clients only want to display the stuff.  It would
be painful, perhaps to the point of making things useless, if people
wanted to do discovery queries in the rights columns.
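
One possible shape of such a convention, as a minimal sketch: escape
"%" and "#" in each value before joining with "#", and reverse that
after splitting.  The function names and the choice to escape "%" as
well (so the encoding stays reversible) are assumptions made for this
example, not something any draft specifies.

```python
def escape_value(value):
    """Escape '%' first, then '#', so the encoding is reversible."""
    return value.replace("%", "%25").replace("#", "%23")


def unescape_value(value):
    """Reverse escape_value: restore '#' first, then '%'."""
    return value.replace("%23", "#").replace("%25", "%")


def join_rights(values):
    """Join several rights values into one hash-separated string."""
    return "#".join(escape_value(v) for v in values)


def split_rights(joined):
    """Recover the individual rights values from a joined string."""
    if joined == "":
        return []
    return [unescape_value(part) for part in joined.split("#")]


values = [
    "public",
    "See http://foo.bar/baz#quux for citation requirements.",
    "100% free",
]
joined = join_rights(values)
print(joined)
print(split_rights(joined))
```

A client that only displays the values can split and unescape; a
discovery query, in contrast, would have to match against the escaped
form in the database, which illustrates the pain mentioned above.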

Any other proposals for how to go on are welcome, of course.

So -- do you have any opinion?  I also take personal mail and will
anonymously summarise to the list if I get any.

Finally, I suppose the biggest problem right now is that we're
missing use cases for the rights element in VOResource.  If you have
any, by all means speak up.

           -- Markus