Featherweight Publishing Registries

Wed Nov 2 12:42:20 CET 2016

Hi Walter,

On Fri, Oct 28, 2016 at 09:09:55PM -0700, Walter Landry wrote:
> Just to be clear, Atom feeds are described by an IETF RFC [1], so it
> is just as standardized as OAI-PMH.  In addition, Atom feed clients
> are ubiquitous, there are a wide variety of Atom tools, and, of
> course, Atom has far, far larger adoption than OAI-PMH.

...but then it does something rather different.  Unless we completely
overturn the way the Registry has worked, we need both full and
incremental harvesting, and I can't see how either is possible with
Atom (where the originating server determines what records it puts
into its feed, and the harvester has no way of selecting "all",
"yesterday's", "last week's", or whatever -- right?)
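
To illustrate the selection the harvester gets with OAI-PMH, here is a
minimal sketch of how full and incremental requests differ.  The verb
and the from, set, and metadataPrefix arguments are OAI-PMH's; the
ivo_managed set and ivo_vor prefix are the ones used in the VO
Registry.  The endpoint URL and the function name are made up for the
example.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real publishing registry's OAI-PMH URL goes here.
BASE_URL = "http://registry.example.org/oai.xml"


def list_records_url(base_url, since=None, oai_set="ivo_managed",
                     prefix="ivo_vor"):
    """Return a ListRecords URL.

    Without `since`, this asks for everything in the set (full harvest).
    With `since` (a UTC date string), only records changed from that
    date on are requested (incremental harvest).
    """
    params = {"verb": "ListRecords", "metadataPrefix": prefix}
    if oai_set:
        params["set"] = oai_set
    if since:
        params["from"] = since
    return base_url + "?" + urlencode(params)


full_url = list_records_url(BASE_URL)
incremental_url = list_records_url(BASE_URL, since="2016-11-01")
```

The point is that the harvester, not the publisher, picks the time
window, simply by adding or omitting one argument.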

I'm less familiar with sitemap, but there, too, it would seem that if
we want to retain the capability to have incremental harvests, things
will become quite a bit more complex than OAI-PMH quickly (I guess
you'd have to use recursive sitemaps; many sites would have to
do that anyway because of the 10 MB limit).

So: I believe neither Atom nor Sitemap can serve as a basis for a
*simplification* of Registry harvesting, if they can be shoehorned to
work in the place of OAI-PMH at all.

From your other, Fri, 28 Oct 2016 07:37:35 -0700 (PDT), mail:

> Harvesting Vizier's records takes more than a day.  That does not fit
> my definition of "works well".  IRSA's implementation is also

Nah, not at all.  The whole VO Registry, including VizieR, can
be fully re-harvested in a good deal less than an hour (ok, it takes
a bit longer if you don't use set=ivo_managed, but few components
would have a reason to do that).  Incremental harvesting takes minutes at
worst.  As a registry operator (both ends, publishing and harvesting)
I'd maintain that it does work well.
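
For reference, a rough sketch of what the harvesting side's paging
loop can look like.  It relies on OAI-PMH's resumptionToken mechanism:
a non-empty token means there is another chunk, and the follow-up
request carries only the verb and the token.  The harvest function and
its fetch argument are hypothetical names chosen for this example, not
taken from any existing harvester.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(base_url, first_params, fetch):
    """Yield every OAI-PMH <record> element, following resumption tokens.

    `fetch` is any callable that maps a URL to the response body
    (a hypothetical hook, so the loop can be tried without a network).
    """
    params = dict(first_params)
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find(".//" + OAI_NS + "resumptionToken")
        # A missing or empty token marks the last chunk of the list.
        if token is None or not (token.text or "").strip():
            return
        # resumptionToken is an exclusive argument: send it with the verb only.
        params = {"verb": params["verb"],
                  "resumptionToken": token.text.strip()}
```

For a real run, fetch could be something like
lambda url: urllib.request.urlopen(url).read(); error responses and
retries are deliberately left out here.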

> pathetically slow.  We could spend effort to make it fast, but the
> protocol is overly complicated.  We should not have to run a special
> service for something this semantically simple.

Again: it's much less than 400 lines of code, which isn't anywhere
near "overly complicated", and "semantically simple" only applies if
you forget about incremental harvesting, two metadata schemes (if
nothing else, a political requirement), and sets (which are a nifty
feature if you want to exchange validation information).

If you want to reduce that (moderate) complexity, fine, but we'd have
to be honest about what we scupper.

> As another example of a busy site that had problems with OAI-PMH,
> Google got rid of support for it 8 years ago [1].  I found this
> quote apropos:

Of course, Google solves a completely different set of problems,
which is why...

> Sitemaps, RSS, and Atom are all widely implemented, well supported
> international standards that are much easier to implement.  I would be

...might work for Google (though I gather they've dropped support for
RSS and Atom from the majority of their products, too...).  But, as
stated above, they won't work for the Registry, at least nowhere near
as a drop-in replacement for OAI-PMH.

Anyway, we can talk here all day: It seems, Walter, that OAI-PMH is
an itch that mainly you feel.  Moving away from it may make sense in
the long run, but since we definitely don't want to require two
harvesting protocols to be supported within the VO, this would mean
going for Registry Interfaces 2.0.  If you'd like to draft something
going in that direction, you'd of course be welcome.

I'll say right now that I'll be pushing fairly hard to maintain the
ability to do incremental harvests; ingesting the entire Registry
involves ingesting and indexing about 1 million relatively complex
database rows with foreign keys and all, and I don't want to do
that twice every day or so.  While my implementation could be sped up
fairly easily, I also note that for me, this ingestion takes about as
long as the harvesting itself, so even with sub-optimal OAI-PMH
implementations, harvesting is not a problematic bottleneck at the
moment.

The alternative: Just have another look at OAI-PMH.  It's a
well-written standard (I wish some of our own standards were as clear
and exhaustive), was designed with implementation simplicity in mind
and, I claim, can be implemented in a good afternoon[1], about the
time you've spent already discussing whether to throw it away.
There's even a good validator for it.

      -- Markus

[1] Ok, there are a few snags (e.g.: IVOIDs are case-insensitive, so
you'll need extra logic in some spots); but these are, by and large,
our, the VO's, fault and won't go away by dumping OAI-PMH.
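
As an example of the kind of extra logic meant here: a minimal sketch
that stores and looks up records under a case-folded form of the
IVOID.  It assumes that lowercasing the whole identifier is an
acceptable normal form for your purposes; the function name and the
identifiers are made up for illustration.

```python
def ivoid_key(ivoid):
    """Return a lookup key under which IVOIDs differing only in case match.

    Assumption: the whole identifier may be lowercased for comparison.
    """
    return ivoid.strip().lower()


# Made-up identifiers, for illustration only.
records = {}
records[ivoid_key("ivo://Example.Org/Some/Resource")] = "record body"
found = ivoid_key("ivo://example.org/some/resource") in records
```

The same key function would have to be applied wherever identifiers
from requests are matched against stored ones.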