Featherweight Publishing Registries

Accomazzi, Alberto aaccomazzi at cfa.harvard.edu
Thu Nov 3 22:40:24 CET 2016


Not to pile onto Walter, but my experience and general opinion are similar
to Markus's.

We use OAI-PMH to harvest nightly from arXiv.  Last night's incremental
harvesting (1,657 records retrieved from a database of 1.2M records) took
72 seconds.
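
For readers less familiar with the protocol, an incremental harvest of this
kind is a ListRecords request with a "from" date, repeated while the server
returns a resumptionToken. A minimal sketch follows. The base URL is a
hypothetical placeholder, the helper names are made up for illustration, and
the actual HTTP fetching is left out:

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, since=None, metadata_prefix="oai_dc",
                     set_spec=None, resumption_token=None):
    """Build a ListRecords request URL (helper name is illustrative)."""
    if resumption_token is not None:
        # When continuing a partial list, resumptionToken is an exclusive
        # argument: only the verb may accompany it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if since:
            params["from"] = since      # lower bound on the record datestamp
        if set_spec:
            params["set"] = set_spec    # e.g. "ivo_managed" in the VO Registry
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken of a response, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()
```

A harvester would call list_records_url() once with yesterday's date, fetch the
result, and keep requesting with the token from next_token() until it gets None.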

We don't run an OAI-PMH server so I can't advise you on what software to
use, but it sounds to me like something is seriously wrong with your
setup/implementation, rather than the level of complexity of OAI-PMH which
is, as Markus says, quite low IMHO.

Anyway, I was curious about what other options we may want to consider, so
I wrote to an old friend in the DL world (and an author of some of the
protocols we are discussing).  Here's what he had to say:

> I don't know whether or why the particular perl implementation of OAI is
> slow, I don't see any fundamental reason why it should be. If that were the
> only problem then I might suggest that someone should look at improving its
> performance.
> I do keep the OAI-PMH tools/software list up-to-date when people send me
> information, but I haven't been trying to look for material or check links
> etc. I don't think there is much active development because, for the most
> part, the tools work and the protocol has been stable for many years.

> Having said these things, I suspect that speed and tooling aren't the only
> reasons to question use of OAI-PMH. The protocol is long in the tooth and
> not very webby. My sense is that the time for repurposing Atom
> feeds for every problem is well past; I would not recommend using Atom
> unless your problem really fits what Atom is designed to do.
> Provided the resources you are trying to synchronize are really on the web
> (i.e. have resolvable URIs) then I think that sitemaps are absolutely the
> way to go. What ResourceSync does is provide some additional facilities on
> top of sitemaps that could be used to improve synchronization capabilities
> and efficiency.
> If the resources you are trying to synchronize are not on the web and you
> don't want to change so that they are, then I'm not sure what the best
> solution is. It might be best to stick with OAI-PMH until such time as the
> community is ready for a more natively web approach.


So what I take away from this is that for the case of the VO registry,
OAI-PMH is still the appropriate way to publish / harvest this content, but
we should keep an eye out for new protocols in the coming years (and
possibly consider a more linked-data approach to VO resources?).  Sitemaps
may be attractive, but they only offer a limited amount of the
functionality that OAI-PMH provides and hence will complicate the life of
the maintainers of full registries without offering any additional
advantage at this point.
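
To make the comparison concrete, here is a rough sketch of what an
incremental harvest over a plain sitemap might look like. It assumes every
<lastmod> value is at least a full YYYY-MM-DD date, and the function name is
made up for illustration. The selection by date happens on the client side,
after downloading the whole sitemap, and the sketch has no counterpart to
OAI-PMH sets or deleted-record status:

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SM_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def changed_since(sitemap_xml, since):
    """Return the <loc> URLs whose <lastmod> is on or after `since` (YYYY-MM-DD)."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall(f"{SM_NS}url"):
        loc = url.findtext(f"{SM_NS}loc")
        lastmod = url.findtext(f"{SM_NS}lastmod")
        # <lastmod> is optional; without it we cannot tell whether the
        # resource changed, so the only safe choice is to fetch it again.
        if lastmod is None or lastmod[:10] >= since:
            changed.append(loc)
    return changed
```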

-- Alberto



On Wed, Nov 2, 2016 at 7:42 AM, Markus Demleitner <
msdemlei at ari.uni-heidelberg.de> wrote:

> Hi Walter,
>
> On Fri, Oct 28, 2016 at 09:09:55PM -0700, Walter Landry wrote:
> > Just to be clear, Atom feeds are described by an IETF RFC [1], so it
> > is just as standardized as OAI-PMH.  In addition, Atom feed clients
> > are ubiquitous, there are a wide variety of Atom tools, and, of
> > course, Atom has far, far larger adoption than OAI-PMH.
>
> ...but then it does something rather different.  Unless we completely
> overturn the way the Registry has worked, we need both full and
> incremental harvesting, and I can't see how either is possible with
> Atom (where the originating server determines what records it puts
> into its feed, and the harvester has no way of selecting "all",
> "yesterday's", "last week's", or whatever -- right?)
>
> I'm less familiar with sitemaps, but there, too, it would seem that if
> we want to retain the capability to have incremental harvests, things
> will become quite a bit more complex than OAI-PMH quickly (I guess
> you'd have to use recursive sitemaps; many sites would have to
> do that anyway because of the 10 MB limit).
>
> So: I believe neither Atom nor Sitemap can serve as a basis for a
> *simplification* of Registry harvesting, if they can be shoehorned to
> work in the place of OAI-PMH at all.
>
> From your other, Fri, 28 Oct 2016 07:37:35 -0700 (PDT), mail:
>
> > Harvesting Vizier's records takes more than a day.  That does not fit
> > my definition of "works well".  IRSA's implementation is also
>
> Nah, not at all.  The whole VO Registry, including VizieR, can
> be fully re-harvested in a good deal less than an hour (ok, it takes a bit
> longer if you don't use set=ivo_managed, but few components would
> have a reason to do that).  Incremental harvesting takes minutes at
> worst.  As a registry operator (both ends, publishing and harvesting)
> I'd maintain that it does work well.
>
> > pathetically slow.  We could spend effort to make it fast, but the
> > protocol is overly complicated.  We should not have to run a special
> > service for something this semantically simple.
>
> Again: it's much less than 400 lines of code, which isn't anywhere
> near "overly complex", and "semantically simple" only applies if you
> forget about incremental harvesting, two metadata schemes (if nothing
> else, a political requirement), and sets (which are a nifty feature if
> you want to exchange validation information).
>
> If you want to reduce that (moderate) complexity, fine, but we'd have
> to be honest about what we scupper.
>
> > As another example of a busy site that had problems with OAI-PMH,
> > Google got rid of support for it 8 years ago [1].  I found this
> > quote apropos:
>
> Of course, Google solves a completely different set of problems,
> which is why...
>
> > Sitemaps, RSS, and Atom are all widely implemented, well supported
> > international standards that are much easier to implement.  I would be
>
> ...might work for Google (though I gather they've dropped support for
> RSS and Atom from the majority of their products, too...).  But, as
> stated above, they won't work for the Registry, at least nowhere near
> as a drop-in replacement for OAI-PMH.
>
>
> Anyway, we can talk here all day: It seems, Walter, that OAI-PMH is
> an itch that mainly you feel.  Moving away from it may make sense in
> the long run, but since we definitely don't want to require two
> harvesting protocols to be supported within the VO, this would mean
> going for Registry Interfaces 2.0.  If you'd like to draft something
> going in that direction, you'd of course be welcome.
>
> I'll say right now that I'll be pushing fairly hard to maintain the
> ability to do incremental harvests; ingesting the entire Registry
> involves ingesting and indexing about 1 million relatively complex
> database rows with foreign keys and all, and I don't want to do
> that twice every day or so.  While my implementation could be sped up
> fairly easily, I also note that for me, this ingestion takes about as
> long as the harvesting itself, so even with sub-optimal OAI-PMH
> implementations, harvesting is not a problematic bottleneck at the
> moment.
>
>
> The alternative: Just have another look at OAI-PMH.  It's a
> well-written standard (I wish some of our own standards were as clear
> and exhaustive), was designed with implementation simplicity in mind
> and, I claim, can be implemented in a good afternoon[1], about the
> time you've spent already discussing whether to throw it away.
> There's even a good validator for it.
>
>       -- Markus
>
>
> [1] Ok, there are a few snags (e.g.: IVOIDs are case-insensitive, so
> you'll need extra logic in some spots); but these are, by and large,
> our, the VO's, fault and won't go away by dumping OAI-PMH.
>
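
The "extra logic" that the footnote above mentions for case-insensitive
IVOIDs could look roughly like the following. This is only a sketch: it
assumes that lowercasing the whole identifier is an acceptable normalization,
and both function names and the example identifiers are made up:

```python
def normalize_ivoid(ivoid):
    """Fold an IVOID to a canonical form for use as a lookup key (assumed: lowercase)."""
    return ivoid.strip().lower()


def same_ivoid(a, b):
    """Compare two IVOIDs without regard to case."""
    return normalize_ivoid(a) == normalize_ivoid(b)


# Keying a record store on the normalized form keeps differently-cased
# spellings of one identifier from producing duplicate entries.
records = {}
for ivoid in ["ivo://example.org/MyResource", "ivo://EXAMPLE.org/myresource"]:
    records[normalize_ivoid(ivoid)] = ivoid
```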



-- 
Dr. Alberto Accomazzi
Principal Investigator
NASA Astrophysics Data System - http://ads.harvard.edu
Harvard-Smithsonian Center for Astrophysics - http://www.cfa.harvard.edu
60 Garden St, MS 83, Cambridge, MA 02138, USA