<div dir="ltr">Not to pile onto Walter, but my experience and general opinion are similar to Markus's.<div><br></div><div>We use OAI-PMH to harvest nightly from arXiv. Last night's incremental harvest (1,657 records retrieved from a database of 1.2M records) took 72 seconds. </div><div><br></div><div>We don't run an OAI-PMH server so I can't advise you on what software to use, but it sounds to me like something is seriously wrong with your setup/implementation, rather than with the level of complexity of OAI-PMH, which is, as Markus says, quite low IMHO.</div><div><br></div><div>Anyway, I was curious about what other options we may want to consider, so I wrote to an old friend in the DL world (and an author of some of the protocols we are discussing). Here's what he had to say:</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span style="color:rgb(0,0,0);font-size:12.8px">I don't know whether or why the particular Perl implementation of OAI is slow; I don't see any fundamental reason why it should be. If that were the only problem then I might suggest that someone should look at improving its performance.</span><br style="color:rgb(0,0,0);font-size:12.8px"><span style="color:rgb(0,0,0);font-size:12.8px">I do keep the OAI-PMH tools/software list up-to-date when people send me information, but I haven't been trying to look for material or check links, etc. I don't think there is much active development because, for the most part, the tools work and the protocol has been stable for many years.</span></blockquote><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Having said these things, I suspect that speed and tooling aren't the only reasons to question use of OAI-PMH. The protocol is long in the tooth and not very webby. 
My sense is that the time for repurposing Atom feeds for every problem is well past; I would not recommend using Atom unless your problem really fits what Atom is designed to do.<br>Provided the resources you are trying to synchronize are really on the web (i.e. have resolvable URIs) then I think that sitemaps are absolutely the way to go. What ResourceSync does is provide some additional facilities on top of sitemaps that could be used to improve synchronization capabilities and efficiency.<br>If the resources you are trying to synchronize are not on the web and you don't want to change so that they are, then I'm not sure what the best solution is. It might be best to stick with OAI-PMH until such time as the community is ready for a more natively web approach.</blockquote></div><div><br></div><div>So what I take away from this is that for the case of the VO registry, OAI-PMH is still the appropriate way to publish / harvest this content, but we should keep an eye out for new protocols in the coming years (and possibly consider a more linked-data approach to VO resources?). Sitemaps may be attractive, but they only offer a limited amount of the functionality that OAI-PMH provides and hence will complicate the life of the maintainers of full registries without offering any additional advantage at this point.</div><div><br></div><div>-- Alberto</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Nov 2, 2016 at 7:42 AM, Markus Demleitner <span dir="ltr"><<a href="mailto:msdemlei@ari.uni-heidelberg.de" target="_blank">msdemlei@ari.uni-heidelberg.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Walter,<br>
<span class=""><br>
On Fri, Oct 28, 2016 at 09:09:55PM -0700, Walter Landry wrote:<br>
> Just to be clear, Atom feeds are described by an IETF RFC [1], so it<br>
> is just as standardized as OAI-PMH. In addition, Atom feed clients<br>
> are ubiquitous, there are a wide variety of Atom tools, and, of<br>
> course, Atom has far, far larger adoption than OAI-PMH.<br>
<br>
</span>...but then it does something rather different. Unless we completely<br>
overturn the way the Registry has worked, we need both full and<br>
incremental harvesting, and I can't see how either is possible with<br>
Atom (where the originating server determines what records it puts<br>
into its feed, and the harvester has no way of selecting "all",<br>
"yesterday's", "last week's", or whatever -- right?)<br>
<br>
I'm less familiar with sitemaps, but there, too, it would seem that if<br>
we want to retain the capability to have incremental harvests, things<br>
will become quite a bit more complex than OAI-PMH quickly (I guess<br>
you'd have to use recursive sitemaps; many sites would have to<br>
do that anyway because of the 10 MB limit).<br>
<br>
So: I believe neither Atom nor Sitemap can serve as a basis for a<br>
*simplification* of Registry harvesting, if they can be shoehorned to<br>
work in the place of OAI-PMH at all.<br>
<br>
From your other mail (Fri, 28 Oct 2016 07:37:35 -0700 (PDT)):<br>
<span class=""><br>
> Harvesting Vizier's records takes more than a day. That does not fit<br>
> my definition of "works well". IRSA's implementation is also<br>
<br>
</span>Nah, not at all. The whole VO Registry, including VizieR, can<br>
be fully re-harvested in a good deal less than an hour (ok, it takes a bit<br>
longer if you don't use set=ivo_managed, but few components would<br>
have a reason to do that). Incremental harvesting takes minutes at<br>
worst. As a registry operator (both ends, publishing and harvesting)<br>
I'd maintain that it does work well.<br>
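<br>
A minimal sketch of how the full and the incremental request differ,<br>
assuming a hypothetical endpoint URL and a helper name made up for this<br>
example; the "ivo_vor" metadata prefix is likewise an assumption here.<br>
Only the "from" argument changes between the two:<br>
<pre>
```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Selective harvesting: only records changed on or after this date.
        params["from"] = from_date
    if set_spec is not None:
        # Restrict the harvest to one OAI-PMH set.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint; a real registry publishes its own base URL.
BASE = "http://registry.example.org/oai"

# Full harvest: no "from", so the server returns everything in the set.
full = list_records_url(BASE, "ivo_vor", set_spec="ivo_managed")

# Incremental harvest: the harvester picks the starting date itself.
incremental = list_records_url(
    BASE, "ivo_vor", from_date="2016-11-01", set_spec="ivo_managed"
)

print(full)
print(incremental)
```
</pre>
The point of the sketch is that the harvester, not the server, chooses<br>
the time window, which is what a plain feed does not offer.<br>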
<span class=""><br>
> pathetically slow. We could spend effort to make it fast, but the<br>
> protocol is overly complicated. We should not have to run a special<br>
> service for something this semantically simple.<br>
<br>
</span>Again: it's much less than 400 lines of code, which isn't anywhere<br>
near "overly complex", and "semantically simple" only applies if you<br>
forget about incremental harvesting, two metadata schemes (if nothing<br>
else, a political requirement), and sets (which are a nifty feature if<br>
you want to exchange validation information).<br>
<br>
If you want to reduce that (moderate) complexity, fine, but we'd have<br>
to be honest about what we scupper.<br>
<span class=""><br>
> As another example of a busy site that had problems with OAI-PMH,<br>
> Google got rid of support for it 8 years ago [1]. I found this<br>
> quote apropos:<br>
<br>
</span>Of course, Google solves a completely different set of problems,<br>
which is why...<br>
<span class=""><br>
> Sitemaps, RSS, and Atom are all widely implemented, well supported<br>
> international standards that are much easier to implement. I would be<br>
<br>
</span>...might work for Google (though I gather they've dropped support for<br>
RSS and Atom from the majority of their products, too...). But, as<br>
stated above, they won't work for the Registry, at least not as anything<br>
close to a drop-in replacement for OAI-PMH.<br>
<br>
<br>
Anyway, we can talk here all day: It seems, Walter, that OAI-PMH is<br>
an itch that mainly you feel. Moving away from it may make sense in<br>
the long run, but since we definitely don't want to require two<br>
harvesting protocols to be supported within the VO, this would mean<br>
going for Registry Interfaces 2.0. If you'd like to draft something<br>
going in that direction, you'd of course be welcome.<br>
<br>
I'll say right now that I'll be pushing fairly hard to maintain the<br>
ability to do incremental harvests; ingesting the entire Registry<br>
involves ingesting and indexing about 1 million relatively complex<br>
database rows with foreign keys and all, and I don't want to do<br>
that twice every day or so. While my implementation could be sped up<br>
fairly easily, I also note that for me, this ingestion takes about as<br>
long as the harvesting itself, so even with sub-optimal OAI-PMH<br>
implementations, harvesting is not a problematic bottleneck at the<br>
moment.<br>
<br>
<br>
The alternative: Just have another look at OAI-PMH. It's a<br>
well-written standard (I wish some of our own standards were as clear<br>
and exhaustive), was designed with implementation simplicity in mind<br>
and, I claim, can be implemented in a good afternoon[1], about the<br>
time you've spent already discussing whether to throw it away.<br>
There's even a good validator for it.<br>
<br>
-- Markus<br>
<br>
<br>
[1] Ok, there are a few snags (e.g.: IVOIDs are case-insensitive, so<br>
you'll need extra logic in some spots); but these are, by and large,<br>
our, the VO's, fault and won't go away by dumping OAI-PMH.<br>
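<br>
A minimal sketch of the kind of extra logic meant here, assuming that<br>
lower-casing the whole identifier is an acceptable way to compare<br>
IVOIDs; the function names and the identifiers are made up for the<br>
example:<br>
<pre>
```python
def ivoid_key(ivoid):
    """Return a lookup key under which IVOIDs compare case-insensitively."""
    return ivoid.strip().lower()


def merge_records(harvested):
    """Keep the last record seen for each IVOID, ignoring case differences."""
    merged = {}
    for ivoid, record in harvested:
        merged[ivoid_key(ivoid)] = record
    return merged


# Hypothetical identifiers that differ only in case.
harvested = [
    ("ivo://Example.Org/Catalog", {"title": "old"}),
    ("ivo://example.org/catalog", {"title": "new"}),
]

merged = merge_records(harvested)
print(len(merged))
```
</pre>
Without the normalization step, the two identifiers above would be<br>
stored as two separate records.<br>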
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>Dr. Alberto Accomazzi<br>Principal Investigator</div><div>NASA Astrophysics Data System - <a href="http://ads.harvard.edu" target="_blank">http://ads.harvard.edu</a><br>Harvard-Smithsonian Center for Astrophysics - <a href="http://www.cfa.harvard.edu" target="_blank">http://www.cfa.harvard.edu</a><br>60 Garden St, MS 83, Cambridge, MA 02138, USA</div></div></div>
</div>