Featherweight Publishing Registries

Sarah Weissman sweissman at stsci.edu
Fri Oct 21 17:33:12 CEST 2016


Do you know why the script is so slow? Is it because of an implementation
flaw or is it because of the self-imposed Retry-after wait period that is
built into the protocol? Or if you are storing all of your records as
files on disk is it because of an IO bottleneck? I agree that the protocol
is complicated, but it seems like there is no reason that transferring
data via OAI-PMH should be much slower than any other protocol for passing
data as XML records.

If you are proposing to switch to a model where each registry returns a
feed of all its entries, without operations for subselecting based on
dates for example, then I would suggest looking into using Atom
syndication https://validator.w3.org/feed/docs/atom.html, which seems to
be designed for exactly this purpose and is already an accepted and widely
used standard on the web.
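
To illustrate, here is a minimal sketch of what consuming such a feed could
look like, in Python with only the standard library. The feed contents, the
identifiers, and the example.org URLs are made up for illustration, and the
date parser assumes timestamps always end in "Z". It also shows that a
harvester can still subselect by date on its own side, using the per-entry
"updated" element that Atom requires.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed: identifiers, links and dates are invented for this sketch.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example publishing registry</title>
  <id>http://registry.example.org/feed</id>
  <updated>2016-10-20T12:00:00Z</updated>
  <entry>
    <title>Example catalog A</title>
    <id>ivo://example/catalog-a</id>
    <updated>2016-10-20T12:00:00Z</updated>
    <link href="http://registry.example.org/records/catalog-a.xml"/>
  </entry>
  <entry>
    <title>Example catalog B</title>
    <id>ivo://example/catalog-b</id>
    <updated>2016-01-05T08:30:00Z</updated>
    <link href="http://registry.example.org/records/catalog-b.xml"/>
  </entry>
</feed>
"""


def parse_atom_time(text):
    # Simplification: only handles UTC timestamps written with a trailing "Z".
    parsed = datetime.strptime(text.strip(), "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def entries_updated_since(feed_xml, since):
    """Return (id, href) pairs for entries whose <updated> is at or after `since`."""
    root = ET.fromstring(feed_xml)
    selected = []
    for entry in root.findall(ATOM + "entry"):
        updated = parse_atom_time(entry.findtext(ATOM + "updated"))
        if updated >= since:
            href = entry.find(ATOM + "link").get("href")
            selected.append((entry.findtext(ATOM + "id"), href))
    return selected


if __name__ == "__main__":
    cutoff = datetime(2016, 10, 1, tzinfo=timezone.utc)
    for record_id, href in entries_updated_since(SAMPLE_FEED, cutoff):
        print(record_id, href)
```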

-Sarah

On 10/20/16, 2:42 PM, "registry-bounces at ivoa.net on behalf of Walter
Landry" <registry-bounces at ivoa.net on behalf of wlandry at caltech.edu> wrote:

>Hi Everyone,
>
>Here at IRSA, we run our own publishing registry, and it is a giant
>pain.  The standard is rather complex, so we use a pre-packaged Perl
>script.  That script is incredibly slow, which means that it takes a
>long time for anyone to harvest our repository.  We recently had to
>change all of our records, and it turned out that the best way to do
>it was to have everyone manually delete and then re-harvest our
>records.  This is way harder than it should be.
>
>So I would like to propose something I call a Featherweight Publishing
>Registry (FPR).  It does not use OAI-PMH.  It uses static files.
>Fetching the FPR URL would return a single HTML file.  That file would
>have a list of links.  Following those links would return the XML
>document for one (or maybe more) of the services.
>
>As a concrete example, the FPR entry for IRSA would be something like
>
>  http://irsa.ipac.caltech.edu/FPR
>
>Fetching it would give an HTML document with links to other URLs:
>
><!doctype html>
><html>
>  <a href="http://irsa.ipac.caltech.edu/FPR/2MASS/Catalog/CalMPSIT"></a>
>  <a href="http://irsa.ipac.caltech.edu/FPR/2MASS/Catalog/CalMXSIT"></a>
>  <a href="http://irsa.ipac.caltech.edu/FPR/2MASS/Catalog/CalPSWDB"></a>
>  <a href="http://irsa.ipac.caltech.edu/FPR/2MASS/Catalog/CalScanInfo"></a>
>  ...
></html>
>
>The URLs carry no semantic meaning.  They could just as well be
>completely opaque:
>
><!doctype html>
><html>
>  <a href="http://irsa.ipac.caltech.edu/xyzzy"></a>
>  <a href="http://irsa.ipac.caltech.edu/zzggy"></a>
>  <a href="http://irsa.ipac.caltech.edu/1bdDlXc"></a>
>  <a href="http://irsa.ipac.caltech.edu/RboG305ntki"></a>
>  ...
></html>
>
>Fetching those URLs would return the XML registry document for one
>or more services.  With this setup, services can use an ordinary
>link checker to verify that the targets exist.
>
>There is no explicit method for adding or removing services.  If a
>service is not in any of the XML registry documents, it is presumed to
>not exist anymore.
>
>This would greatly simplify creating and deploying a publishing
>registry.  An archive would just have to create some static files.
>
>One objection to this scheme might be that it is wasteful of
>bandwidth.  A harvesting service cannot rely on OAI-PMH for
>intelligent updates.  It has to fetch all of the URLs again.
>
>I would argue that the bandwidth used is trivial.  Here at IRSA, we
>have hundreds of services, giving us (I believe) the third largest
>number of services.  If someone harvested our complete registry every
>minute, the bandwidth used would be less than 1% of our total outbound
>bandwidth.  I doubt that, in practice, it would be a burden even for
>CDS, which has something like 30,000 services.
>
>Moreover, the current harvesting services already do a full harvest
>regularly.  I understand that one reason they do not do it more
>frequently is that everyone is using this horribly slow Perl
>script.  Static files can be served quickly and easily.
>
>In any event, I will be around during the Interop.  So maybe we can
>discuss this then.
>
>Cheers,
>Walter Landry
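
For reference, the harvesting side of the scheme described in the quoted
message can be sketched roughly as follows, in Python with only the standard
library. This is an illustration under assumptions, not part of the proposal:
the function names, the "ivo://example/..." identifiers, the minimal record
documents, and the idea that each record carries an "identifier" element are
all made up for this sketch, and `fetch` stands in for whatever HTTP client a
harvester would really use.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in the index page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(index_html):
    parser = LinkCollector()
    parser.feed(index_html)
    parser.close()
    return parser.links


def identifiers_in(record_xml):
    # Assumption: each service record contains an <identifier> element.
    # Matching on the local name ignores any namespace prefix.
    root = ET.fromstring(record_xml)
    return {el.text.strip() for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "identifier" and el.text}


def harvest(index_url, fetch):
    """Fetch the index, follow every link, return all identifiers found."""
    found = set()
    for link in extract_links(fetch(index_url)):
        found |= identifiers_in(fetch(link))
    return found


def removed_services(previously_known, found_now):
    # No explicit delete operation: anything not seen is presumed gone.
    return set(previously_known) - set(found_now)


# Stand-in for the network: hypothetical pages keyed by URL.
BASE = "http://irsa.ipac.caltech.edu/FPR"
PAGES = {
    BASE: '<!doctype html><html>'
          '<a href="' + BASE + '/2MASS/Catalog/CalMPSIT"></a>'
          '<a href="' + BASE + '/2MASS/Catalog/CalMXSIT"></a>'
          '</html>',
    BASE + "/2MASS/Catalog/CalMPSIT":
        "<resource><identifier>ivo://example/a</identifier></resource>",
    BASE + "/2MASS/Catalog/CalMXSIT":
        "<resources>"
        "<resource><identifier>ivo://example/b</identifier></resource>"
        "<resource><identifier>ivo://example/c</identifier></resource>"
        "</resources>",
}

if __name__ == "__main__":
    found = harvest(BASE, PAGES.__getitem__)
    print(sorted(found))
    print(sorted(removed_services({"ivo://example/a", "ivo://example/old"}, found)))
```

Because the URLs carry no meaning, the comparison is done on the identifiers
inside the documents rather than on the links themselves, so one document may
hold several services.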


