Featherweight Publishing Registries

Walter Landry wlandry at caltech.edu
Thu Oct 20 20:42:21 CEST 2016


Hi Everyone,

Here at IRSA, we run our own publishing registry, and it is a giant
pain.  The standard is rather complex, so we use a pre-packaged Perl
script.  That script is incredibly slow, which means that it takes a
long time for anyone to harvest our repository.  We recently had to
change all of our records, and it turned out that the best way to do
it was to have everyone manually delete and then re-harvest our
records.  This is way harder than it should be.

So I would like to propose something I call a Featherweight Publishing
Registry (FPR).  It does not use OAI-PMH.  It uses static files.
Fetching the FPR URL would return a single HTML file.  That file would
have a list of links.  Following those links would return the XML
document for one (or maybe more) of the services.

As a concrete example, the FPR entry for IRSA would be something like

  http://irsa.ipac.caltech.edu/FPR

Fetching it would give an HTML document with links to other URLs:

<!doctype html>
<html>
  <a href="http://irsa.ipac.caltech.edu/FPR/2MASS/Catalog/CalMPSIT">CalMPSIT</a>
  <a href="http://irsa.ipac.caltech.edu/FPR/2MASS/Catalog/CalMXSIT">CalMXSIT</a>
  <a href="http://irsa.ipac.caltech.edu/FPR/2MASS/Catalog/CalPSWDB">CalPSWDB</a>
  <a href="http://irsa.ipac.caltech.edu/FPR/2MASS/Catalog/CalScanInfo">CalScanInfo</a>
  ...
</html>

There is no semantic meaning to the URLs.  They could just as well be
completely non-descriptive:

<!doctype html>
<html>
  <a href="http://irsa.ipac.caltech.edu/xyzzy">xyzzy</a>
  <a href="http://irsa.ipac.caltech.edu/zzggy">zzggy</a>
  <a href="http://irsa.ipac.caltech.edu/1bdDlXc">1bdDlXc</a>
  <a href="http://irsa.ipac.caltech.edu/RboG305ntki">RboG305ntki</a>
  ...
</html>

Fetching those URLs would return the XML registry document for one or
more services.  This setup means that an ordinary link checker can be
used to verify that the targets exist.
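
To make the harvesting side concrete, here is a rough sketch in Python
of what a harvester could look like, using only the standard library.
The index URL is the example one from above; the class and function
names are just illustrative:

from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    # Collect the href of every <a> element in the index page.
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def harvest(index_url):
    # Fetch the FPR index, then fetch every linked registry record.
    parser = LinkCollector()
    parser.feed(urlopen(index_url).read().decode("utf-8"))
    # Each target holds the VOResource XML for one or more services.
    return {href: urlopen(href).read() for href in parser.hrefs}

records = harvest("http://irsa.ipac.caltech.edu/FPR")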

There is no explicit method for adding or removing services.  If a
service is not in any of the XML registry documents, it is presumed to
no longer exist.
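
The deletion rule is then just a set difference on the harvester's
side.  A sketch, assuming some way of pulling the ivo:// identifier
out of each record (extract_ids below is a hypothetical helper):

def sync(previous_ids, harvested_records, extract_ids):
    # previous_ids: identifiers seen on the last harvest.
    # harvested_records: {url: xml}, as returned by harvest() above.
    current_ids = set()
    for xml in harvested_records.values():
        current_ids.update(extract_ids(xml))
    # Anything that has disappeared is presumed to no longer exist.
    deleted = previous_ids - current_ids
    added = current_ids - previous_ids
    return current_ids, added, deleted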

This would greatly simplify creating and deploying a publishing
registry.  An archive would just have to create some static files.
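
On the publishing side, "some static files" could be as little as a
directory of VOResource XML files plus a small script to regenerate
the index.  The base URL and directory layout here are assumptions
for the sake of the example:

from pathlib import Path

BASE_URL = "http://irsa.ipac.caltech.edu/FPR"  # example base from above

def write_index(record_dir, out_file="index.html"):
    # One link per XML record file; the link text repeats the name.
    links = ['  <a href="%s/%s">%s</a>' % (BASE_URL, p.name, p.name)
             for p in sorted(Path(record_dir).glob("*.xml"))]
    Path(out_file).write_text(
        "<!doctype html>\n<html>\n%s\n</html>\n" % "\n".join(links))

Rerunning the script after any change keeps the index consistent with
the records.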

One objection to this scheme might be that it is wasteful of
bandwidth.  A harvesting service cannot rely on OAI-PMH for
incremental updates; it has to fetch all of the URLs again.

I would argue that the bandwidth used is trivial.  Here at IRSA, we
have hundreds of services, giving us (I believe) the third largest
number of services.  If someone harvested our complete registry every
minute, the bandwidth used would be less than 1% of our total outbound
bandwidth.  I doubt that, in practice, it would be a burden even for
CDS, which has something like 30,000 services.

Moreover, the current harvesting services already do a full harvest
regularly.  I understand that one reason they do not do it more
frequently is because everyone is using this horribly slow Perl
script.  Static files can be served quickly and easily.

In any event, I will be around during the Interop.  So maybe we can
discuss this then.

Cheers,
Walter Landry
