Featherweight Publishing Registries

Fri Oct 21 10:38:30 CEST 2016

Hi Landry,
I'm not sure that I understand your intention. Do you want to start a 
discussion on a new or alternate registry protocol? Is your FPR 
proposal meant to be an alternative to the OAI-PMH solution for 
non-publishing VO registries?
I'm not at all an OAI fan, but I think we have to look carefully 
at the impact such a change could have.
Pierre Fernique

On 20/10/2016 at 20:42, Walter Landry wrote:
> Hi Everyone,
>
> Here at IRSA, we run our own publishing registry, and it is a giant
> pain.  The standard is rather complex, so we use a pre-packaged Perl
> script.  That script is incredibly slow, which means that it takes a
> long time for anyone to harvest our repository.  We recently had to
> change all of our records, and it turned out that the best way to do
> it was to have everyone manually delete and then re-harvest our
> records.  This is way harder than it should be.
>
> So I would like to propose something I call a Featherweight Publishing
> Registry (FPR).  It does not use OAI-PMH.  It uses static files.
> Fetching the FPR URL would return a single HTML file.  That file would
> have a list of links.  Following those links would return the XML
> document for one (or maybe more) of the services.
>
> As a concrete example, the FPR entry for IRSA would be something like
>
>    http://irsa.ipac.caltech.edu/FPR
>
> Fetching it would give an HTML document with links to other URLs:
>
> <!doctype html>
> <html>
>    <a href="http://irsa.ipac.caltech.edu/FPR/2MASS/Catalog/CalMPSIT"></a>
>    <a href="http://irsa.ipac.caltech.edu/FPR/2MASS/Catalog/CalMXSIT"></a>
>    <a href="http://irsa.ipac.caltech.edu/FPR/2MASS/Catalog/CalPSWDB"></a>
>    <a href="http://irsa.ipac.caltech.edu/FPR/2MASS/Catalog/CalScanInfo"></a>
>    ...
> </html>
>
> There is no semantic meaning to the URLs.  They could also be
> completely undescriptive.
>
> <!doctype html>
> <html>
>    <a href="http://irsa.ipac.caltech.edu/xyzzy"></a>
>    <a href="http://irsa.ipac.caltech.edu/zzggy"></a>
>    <a href="http://irsa.ipac.caltech.edu/1bdDlXc"></a>
>    <a href="http://irsa.ipac.caltech.edu/RboG305ntki"></a>
>    ...
> </html>
>
> Fetching those URLs would return the XML registry document for one
> or more services.  This setup makes it so that services can use an
> ordinary link checker to verify that the targets exist.
>
> There is no explicit method for adding or removing services.  If a
> service is not in any of the XML registry documents, it is presumed to
> not exist anymore.
>
> This would greatly simplify creating and deploying a publishing
> registry.  An archive would just have to create some static files.
>
> One objection to this scheme might be that it is wasteful of
> bandwidth.  A harvesting service cannot rely on OAI-PMH for
> intelligent updates.  It has to fetch all of the URLs again.
>
> I would argue that the bandwidth used is trivial.  Here at IRSA, we
> have hundreds of services, giving us (I believe) the third largest
> number of services.  If someone harvested our complete registry every
> minute, the bandwidth used would be less than 1% of our total outbound
> bandwidth.  I doubt that, in practice, it would be a burden even for
> CDS, which has something like 30,000 services.
>
> Moreover, the current harvesting services already do a full harvest
> regularly.  I understand that one reason they do not do it more
> frequently is because everyone is using this horribly slow Perl
> script.  Static files can be served quickly and easily.
>
> In any event, I will be around during the Interop.  So maybe we can
> discuss this then.
>
> Cheers,
> Walter Landry
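
For illustration, the harvesting flow described in the quoted proposal
(fetch the FPR page, follow every link, and treat any service missing
from the fetched XML as gone) might look roughly like the Python sketch
below. It is not part of the proposal. The function names and the
injectable fetch parameter are hypothetical, and the sketch assumes that
each registry record carries an identifier element, as VOResource
records do.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an FPR index page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(index_html):
    """Return the link targets found in the index page, in document order."""
    parser = LinkCollector()
    parser.feed(index_html)
    parser.close()
    return parser.links


def fetch_url(url):
    """Default fetcher: plain HTTP GET returning the raw response bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def record_identifiers(xml_bytes):
    """Return the identifiers found in one XML registry document.

    Assumption: every record has an element whose local name is
    'identifier'. A document may describe one or more services.
    """
    root = ET.fromstring(xml_bytes)
    found = set()
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.rsplit("}", 1)[-1] == "identifier":
            if elem.text and elem.text.strip():
                found.add(elem.text.strip())
    return found


def harvest(index_url, fetch=fetch_url):
    """Fetch the index, follow every link, and return all identifiers seen."""
    index_html = fetch(index_url).decode("utf-8", errors="replace")
    identifiers = set()
    for link in extract_links(index_html):
        # The link text carries no meaning; it is only followed.
        identifiers |= record_identifiers(fetch(urljoin(index_url, link)))
    return identifiers


def presumed_removed(previous, current):
    """Identifiers known from the last harvest but absent from this one."""
    return sorted(set(previous) - set(current))
```

A harvester would keep the identifier set from its previous run and
compare it with the result of harvest("http://irsa.ipac.caltech.edu/FPR")
to find services that have disappeared. Because every run is a full
harvest, no incremental-update state is needed on the publishing side.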