Harvesting
Pierre Fernique
fernique at simbad.u-strasbg.fr
Tue Sep 9 05:10:20 PDT 2003
Robert Hanisch wrote:
> Hi Tony et al. Yes, I think the next major thing is the
> harvesting/integration/maintenance of the registries, and we are far enough
> along on the content and structure (a few more tweaks being required) to
> start thinking about the next steps.
>
> We need to come to some -- perhaps temporary -- agreement on the intertwined
> issues of identifiers and mirror services. I would like to ask the CDS
> folks who developed GLU (Pierre Fernique, especially) to comment on this
> issue, as they have a number of years of experience already in managing
> mirror services within the GLU database. Pierre: could you remind us how
> mirror or replica services are denoted in GLU, and how your CDS services
> that utilize GLU decide which of a number of replica services to utilize?
>
Hi all,
It is difficult to explain GLU mirror management in a few words, so this
is a long letter. Skip to the end of this message for my conclusions if
you don't want the details.
Regards
Pierre
---
In short, the GLU allows one to get the proper URL to access a
"resource". A GLU entry is basically just a pair
(Identifier, URL template). At this level, we describe a way to access
resources. A resource can be a simple HTML page or an access point to a
database; in the latter case we may prefer to call it a "service".
Optionally, one can add further information (description, parameter
specifications, type of result), as in the example below:
<RESOURCE ID="CDS/aladin/Aladin.fr">
<NAME>Aladin.fr</NAME>
<DESCRIPTION>
Aladin at CDS (Strasbourg France)
</DESCRIPTION>
<QUERY>
<URL>
http://aladin.u-strasbg.fr/java/nph-aladin.pl?script=$1
</URL>
<VAR NAME="1">
<DESCRIPTION>Script commands</DESCRIPTION>
</VAR>
</QUERY>
<RESULT>
<CONTENT-TYPE>text/html</CONTENT-TYPE>
</RESULT>
</RESOURCE>
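To make the (Identifier, URL template) idea concrete, here is a minimal
Python sketch of how a client might turn such a template into a real URL.
This is not the GLU code itself: the function name expand_url_template is
hypothetical, the script value is a made-up placeholder, and
percent-encoding the parameter value is an assumption on my part.

```python
import re
from urllib.parse import quote


def expand_url_template(template: str, params: dict) -> str:
    """Replace each $N placeholder with the percent-encoded value of params["N"]."""

    def substitute(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value supplied for parameter ${name}")
        # Assumption: parameter values are percent-encoded before insertion.
        return quote(params[name], safe="")

    return re.sub(r"\$(\d+)", substitute, template)


# The URL template comes from the entry above; the script value is a placeholder.
TEMPLATE = "http://aladin.u-strasbg.fr/java/nph-aladin.pl?script=$1"
print(expand_url_template(TEMPLATE, {"1": "some script"}))
```

A missing parameter raises an error here rather than leaving "$1" in the
URL; a real client might prefer a default value instead.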
For REDUNDANT services (same query, same result: real mirror
sites), the GLU uses an INDIRECTION mechanism: a special GLU entry
groups a list of GLU entries, as shown below.
<RESOURCE ID="CDS/aladin/Aladin">
<NAME>Aladin</NAME>
<DESCRIPTION>
Aladin sky atlas (script parameters)
</DESCRIPTION>
<MIRRORS>
<MIRROR REF="CDS/aladin/Aladin.fr"/>
<MIRROR REF="CDS/aladin/Aladin.ca"/>
<MIRROR REF="CDS/aladin/Aladin.uk"/>
<MIRROR REF="CDS/aladin/Aladin.jp"/>
<MIRROR REF="CDS/aladin/Aladin.ru"/>
<MIRROR REF="CDS/aladin/Aladin.iucaa"/>
<MIRROR REF="CDS/aladin/Aladin.us"/>
</MIRRORS>
</RESOURCE>
A GLU user who specifically wants Aladin in France uses the ID
"CDS/aladin/Aladin.fr". A user who prefers to leave the choice to the
GLU system uses the ID "CDS/aladin/Aladin".
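The indirection can be pictured as a small lookup, sketched below in
Python. The dictionary layout, the resolve function and the "choose"
callback are all hypothetical; only the two Aladin IDs and the .fr URL
template come from the entries above, and the .ca URL is a placeholder.
By default the sketch simply takes the first mirror listed, which is an
assumption made to keep the example short.

```python
# Hypothetical in-memory dictionary: an entry holds either a URL template
# or a list of mirror IDs (the indirection).
ENTRIES = {
    "CDS/aladin/Aladin": {
        "mirrors": ["CDS/aladin/Aladin.fr", "CDS/aladin/Aladin.ca"],
    },
    "CDS/aladin/Aladin.fr": {
        "url": "http://aladin.u-strasbg.fr/java/nph-aladin.pl?script=$1",
    },
    "CDS/aladin/Aladin.ca": {
        "url": "http://aladin.example.org/ca?script=$1",  # placeholder URL
    },
}


def resolve(entry_id, entries, choose=lambda refs: refs[0], _seen=None):
    """Follow indirections until an entry with a URL template is reached."""
    seen = set() if _seen is None else _seen
    if entry_id in seen:
        raise ValueError(f"circular indirection at {entry_id}")
    seen.add(entry_id)
    entry = entries[entry_id]
    if "mirrors" in entry:
        # Generic ID: let the chooser pick one mirror, then resolve that.
        return resolve(choose(entry["mirrors"]), entries, choose, seen)
    return entry["url"]


print(resolve("CDS/aladin/Aladin.fr", ENTRIES))  # specific site
print(resolve("CDS/aladin/Aladin", ENTRIES))     # generic ID, chooser decides
```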
To explain how the GLU determines default services, I have
to explain that the GLU is a set of collaborating GLU daemons. Every
daemon is automatically synchronized with the others ("continuous"
harvesting). In the background, EACH daemon regularly tests each
redundant service: it calls the URL with a default query, checks whether
the result matches a specific regular expression (or at least whether
the HTTP result code is OK), and records in the GLU entry the time
required for the test. In this way we obtain a "metric" called
"availability". The mirror with the smallest value becomes the default.
<RESOURCE ID="CDS/aladin/Aladin">
<NAME>Aladin</NAME>
<DESCRIPTION>
Aladin sky atlas (script parameters)
</DESCRIPTION>
<MIRRORS>
<MIRROR REF="CDS/aladin/Aladin.fr" AVAILABILITY="1"/>
<MIRROR REF="CDS/aladin/Aladin.ca" AVAILABILITY="13"/>
<MIRROR REF="CDS/aladin/Aladin.uk" AVAILABILITY="11"/>
<MIRROR REF="CDS/aladin/Aladin.jp" AVAILABILITY="21"/>
<MIRROR REF="CDS/aladin/Aladin.ru" AVAILABILITY="11"/>
<MIRROR REF="CDS/aladin/Aladin.iucaa" AVAILABILITY="9999"/>
<MIRROR REF="CDS/aladin/Aladin.us" AVAILABILITY="12"/>
</MIRRORS>
</RESOURCE>
In this case a GLU user using the GLU daemon at Strasbourg will get the
fastest services for Strasbourg, and a GLU user using another GLU daemon
will get the fastest services for this other place.
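As an illustration of "the smallest will be the default", the following
Python sketch reads an entry like the one above and returns the REF with
the lowest AVAILABILITY. The function name default_mirror is hypothetical,
and treating a missing AVAILABILITY attribute as "worst possible" is my
assumption, not documented GLU behaviour.

```python
import xml.etree.ElementTree as ET

# Entry copied from the example above.
GLU_ENTRY = """<RESOURCE ID="CDS/aladin/Aladin">
<NAME>Aladin</NAME>
<MIRRORS>
<MIRROR REF="CDS/aladin/Aladin.fr" AVAILABILITY="1"/>
<MIRROR REF="CDS/aladin/Aladin.ca" AVAILABILITY="13"/>
<MIRROR REF="CDS/aladin/Aladin.uk" AVAILABILITY="11"/>
<MIRROR REF="CDS/aladin/Aladin.jp" AVAILABILITY="21"/>
<MIRROR REF="CDS/aladin/Aladin.ru" AVAILABILITY="11"/>
<MIRROR REF="CDS/aladin/Aladin.iucaa" AVAILABILITY="9999"/>
<MIRROR REF="CDS/aladin/Aladin.us" AVAILABILITY="12"/>
</MIRRORS>
</RESOURCE>"""


def availability(mirror) -> float:
    """Read the AVAILABILITY attribute; untested mirrors rank last (assumption)."""
    value = mirror.get("AVAILABILITY")
    return float(value) if value is not None else float("inf")


def default_mirror(resource_xml: str) -> str:
    """Return the REF of the mirror with the smallest availability value."""
    root = ET.fromstring(resource_xml)
    mirrors = root.findall("./MIRRORS/MIRROR")
    if not mirrors:
        raise ValueError("entry has no MIRROR elements")
    return min(mirrors, key=availability).get("REF")


print(default_mirror(GLU_ENTRY))
```

With the values measured from Strasbourg in the example, this selects
"CDS/aladin/Aladin.fr"; a daemon elsewhere would hold different values
and so a different default.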
The last issue is to determine the nearest GLU daemon for a user. To do
that, we treat the GLU daemon as a redundant service itself, so
the GLU sites are themselves described in the GLU dictionary.
The client GLU library (for example, the Java GLU client library
inside the Aladin tool) automatically calls a default GLU site, gets
the GLU daemon entries, and tests each of them to find the nearest GLU
daemon. After this initialisation step, the GLU daemon used by the user
is the nearest one.
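The test-and-pick step described above can be sketched as follows. This
is an illustrative Python sketch, not the Java client: the names probe,
nearest_daemon and http_fetch are hypothetical, the 10-second timeout is
arbitrary, and scoring a failed test as infinity is an assumption. The
fetch function and the clock are passed in so that the logic can be
tried without touching the network.

```python
import re
import time
import urllib.request
from typing import Callable, Dict, Optional, Tuple

Fetch = Callable[[str], Tuple[int, str]]


def http_fetch(url: str) -> Tuple[int, str]:
    """Fetch a URL and return (HTTP status, body text)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


def probe(url: str, fetch: Fetch = http_fetch, expected: Optional[str] = None,
          clock: Callable[[], float] = time.monotonic) -> float:
    """Time one test query; return infinity if the test fails."""
    start = clock()
    try:
        status, body = fetch(url)
    except OSError:  # urllib's URLError and HTTPError are OSError subclasses
        return float("inf")
    elapsed = clock() - start
    if status != 200:
        return float("inf")
    # If a regular expression is given, the result must match it.
    if expected is not None and re.search(expected, body) is None:
        return float("inf")
    return elapsed


def nearest_daemon(daemons: Dict[str, str], fetch: Fetch = http_fetch,
                   clock: Callable[[], float] = time.monotonic) -> str:
    """Probe every daemon URL and return the ID with the shortest test time."""
    timings = {daemon_id: probe(url, fetch, clock=clock)
               for daemon_id, url in daemons.items()}
    return min(timings, key=timings.get)
```

If every probe fails, this sketch falls back to the first daemon listed;
a real client would probably keep its default GLU site in that case.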
Based on our experience, we have added these three refinements to the
GLU system:
- Only GLU entries which have been used recently are tested, to avoid
useless tests.
- The test delay can be specified in the GLU entries (some services have
to react faster than others).
- Local sites get a "bonus": the GLU daemon at
Strasbourg favors the Strasbourg services (same DNS domain) to
avoid transient "service swapping" when the local machine is
temporarily slightly overloaded.
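The three points above could look roughly like this in Python. Everything
here is a hedged sketch: the Mirror fields, the one-hour "recently used"
window, the size of the bonus and the way it is subtracted from the score
are assumptions for illustration, and comparing only the last two DNS
labels is a simplification. The host names under u-strasbg.fr appear in
this message; aladin.example.org is a placeholder.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mirror:
    ref: str              # GLU identifier of the mirror entry
    host: str             # host name serving the mirror
    availability: float   # last measured test time (smaller is better)
    last_used: float      # time of last use, in seconds
    last_tested: float    # time of last test, in seconds
    test_interval: float  # per-entry test delay, in seconds


def dns_domain(host: str) -> str:
    """Simplified domain: the last two labels of the host name."""
    return ".".join(host.split(".")[-2:])


def due_for_test(mirror: Mirror, now: float, recent_window: float = 3600.0) -> bool:
    """Test only entries used recently whose own test delay has elapsed."""
    used_recently = now - mirror.last_used <= recent_window
    interval_elapsed = now - mirror.last_tested >= mirror.test_interval
    return used_recently and interval_elapsed


def pick_default(mirrors: List[Mirror], daemon_host: str, local_bonus: float = 5.0) -> str:
    """Pick the smallest score, giving mirrors in the daemon's domain a bonus."""

    def score(mirror: Mirror) -> float:
        is_local = dns_domain(mirror.host) == dns_domain(daemon_host)
        return mirror.availability - local_bonus if is_local else mirror.availability

    return min(mirrors, key=score).ref


local = Mirror("CDS/aladin/Aladin.fr", "aladin.u-strasbg.fr", 12.0, 1000.0, 900.0, 60.0)
remote = Mirror("CDS/aladin/Aladin.xx", "aladin.example.org", 11.0, 0.0, 900.0, 60.0)
# The local site stays the default although it is momentarily a bit slower.
print(pick_default([local, remote], "simbad.u-strasbg.fr"))
```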
At present, there is no GLU standard to describe SIMILAR services (not
necessarily the same query syntax, nor the same result format, nor even
the same result: typically USNO-B from VizieR compared to another USNO-B
site -> different default columns...). We can imagine extending GLU
indirections to handle similar services, but up to now there has been no
need to implement this kind of thing. The function that determines the
"default" site would also become more complex, and certainly user
dependent (I want a USNO-B site with a result in VOTable, with UCDs,
available via SOAP with query so and so...). We would also have to extend
the GLU mechanism to translate the query automatically into the required
syntax if the SIMILAR services do not have the same query
syntax/protocol/parameters (welcome to the SOAP world).
From our GLU experience, I summarize:
1) Keeping both the specific location (e.g. Simbad at
Harvard) and the generic service entry (e.g. Simbad) visible is a very
flexible approach.
2) Real mirror sites are relatively easier to manage than "similar"
sites. Fortunately, there are generally fewer similar sites than real
mirror sites (ADS, VizieR....)
3) Determining a "default" site takes time (about 2 minutes for
VizieR). A GLU-like approach avoids making the user wait during this time.
4) About IDs: as shown in the examples above, GLU uses a hierarchical name:
Institute/Service/Id. It is a classical approach to satisfy the two main
constraints: 1) uniqueness, 2) independence. However, sometimes the 3
levels have been too many (an institute having just one database
doesn't feel the need for a second and a third level) or not enough. It
could be better to keep the hierarchical structure but without this fixed
depth (as in DNS domain names, for example).
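To illustrate point 4, here is a small Python sketch of hierarchical
identifiers without a fixed depth. It is only a suggestion of what such
IDs could look like: the functions parse_id and is_under are hypothetical,
and the one-level and four-level IDs are invented examples.

```python
from typing import Tuple


def parse_id(identifier: str) -> Tuple[str, ...]:
    """Split a slash-separated identifier into its levels, whatever the depth."""
    levels = tuple(identifier.split("/"))
    if not all(levels):
        raise ValueError(f"empty level in identifier {identifier!r}")
    return levels


def is_under(identifier: str, prefix: str) -> bool:
    """True if identifier lies in the subtree named by prefix (or equals it)."""
    levels, prefix_levels = parse_id(identifier), parse_id(prefix)
    return levels[:len(prefix_levels)] == prefix_levels


print(parse_id("CDS/aladin/Aladin.fr"))       # the current three levels
print(parse_id("SmallInstitute"))             # invented: one level is enough
print(parse_id("CDS/VizieR/catalogs/USNO-B")) # invented: four levels
print(is_under("CDS/aladin/Aladin.fr", "CDS/aladin"))
```

As with DNS, uniqueness only has to be guaranteed by whoever owns each
prefix, which preserves the independence of the institutes.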
Finally, I think that the next main problem will be to handle the
granularity of the definition levels. For example, how should we deal
with VizieR (3000 catalogs behind one registry entry) or ADS (a lot of
journals) compared to one individual catalog server or one journal
server? In our experience, the major databases are poorly visible
compared with the individual resources.
Describing each resource of a major database individually
in the registry (for example every VizieR catalog) can be a first
solution, but it implies loading a lot of the major databases' metadata
into the registry system. The registry system would then
have to manage these "hierarchical" metadata (catalog classes, for
example...). And what is the granularity limit? (A bibliographical
server, a journal, or an abstract?)
If you want to play with GLU:
1) One GLU browser site:
http://simbad.u-strasbg.fr/glu/cgi-bin/GluBrowser.pl
To retrieve the above examples, click on the "aladin" service in
the left frame (6th item) and click on the 2nd "resource" (called
"actions" for historical reasons) in the right frame.
2) The GLU client in Java:
http://simbad.u-strasbg.fr/registry/Glu.java
3) Some other GLU documentation in the VO context:
http://simbad.u-strasbg.fr/registry/registry.htx
Regards,
Pierre Fernique