Harvesting

Tue Sep 9 05:10:20 PDT 2003

Robert Hanisch wrote:
> Hi Tony et al.  Yes, I think the next major thing is the
> harvesting/integration/maintenance of the registries, and we are far enough
> along on the content and structure (a few more tweaks being required) to
> start thinking about the next steps.
> 
> We need to come to some -- perhaps temporary -- agreement on the intertwined
> issues of identifiers and mirror services.  I would like to ask the CDS
> folks who developed GLU (Pierre Fernique, especially) to comment on this
> issue, as they have a number of years of experience already in managing
> mirror services within the GLU database.  Pierre:  could you remind us how
> mirror or replica services are denoted in GLU, and how your CDS services
> that utilize GLU decide which of a number of replica services to utilize?
> 

Hi all,

It is difficult to explain GLU mirror management in a few words, so 
this is a long letter. Skip to the end of this message for my 
conclusions if you do not want the details.

Regards
Pierre

---

In short, GLU gives you the proper URL to access a "resource". A GLU 
entry is essentially a pair (Identifier, URL-template); at this level 
we only describe a way to access a resource. The resource can be a 
simple HTML page or an access point to a database, in which case we 
prefer to call it a "service". Optionally, one can add further 
information (description, parameter specifications, type of result), 
as in the example below:

    <RESOURCE ID="CDS/aladin/Aladin.fr">
       <NAME>Aladin.fr</NAME>
       <DESCRIPTION>
          Aladin at CDS (Strasbourg France)
       </DESCRIPTION>
       <QUERY>
          <URL>
             http://aladin.u-strasbg.fr/java/nph-aladin.pl?script=$1
          </URL>
          <VAR NAME="1">
             <DESCRIPTION>Script commands</DESCRIPTION>
          </VAR>
       </QUERY>
       <RESULT>
          <CONTENT-TYPE>text/html</CONTENT-TYPE>
       </RESULT>
    </RESOURCE>
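
As a minimal sketch of what "URL-template" means here, the following 
Python fragment fills the $1 placeholder of the entry above. The 
function name expand_template and the choice to URL-encode the 
parameter value are assumptions made for illustration; they are not 
taken from the GLU implementation.

```python
import re
from urllib.parse import quote


def expand_template(template, params):
    """Replace each $N placeholder with the URL-encoded value of parameter N.

    Hypothetical helper: GLU's real substitution rules may differ.
    """
    def substitute(match):
        name = match.group(1)          # the digits after '$'
        return quote(params[name], safe="")

    return re.sub(r"\$(\d+)", substitute, template)


url = expand_template(
    "http://aladin.u-strasbg.fr/java/nph-aladin.pl?script=$1",
    {"1": "get aladin M31"},           # example script value, made up
)
print(url)
```

A missing parameter raises KeyError in this sketch, which is one simple 
way to make an incomplete query visible early.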

For REDUNDANT services (same query, same result: real mirror sites), 
GLU uses a mechanism of INDIRECTION: a special GLU entry groups a list 
of GLU entries, as shown below.

    <RESOURCE ID="CDS/aladin/Aladin">
       <NAME>Aladin</NAME>
       <DESCRIPTION>
          Aladin sky atlas (script parameters)
       </DESCRIPTION>
       <MIRRORS>
          <MIRROR REF="CDS/aladin/Aladin.fr"/>
          <MIRROR REF="CDS/aladin/Aladin.ca"/>
          <MIRROR REF="CDS/aladin/Aladin.uk"/>
          <MIRROR REF="CDS/aladin/Aladin.jp"/>
          <MIRROR REF="CDS/aladin/Aladin.ru"/>
          <MIRROR REF="CDS/aladin/Aladin.iucaa"/>
          <MIRROR REF="CDS/aladin/Aladin.us"/>
       </MIRRORS>
    </RESOURCE>

A GLU user who wants to address Aladin in France specifically uses the 
ID "CDS/aladin/Aladin.fr". A user who prefers to leave the choice to 
the GLU system uses the ID "CDS/aladin/Aladin".

To explain how GLU determines default services, I must first say that 
GLU is a set of collaborating GLU daemons, each automatically 
synchronized with the others ("continuous" harvesting). In the 
background, EACH daemon regularly tests each redundant service: it 
calls the URL with a default query, checks that the result matches a 
specific regular expression (or at least that the HTTP result code is 
OK), and records in the GLU entry the time the test took. In this way 
we obtain a "metric" called "availability". The mirror with the 
smallest value becomes the default.

    <RESOURCE ID="CDS/aladin/Aladin">
       <NAME>Aladin</NAME>
       <DESCRIPTION>
          Aladin sky atlas (script parameters)
       </DESCRIPTION>
       <MIRRORS>
          <MIRROR REF="CDS/aladin/Aladin.fr" AVAILABILITY="1"/>
          <MIRROR REF="CDS/aladin/Aladin.ca" AVAILABILITY="13"/>
          <MIRROR REF="CDS/aladin/Aladin.uk" AVAILABILITY="11"/>
          <MIRROR REF="CDS/aladin/Aladin.jp" AVAILABILITY="21"/>
          <MIRROR REF="CDS/aladin/Aladin.ru" AVAILABILITY="11"/>
          <MIRROR REF="CDS/aladin/Aladin.iucaa" AVAILABILITY="9999"/>
          <MIRROR REF="CDS/aladin/Aladin.us" AVAILABILITY="12"/>
       </MIRRORS>
    </RESOURCE>

In this case a GLU user using the GLU daemon at Strasbourg will get the 
fastest services for Strasbourg, and a GLU user using another GLU daemon 
will get the fastest services for this other place.
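
The selection rule (smallest availability wins) can be sketched in a 
few lines of Python over the entry shown above. The function name 
default_mirror is hypothetical, and the sketch assumes that every 
MIRROR element carries an integer AVAILABILITY attribute.

```python
import xml.etree.ElementTree as ET

GLU_ENTRY = """
<RESOURCE ID="CDS/aladin/Aladin">
   <MIRRORS>
      <MIRROR REF="CDS/aladin/Aladin.fr" AVAILABILITY="1"/>
      <MIRROR REF="CDS/aladin/Aladin.ca" AVAILABILITY="13"/>
      <MIRROR REF="CDS/aladin/Aladin.uk" AVAILABILITY="11"/>
      <MIRROR REF="CDS/aladin/Aladin.jp" AVAILABILITY="21"/>
      <MIRROR REF="CDS/aladin/Aladin.ru" AVAILABILITY="11"/>
      <MIRROR REF="CDS/aladin/Aladin.iucaa" AVAILABILITY="9999"/>
      <MIRROR REF="CDS/aladin/Aladin.us" AVAILABILITY="12"/>
   </MIRRORS>
</RESOURCE>
"""


def default_mirror(resource_xml):
    """Return the ID to use: the mirror with the smallest availability,
    or the entry's own ID when it lists no mirrors (hypothetical helper)."""
    resource = ET.fromstring(resource_xml.strip())
    mirrors = resource.findall("./MIRRORS/MIRROR")
    if not mirrors:
        return resource.get("ID")
    best = min(mirrors, key=lambda m: int(m.get("AVAILABILITY")))
    return best.get("REF")


chosen = default_mirror(GLU_ENTRY)
print(chosen)
```

Because each daemon stores its own measurements, the same function run 
against another daemon's copy of the entry may return a different 
mirror.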

The last issue is to determine the nearest GLU daemon for a user. To do 
that, we treat the GLU daemon as a redundant service itself, so the 
GLU sites are themselves described in the GLU dictionary. The client 
GLU library (for example, the Java GLU client library inside the 
Aladin tool) automatically calls a default GLU site, retrieves the GLU 
daemon entries, and tests each of them to find the nearest GLU daemon. 
After this initialisation step, the user's requests go to the nearest 
daemon.

From our experience, we have added these three modifications to the 
GLU system:

- Only GLU entries that have been used recently are tested, to avoid 
useless tests.
- The test delay can be specified in the GLU entries (some services 
have to react faster than others).
- Local sites get a "bonus": the GLU daemon at Strasbourg favours the 
Strasbourg services (same DNS domain), to avoid transient "service 
swapping" when the local machine is temporarily slightly overloaded.
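
The background test described earlier and the local bonus can be 
sketched as follows in Python. Everything named here is an assumption 
for illustration: the functions probe and score, the injected fetch 
callable, the bonus factor of 0.5, and the reading of 9999 in the 
example above as "test failed". The unit of the availability value is 
also not specified in this message.

```python
import re
import time
from urllib.parse import urlparse

UNAVAILABLE = 9999  # assumed meaning of the 9999 seen in the example entry


def probe(url, fetch, expected_pattern, clock=time.monotonic):
    """Time one test query; return elapsed time, or UNAVAILABLE on failure.

    fetch is any callable taking a URL and returning the response body;
    it is injected so that the sketch does not depend on the network.
    """
    start = clock()
    try:
        body = fetch(url)
    except OSError:
        return UNAVAILABLE
    elapsed = clock() - start
    if re.search(expected_pattern, body) is None:
        return UNAVAILABLE
    return elapsed


def score(url, elapsed, local_domain, bonus=0.5):
    """Favour services in the daemon's own DNS domain (hypothetical factor)."""
    host = urlparse(url).hostname or ""
    if elapsed != UNAVAILABLE and host.endswith(local_domain):
        return elapsed * bonus
    return elapsed
```

A daemon would call probe only for entries used recently, at the 
interval given in each entry, and store score(...) as the availability.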

At present there is no GLU standard for describing SIMILAR services 
(not necessarily the same query syntax, nor the same result format, or 
even the same result; typically USNO-B from VizieR compared with 
another USNO-B site: different default columns...). One could imagine 
extending GLU indirections to handle similar services, but so far 
there has been no need to implement this. The function that determines 
the "default" site would also become more complex, and certainly user 
dependent (I want a USNO-B site with results in VOTable, with UCDs, 
available by SOAP, with such and such a query...). We would also have 
to extend the GLU mechanism to translate the query automatically into 
the required syntax when the SIMILAR services do not share the same 
query syntax/protocol/parameters (welcome to the SOAP world).

From our GLU experience, I summarize:
1) Keeping both the specific location (e.g. Simbad at Harvard) and the 
generic service entry (e.g. Simbad) visible is a very flexible 
approach.
2) Real mirror sites are relatively easier to manage than "similar" 
sites. Fortunately, there are generally fewer similar sites than real 
mirror sites (ADS, VizieR...).
3) Determining a "default" site takes time (about 2 minutes for 
VizieR). A GLU-like approach avoids making the user wait during this 
time.
4) About IDs: as shown in the examples above, GLU uses a hierarchical 
name: Institute/Service/Id. It is a classical approach to the two main 
constraints: 1) uniqueness, 2) independence. However, sometimes three 
levels have been too many (an institute with just one database does 
not feel the need for a second and third level) or not enough. It 
could be better to keep the hierarchical structure but without a 
fixed depth (as in DNS domain names, for example).
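
To illustrate point 4, here is a small Python sketch of hierarchical 
identifiers without a fixed depth. The helper names split_id and 
is_within are hypothetical, and so is the rule that empty path 
components are rejected; GLU itself uses exactly three levels.

```python
def split_id(identifier):
    """Split a '/'-separated identifier into its levels, at any depth."""
    parts = identifier.split("/")
    if any(part == "" for part in parts):
        raise ValueError("empty component in identifier: %r" % identifier)
    return parts


def is_within(identifier, ancestor):
    """True if identifier lies at or below ancestor in the hierarchy."""
    child, parent = split_id(identifier), split_id(ancestor)
    return child[:len(parent)] == parent


print(split_id("CDS/aladin/Aladin.fr"))        # three levels, as in GLU today
print(split_id("CDS"))                         # a one-level ID would also be legal
print(is_within("CDS/aladin/Aladin.fr", "CDS/aladin"))
```

Comparing whole components rather than string prefixes keeps 
"CDS/al" from being mistaken for a parent of "CDS/aladin".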

Finally, I think the next main problem will be the granularity of 
registry definitions. For example, how do we deal with VizieR (3000 
catalogs behind one registry entry) or ADS (a lot of journals) 
compared with one individual catalog server or one journal server? In 
our experience, the major databases do not stand out against the 
individual resources.
Describing each resource of a major database individually in the 
registry (for example every VizieR catalog) is one possible solution, 
but it implies downloading a lot of the major databases' metadata into 
the registry system. The registry system would then have to manage 
these "hierarchical" metadata (catalog classes, for example...). And 
what is the granularity limit? (A bibliographical server, a journal, 
or an abstract?)

If you want to play with GLU:
1) A GLU browser site:
	http://simbad.u-strasbg.fr/glu/cgi-bin/GluBrowser.pl
	To retrieve the above examples, click on the "aladin" service in
	the left frame (6th item) and click on the 2nd "resource" (called
	historically "actions") in the right frame.
2) The GLU client in Java:
	http://simbad.u-strasbg.fr/registry/Glu.java
3) Some other GLU documentation in the VO context:
	http://simbad.u-strasbg.fr/registry/registry.htx

Regards,
Pierre Fernique