How should the Registry handle mirrors?

Pierre Fernique fernique at simbad.u-strasbg.fr
Thu Jun 24 07:00:03 PDT 2004


Clive Page wrote:
> Many important astronomical data resources are duplicated, for example
> there are 9 copies of the VizieR service, and there are two or more copies
> of quite a number of other important data resources.  It seems to me quite
> important that the VO should be able to handle them sensibly.  This
> obviously involves the Registry, yet I could not find any reference to
> mirrors in any registry-related documents that I scanned.
> 
> My first thoughts on functional requirements are set out here. The two
> principal benefits that the user gets from having mirrored copies are (a)
> increased availability in the face of services and links which are not
> guaranteed to be on-line 24/7, and (b) increased performance by choosing
> the copy with the best network links or smallest existing workload.  I can
> see, however, that achieving these will be difficult in practice.
> 
> I think that the functionality of the Registry will need to depend on
> whether it is being queried by a human or by a machine.
> 
> If a human sends a resource discovery query to the registry which finds
> that two or more identical copies exist of the required resource, ideally
> the registry should tell the user the best one to use.  Doing this in
> practice will be very difficult, as it will depend on what subsequent
> operations the user plans to carry out.  If it is a trivial query
> returning a large volume of data, then the network link speed should be
> given a large weight in making the decision, whereas if they want to
> perform a substantial data mining query, then the current workload of the
> various servers may be the determining factor.  My guess is that most
> human users would like to know of the existence of all available mirrors
> for the resource, so they can take the decision as to which one to use.
> So at the most basic level the Registry should simply list them all. This
> is, after all, what generally happens at present, e.g. if you access the
> Vizier home page.
> 
> The VO will provide added value if it can give information on which of the
> mirrors is currently on-line (e.g. by doing a few pings before returning
> results), and even more value if it can indicate the nearest in network
> terms (e.g. by doing a few traceroutes), but this is clearly something
> that we don't need to provide initially.
> 
> If the query comes from a machine, e.g. as part of a complex workflow, the
> situation is quite different.  The Registry *has* to choose one copy (or
> else it will just put off the decision to the workflow engine, which
> doesn't solve the problem, it just passes the buck).  In this situation it
> is highly desirable that it chooses a copy of the resource which is
> actually working, so here some pings will really be needed.  Ideally the
> Registry might try to maintain an up-to-date table of available resources,
> but this is surely more advanced functionality than we can contemplate at
> present.
> 
> I wonder if there is a third case in which the user wants to compare
> mirrors by sending the same query to two (or more) of them?  This might
> not be something the system should encourage, but it might be a nice
> function for data centre administrators to have available, so as to check
> on the validity of their own mirroring facilities.
> 
> 
> Now a question to Bob and the drafters of the Resource and Service
> Metadata: how do mirrors actually get registered and identified?  Is it
> sufficient for the Title element to be identical (so all Vizier clones are
> simply called "Vizier" (or "VizieR"))?  The Identifier (URI) will
> obviously be different, but what about the ShortName, and the Publisher?
> 
> Given the importance of mirrors in the astronomical data provision, it
> would be nice if the documentation could give clear guidance on these
> matters.  We are already starting to see prototype registries being set
> up, and mistakes at this stage could be hard to unwind later on.
> 
> Apologies all round if these issues have already been explored in the
> mailing lists, and I've just failed to notice them.
> 


Dear all,

I share Clive Page's questions about mirror sites.
This was a major challenge when we deployed the GLU system, and I imagine 
that the VO registry will run into the same reality.

Our experience has shown us that a simple ping (or a simple HTTP status 
code) is generally not enough to select a good mirror site. In fact, a 
service fault (disk full, bad httpd configuration, ...) occurs much more 
often than a general network failure. Also, the ICMP protocol (ping) is 
not available everywhere. So we needed a more sophisticated test.
In practice, we use regular expressions to match the HTML returned by 
test queries. When the expression matches, we assume that the service is 
working, and the expressions are meant to keep matching across possible 
new releases of the resources.

Each GLU registry site records the time taken by each test in 
order to determine the "best" site.
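
As a rough illustration only (this is not the GLU code), the content-based 
test and the timing comparison could be sketched as below. The mirror names, 
URLs, the regular expression and the fake_fetch helper are all hypothetical 
and stand in for a real HTTP client and a real test query.

```python
import re
import time
from typing import Callable, Dict, Optional


def check_mirror(fetch: Callable[[str], str], url: str, pattern: str) -> Optional[float]:
    """Return the elapsed seconds if the body matches `pattern`, else None."""
    start = time.monotonic()
    try:
        body = fetch(url)
    except Exception:
        # Network failure, refused connection, timeout, ...
        return None
    elapsed = time.monotonic() - start
    # A reply alone is not enough: the content must look like a real result.
    return elapsed if re.search(pattern, body) else None


def best_mirror(fetch: Callable[[str], str], test_urls: Dict[str, str],
                pattern: str) -> Optional[str]:
    """Return the name of the fastest mirror that passed the test, or None."""
    timings = {}
    for name, url in test_urls.items():
        elapsed = check_mirror(fetch, url, pattern)
        if elapsed is not None:
            timings[name] = elapsed
    return min(timings, key=timings.get) if timings else None


# Hypothetical mirrors and test pattern, for illustration only.
TEST_URLS = {
    "mirror-a": "http://mirror-a.example.org/test",
    "mirror-b": "http://mirror-b.example.org/test",
    "mirror-c": "http://mirror-c.example.org/test",
}
PATTERN = r"<title>\s*VizieR"

FAKE_PAGES = {
    # Answers normally at the HTTP level, but the service itself is broken.
    TEST_URLS["mirror-a"]: "<html><body>Error: disk full</body></html>",
    TEST_URLS["mirror-b"]: "<html><title>VizieR</title><body>ok</body></html>",
}


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP GET; mirror-c is unreachable."""
    if url not in FAKE_PAGES:
        raise ConnectionError("unreachable: " + url)
    return FAKE_PAGES[url]


print(best_mirror(fake_fetch, TEST_URLS, PATTERN))
```

Here mirror-a would pass a ping or an HTTP status check but fails the 
content test, which is the situation described above.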

However, this can add up to a lot of tests:
Nb_of_mirrored_resources * Nb_of_tests_per_day * Nb_of_registry_sites

We adopted three measures:
        - Each GLU site tests only the resources that were used
          recently (in the last 15 days).
        - The GLU metadata structure allows us to "factorize" the tests.
          For example, there is only one test per VizieR site even if
          a thousand resources are concerned. In practice, a catalog
          resource has a URL with a prefix. This prefix depends on
          another registry resource: "VizieRPrefix". Only this
          resource (with its mirror sites) is tested.
        - The test frequency is set by the resource manager
          (1/hour, 1/day, 1/month...) for each resource, and each GLU
          registry manager can set an upper and a lower limit
          for these tests.

The main problem that we are encountering is how to characterize "similar" 
resources (as opposed to mirrored resources). For example, there are 
several DSS servers. Some of them have only the DSS1 surveys, others do 
not have all the DSS2 colors. Some of them can provide cutouts...
In practice, these "similar" resources introduce a lot of confusion 
in a registry database.

I know that we already discussed mirror sites several months ago, 
but I'm not sure that the current registry specifications really 
handle this reality.

Regards,
Pierre Fernique




