How should the Registry handle mirrors?

Thu Jun 24 06:28:22 PDT 2004

In general, the problem of mirrors has been sidelined as 'too difficult for
now', but it is one of the issues to address after the 2005 demo.

There is a facility to cope with mirrors using the Relationship element with
a relationshipType of 'mirror-of'.

The issue is what constitutes a mirror. If a dataset is a bit-for-bit copy,
then probably yes. But what if one field in one record is changed - is it
still a mirror? And what if only records of a certain type are mirrored? Is
it still a mirror?

What might be a mirror in one context may not be in another. 

Other issues include how almost-mirrors are registered. If one dataset is
'derived-from' another, do you still include both in a particular query? And
if you don't want to query duplicate datasets, how do you say so, and how
does the software identify derivatives when we cannot be sure that a
derivative is marked as such? And what about derivatives of derivatives...?
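
To make the chained-relationship question concrete, here is a minimal Python
sketch of how software might collapse mirrors and derivatives into groups by
following relationship links back to an origin. It is only an illustration:
the record layout, the ivo://example/... identifiers and the function names
are assumptions made up for this example, not part of any Registry schema,
and it can only follow links that have actually been registered.

```python
from collections import defaultdict

# Hypothetical records: identifier -> list of (relationshipType, related identifier).
# The identifiers and this layout are invented for illustration only.
RELATIONSHIPS = {
    "ivo://example/catalogue": [],
    "ivo://example/catalogue-mirror": [("mirror-of", "ivo://example/catalogue")],
    "ivo://example/subset": [("derived-from", "ivo://example/catalogue-mirror")],
    "ivo://example/other": [],
}


def find_origin(identifier, relationships, follow=("mirror-of", "derived-from")):
    """Follow the chosen relationship types until no further link exists."""
    seen = set()
    current = identifier
    while True:
        if current in seen:
            # A loop of relationships has no well-defined origin.
            raise ValueError("relationship cycle at " + current)
        seen.add(current)
        targets = [target for kind, target in relationships.get(current, [])
                   if kind in follow]
        if not targets:
            return current
        # Assumption: at most one followed link per resource; take the first.
        current = targets[0]


def group_by_origin(relationships, follow=("mirror-of", "derived-from")):
    """Group resources that trace back to the same origin."""
    groups = defaultdict(list)
    for identifier in relationships:
        groups[find_origin(identifier, relationships, follow)].append(identifier)
    return dict(groups)


if __name__ == "__main__":
    # Loose context: derivatives count as duplicates of their origin.
    print(group_by_origin(RELATIONSHIPS))
    # Strict context: only resources marked 'mirror-of' are treated as duplicates.
    print(group_by_origin(RELATIONSHIPS, follow=("mirror-of",)))
```

The follow argument is one way a query could say which relationship types
count as duplicates in its context; an unmarked derivative would still end
up in a group of its own.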

The reason we've not worried too much about it up to now is that we don't
have the infrastructure to determine the best mirror to use.

Cheers,
Tony. 

> -----Original Message-----
> From: owner-registry at eso.org [mailto:owner-registry at eso.org] 
> On Behalf Of Clive Page
> Sent: 24 June 2004 12:27
> To: registry at ivoa.net
> Subject: How should the Registry handle mirrors?
> 
> Many important astronomical data resources are duplicated, 
> for example there are 9 copies of the VizieR service, and 
> there are two or more copies of quite a number of other 
> important data resources.  It seems to me quite important 
> that the VO should be able to handle them sensibly.  This 
> obviously involves the Registry, yet I could not find any 
> reference to mirrors in any registry-related documents that I scanned.
> 
> My first thoughts on functional requirements are set out 
> here. The two principal benefits that the user gets from 
> having mirrored copies are (a) increased availability in the 
> face of services and links which are not guaranteed to be 
> on-line 24/7, and (b) increased performance by choosing the 
> copy with the best network links or smallest existing 
> workload.  I can see, however, that achieving these will be 
> difficult in practice.
> 
> I think that the functionality of the Registry will need to 
> depend on whether it is being queried by a human or by a machine.
> 
> If a human sends a resource discovery query to the registry 
> which finds that two or more identical copies exist of the 
> required resource, ideally the registry should tell the user 
> the best one to use.  Doing this in practice will be very 
> difficult, as it will depend on what subsequent operations 
> the user plans to carry out.  If it is a trivial query 
> returning a large volume of data, then the network link speed 
> should be given a large weight in making the decision, 
> whereas if they want to perform a substantial data mining 
> query, then the current workload of the various servers may 
> be the determining factor.  My guess is that most human users 
> would like to know of the existence of all available mirrors 
> for the resource, so they can take the decision as to which 
> one to use.
> So at the most basic level the Registry should simply list 
> them all. This is, after all, what generally happens at 
> present, e.g. if you access the VizieR home page.
> 
> The VO will provide added value if it can give information on 
> which of the mirrors is currently on-line (e.g. by doing a 
> few pings before returning results), and even more value if 
> it can indicate the nearest in network terms (e.g. by doing a 
> few traceroutes), but this is clearly something that we don't 
> need to provide initially.
> 
> If the query comes from a machine, e.g. as part of a complex 
> workflow, the situation is quite different.  The Registry 
> *has* to choose one copy (or else it will just put off the 
> decision to the workflow engine, which doesn't solve the 
> problem, it just passes the buck).  In this situation it is 
> highly desirable that it chooses a copy of the resource which 
> is actually working, so here some pings will really be 
> needed.  Ideally the Registry might try to maintain an 
> up-to-date table of available resources, but this is surely 
> more advanced functionality than we can contemplate at present.
> 
> I wonder if there is a third case in which the user wants to 
> compare mirrors by sending the same query to two (or more) of 
> them?  This might not be something the system should 
> encourage, but it might be a nice function for data centre 
> administrators to have available, so as to check on the 
> validity of their own mirroring facilities.
> 
> 
> Now a question to Bob and the drafters of the Resource and Service
> Metadata: how do mirrors actually get registered and 
> identified?  Is it sufficient for the Title element to be 
> identical (so that all Vizier clones are simply called "Vizier", 
> or "VizieR")?  The Identifier (URI) will obviously be 
> different, but what about the ShortName, and the Publisher?
> 
> Given the importance of mirrors in the astronomical data 
> provision, it would be nice if the documentation could give 
> clear guidance on these matters.  We are already starting to 
> see prototype registries being set up, and mistakes at this 
> stage could be hard to unwind later on.
> 
> Apologies all round if these issues have already been 
> explored in the mailing lists, and I've just failed to notice them.
> 
> --
> Clive Page
> Dept of Physics & Astronomy,
> University of Leicester,
> Leicester, LE1 7RH,  U.K.
> 
>