RWP04: Registry Replication

Fri Apr 25 09:38:56 PDT 2003

I've looked over the exchange between Keith and Will about the registry
contents. 

If we go with Will's idea of every registry holding all information about
every VO resource then I think we need a second tier of registry - one which
only stores limited information, submits to harvesting or push-updates of
some kind but does not implement the query interface. These could be set up
by data centres to maintain their own list of resources which are then
replicated onto the main grid of full registries.

Personally, though, I prefer the approach of allowing intermediate
registries, those which might only harvest from a limited number of groups.
Such registries could offer fast responses to queries from a community which
was primarily interested in certain types of query. But such registries
would also have to cope with queries beyond their answering capability. 

What I would propose then is that the query have a structure like:

<query scope={"all","target","this"} expiry="date time">
  <originator type={"client","registry"}>
    ...
  </originator>
  <target>
    <registryID>
      ...
    </registryID>
    <registryID>
      ...
    </registryID>
    ...
  </target>
  <queryID>
    ...
  </queryID>
  ...actual query...
</query>

(ignore the actual tags - I'm not pre-empting the RegQL discussion)
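
To make the structure a little more concrete, here is a rough Python sketch
of the same fields with a basic sanity check. Every name in it
(RegistryQuery, originator_address, problems) is made up for illustration,
and the rule that a queryID is required for anything beyond scope "this" is
my reading of the proposal rather than anything agreed.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

VALID_SCOPES = {"all", "target", "this"}
VALID_ORIGINATORS = {"client", "registry"}


@dataclass
class RegistryQuery:
    body: str                                  # the actual query (RegQL undecided)
    scope: str = "this"
    expiry: Optional[datetime] = None          # cutoff for redistributed queries
    originator_type: Optional[str] = None      # "client" or "registry"
    originator_address: Optional[str] = None   # hypothetical: where answers go
    targets: List[str] = field(default_factory=list)  # registry IDs
    query_id: Optional[str] = None             # assigned by the client

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means acceptable."""
        found = []
        if self.scope not in VALID_SCOPES:
            found.append(f"unknown scope {self.scope!r}")
        if (self.originator_type is not None
                and self.originator_type not in VALID_ORIGINATORS):
            found.append(f"unknown originator type {self.originator_type!r}")
        if self.scope == "target" and not self.targets:
            found.append("scope 'target' needs at least one registryID")
        if self.scope != "this" and self.query_id is None:
            # Assumption: queryID is only optional on a local query.
            found.append("queryID is required beyond local scope")
        return found
```

A plain local query such as RegistryQuery(body="...") passes with no
problems, while a 'target' query with an empty target list is rejected.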

So a client (usually a portal) will submit a registry query and, in most
cases, the registry will satisfy it itself (scope="this"). For such a local
query the expiry attribute and the originator and target tags are optional,
and are ignored if they are included.

The queryID tag is an ID that the client has assigned and is not mandatory
on a local query.

If the client requires deeper scope, then the query becomes asynchronous.
The registry satisfies what it can of the query from its own list of
resources and returns this to the client. It then repackages the query and
redistributes it.

The redistributed query makes use of the expiry attribute to indicate a
cutoff date/time beyond which any receiving registries can ignore the query
and delete it. If the client has not supplied this, the registry will add it
before distribution. The queryID tag is mandatory on these queries, as is
the originator tag.
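
The repackaging step might look something like the sketch below. It treats
the query as a plain dict purely for brevity, and the one-hour default
lifetime is an invented value; nothing above says what the default should
be. The function names repackage and should_process are hypothetical too.

```python
from datetime import datetime, timedelta, timezone

# Assumed default lifetime; the proposal does not specify one.
DEFAULT_LIFETIME = timedelta(hours=1)


def repackage(query: dict, now: datetime) -> dict:
    """Prepare a client query for redistribution to other registries."""
    if not query.get("query_id"):
        raise ValueError("queryID is mandatory on a redistributed query")
    if not query.get("originator"):
        raise ValueError("originator is mandatory on a redistributed query")
    outgoing = dict(query)  # shallow copy so the client's query is untouched
    if outgoing.get("expiry") is None:
        # The client did not supply a cutoff, so the registry adds one.
        outgoing["expiry"] = now + DEFAULT_LIFETIME
    return outgoing


def should_process(query: dict, now: datetime) -> bool:
    """A receiving registry may ignore and delete a query past its expiry."""
    return now <= query["expiry"]
```

A receiving registry would call should_process first and simply drop the
query when it returns False.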

The originator tag identifies where the answers are sent back to: either the
client or the local registry. We would need some means of returning the
answers to the client either way - call back, messaging interface or
whatever.

If the scope="target", the <target> structure identifies the destination
registry(ies). The only valid targets are ones the local registry knows the
address of (all of them if every registry is mandated to keep a full list of
all other registries). The local registry will look up the location and
calling mechanism for these registries and pass the query on to them. In
turn, they will satisfy the query and return the results to the indicated
originator.
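
As a sketch of the forwarding step, assuming the local registry keeps a
lookup table from registry ID to some callable that delivers the query (the
table, the callables and the name forward_to_targets are all hypothetical
stand-ins for whatever location and calling mechanism we settle on):

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical: a function that delivers a query to one remote registry.
Sender = Callable[[dict], None]


def forward_to_targets(
    query: dict, known: Dict[str, Sender]
) -> Tuple[List[str], List[str]]:
    """Pass a 'target' scope query on to each target the registry knows.

    Returns (forwarded, unknown): the registry IDs the query was sent to,
    and the IDs that are not valid targets because we have no address.
    """
    forwarded: List[str] = []
    unknown: List[str] = []
    for registry_id in query["targets"]:
        send = known.get(registry_id)
        if send is None:
            unknown.append(registry_id)
        else:
            send(query)  # the target answers the originator, not us
            forwarded.append(registry_id)
    return forwarded, unknown
```

What to do with the unknown IDs (reject the whole query, or just tell the
client) is left open here.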

The 'target' scope queries could be synchronous but mandating that means we
have to program around the problem of what happens if one or more of the
targets is offline. Easier, I think, to make it asynch in the first place so
the client knows it has to accumulate results.

If the originator was identified as the local registry, it could dedup the
returning results - only sending on resources not already found. And then
the client could poll the local registry for updates. Might be easier?
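
A minimal sketch of that dedup-and-poll idea, assuming every resource
carries some unique identifier (the "id" key below is an invented stand-in)
and that the local registry keeps one accumulator per queryID:

```python
from typing import Dict, List, Set


class ResultAccumulator:
    """Collects answers for one query and hands on only unseen resources."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self._pending: List[Dict] = []

    def add(self, resources: List[Dict]) -> int:
        """Record results from one answering registry.

        Returns how many of them were new.
        """
        new = 0
        for resource in resources:
            resource_id = resource["id"]  # assumed unique per resource
            if resource_id not in self._seen:
                self._seen.add(resource_id)
                self._pending.append(resource)
                new += 1
        return new

    def poll(self) -> List[Dict]:
        """Return resources found since the last poll, then clear them."""
        fresh, self._pending = self._pending, []
        return fresh
```

The client then just calls poll() until the expiry time has passed; anything
it has already been given is never sent twice.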

If scope="all", the registry sends the same query to all the registries on
the VO. So the effect is the same as for 'target' scope. If we do not
mandate that every registry must carry a list of every other registry then
we need some way to put the query onto the VO so that registries will pick
it up and answer it. I'm sure there are algorithms for this.

====================

Actually, I've been talking myself out of this as I type. If a query is
scoped 'all' and there are a dozen or so 'full' registries in the VO, every
one of those will answer with the maximum result set - this will flood the
VO with redundant data. 

Unless there is some way of stopping answers that have already been
generated?

We could pass the results around so they are only ever added to but again
that is adding to network traffic unduly.

====================

*FINAL suggestion!!

How about if we have three types of registry:

1. full: will attempt to maintain a full list of all resources on the VO

2. limited: lists only resources of interest to a specific community

3. private: only lists the resources at that location; not queryable

Type 1 and 2 registries must maintain a full list of all other registries on
the VO. 

A client or portal linked to a 'full' registry always gets back a full set
of results from any query. 

A 'limited' registry offers the query scopes ("all", "target", "this")
above, except that the 'all' scope is simply sent to one of the 'full'
registries. The queries can now be synchronous, so we can lose the
complexity of the expiry attribute and originator tag (if a target is
offline for a 'target' scope query, it is missed out of the results and the
client is told of this). The registry is responsible for deduping the
results.

The 'private' registry is the data centre one mentioned above which is only
ever harvested by the other two types of registries and does not implement a
query interface.
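
For what it's worth, here is how I picture a 'limited' registry answering a
synchronous query under this scheme. It is only a sketch: the search
callables, the ConnectionError convention for an offline registry, the "id"
key used for deduping and the choice of simply taking the first 'full'
registry in the list are all assumptions of mine.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical: searches one registry; raises ConnectionError when offline.
Search = Callable[[str], List[dict]]


def limited_registry_query(
    scope: str,
    body: str,
    targets: List[str],
    local: Search,
    others: Dict[str, Search],
    full_ids: List[str],
) -> Tuple[List[dict], List[str]]:
    """Answer synchronously; return (deduped results, missed registry IDs)."""
    if scope == "this":
        destinations: List[str] = []
    elif scope == "target":
        destinations = list(targets)
    elif scope == "all":
        destinations = full_ids[:1]  # one 'full' registry is enough
    else:
        raise ValueError(f"unknown scope {scope!r}")

    results = list(local(body))
    missed: List[str] = []
    for registry_id in destinations:
        search = others.get(registry_id)
        if search is None:
            missed.append(registry_id)  # no known address for this target
            continue
        try:
            results.extend(search(body))
        except ConnectionError:
            missed.append(registry_id)  # offline: leave out, tell the client

    seen = set()
    unique = []
    for resource in results:
        if resource["id"] not in seen:  # "id" is an assumed unique key
            seen.add(resource["id"])
            unique.append(resource)
    return unique, missed
```

The 'private' registries never appear here because they are only harvested,
never queried.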

If anyone has managed to get this far down, how does this sound?

Cheers,
Tony. 

> -----Original Message-----
> From: Keith Noddle [mailto:ktn at star.le.ac.uk] 
> Sent: 24 April 2003 10:59
> To: IVOA Registry mailing list
> Subject: RWP04: Registry Replication
> ...