European DataGrid notes

Norman Gray norman at astro.gla.ac.uk
Wed Jun 11 08:19:38 PDT 2003


Greetings.

During one of the Registry discussions at last month's Cambridge meeting,
I mentioned that EDG had done a lot of work on replica management and
metadata servers.  I was asked to find out more from the EDG folk located
in Glasgow, and post here.  So here it is.  This may be more detail
than folk want, but on the off-chance that it's not enough for anyone,
I can find out more locally.

All this is rather confusing, so don't quote me _too_ authoritatively!

All the best,

Norman





EDG for beginners
=================

EDG is the European Data Grid <http://www.edg.org> (equivalently
<http://eu-datagrid.web.cern.ch/eu-datagrid/>), and is the main EC-funded
grid project.  It runs over three years, from the start of 2001 to the
end of 2003, and is more-or-less on schedule to deliver a large package
of software by then.  EGEE <http://egee-ei.web.cern.ch/egee-ei/New/Home.htm>
is due to work on deploying the technology EDG develops.

GridPP <http://www.gridpp.ac.uk/> is the UK PPARC-funded particle physics
grid, and is intended to provide PP applications, physical infrastructure,
and middleware for the UK.  It's integrated with EDG but not part of it.
GridPP provides several FTEs of effort toward EDG.

There's a kit of documents for new DataGrid folk at
<http://eu-datagrid.web.cern.ch/eu-datagrid/QAG/DataGrid%20Resource%20kit.htm>,
some of which makes interesting reading.  As well as overview documents,
there are useful process documents here, concerning software release
plans and the like.

There's a variety of WP2 documents at
<http://edg-wp2.web.cern.ch/edg-wp2/documents.html>,
including a collection of recommended pointers at
<http://edg-wp2.web.cern.ch/edg-wp2/readings.html>, which looks valuable
for folk doing related work.


----------------------------------------------------------------------

Structure

EDG includes applications in Particle Physics, Earth Observation and
Biology (WP8-10), plus management and dissemination (WP11, WP12).  A
large part of the effort is in the domain of middleware, which is what's
of most interest here.

WP1-5 are where the majority of the code development happens (as far as
I understand it), and WP2 is the data management work package, which
contains the replica and database effort.

Aside on version numbers: EDG appears to overload the term `testbed' for
some reason.  If you see or hear the term `Testbed n.m', this refers to
release n.m of the EDG software set.  Testbed 2.0 is the upcoming
version, which has most of the functionality, but isn't intended to
be final.

----------------------------------------------------------------------

WP2 Data Management <http://edg-wp2.web.cern.ch/edg-wp2/>:

    The goal of this work package is to specify, develop, integrate
    and test tools and middleware infrastructure to coherently manage
    and share petabyte-scale information volumes in high-throughput
    production quality grid environments. [...]  to move and replicate
    data at high speed from one geographical site to another, and to
    manage synchronisation of remote replicas.

This is specifically about managing replicas of large data volumes, rather
than finding them.

The documents on the WP2 pages are mostly design documents, with only
one or two after 2001.  The usage and API documents are evolving with
the software, and will be released along with that software toward the
end of this year.  Again, as I understand it.

The two important components of WP2, from the VO point of view, are
replication and database abstraction.

----------------------------------------------------------------------

Spitfire: <http://edg-wp2.web.cern.ch/edg-wp2/spitfire/>

Spitfire is part of WP2, and provides a database abstraction layer.

[Just by the way, I understand that although folk expected to be using
LDAP for some of this work, it has turned out to be _too_ lightweight
for most applications, and LDAP has disappeared from Globus for this
reason.]

Spitfire appears to be focused on the relatively small quantities of
data and metadata which the (VO) registry is intended to handle.  It's
capable of handling larger data volumes, however.

Description on that page:

    Grid middleware and Grid application software often need access
    to persistent data or need to write data into a persistent
    store. For massive amounts of application data the applications
    will continue to use their own optimized data stores. But for
    short lived, small amounts of data and metadata that needs to be
    highly accessible to many users and applications throughout the
    Grid there is a need for an abstract high-level Grid database
    interface. Without such a service applications and Grid
    middleware services will continue to use dozens of varying and
    incompatible approaches necessitating complex and expensive
    translation and conversion steps.

It's implemented using web-server technology (Tomcat, Axis, JDBC,
SOAP, OGSA), and they've put in effort to make it efficient.
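
Since the usage and API documents aren't released yet (see above), the
following is only a guess at the flavour of the thing: a minimal sketch,
in Java against Apache Axis 1 (part of the stack just listed), of the
kind of dynamic SOAP call a Spitfire-style client would make.  The
endpoint URL, namespace, operation name (`select') and query are all
invented for illustration, and are certainly not Spitfire's real
interface.

    import java.net.URL;
    import javax.xml.namespace.QName;
    import org.apache.axis.client.Call;
    import org.apache.axis.client.Service;

    public class SpitfireClientSketch {
        public static void main(String[] args) throws Exception {
            // Axis's dynamic invocation interface: build a Call
            // against a (here hypothetical) Tomcat-hosted service.
            Service service = new Service();
            Call call = (Call) service.createCall();
            call.setTargetEndpointAddress(new URL(
                "http://dbserver.example.org:8080/Spitfire/services/Base"));

            // Invented operation: pass a query through the
            // abstraction layer to the backing JDBC database.
            call.setOperationName(new QName("urn:example-db", "select"));
            String result = (String) call.invoke(new Object[] {
                "SELECT name, ra, dec FROM sources WHERE mag < 12"
            });
            System.out.println(result);
        }
    }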

----------------------------------------------------------------------

Replica Management Task: <http://edg-wp2.web.cern.ch/edg-wp2/replication/>

The replica management is intended to deal with the bulk
data (TB to PB) on a data grid.  The current EDG replica
manager is called Replica Manager (RM, though you might see
references to `Reptor', which was an earlier version), and the
design and implementation of an earlier version is described in
<http://cern.ch/grid-data-management/docs/ReplicaManager/ReptorPaper.pdf>.
It can handle read-only and read-write replicas, and supports, for
example, a variety of different semantics for master copies.  From that
document:

    The Replica Management Service (RMS) is a logical single entry
    point for the user to the replica management system. It
    encapsulates the underlying systems and services and provides a
    uniform interface to the user. Users of the RMS may be application
    programmers that require access to certain files, as well as high
    level Grid tools such as scheduling agents, which use the RMS to
    acquire information for their resource optimisation task and file
    transfer execution.

This is currently working code -- it still has some way to go before
it's production quality, but I understand its general features won't now
change substantially.  Though replication isn't really the concern of
the VO Registry group, I would imagine that the VO community could adopt
these ideas and implementations fairly easily, and even for the registry
design work the ideas worked out in this paper would probably be useful.
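
To make the shape of that concrete, here's a minimal sketch, in Java
with invented names, of the sort of uniform entry point the paper
describes: one interface hiding the catalogue, transfer and optimisation
machinery underneath.  None of these method names or signatures is the
actual RM API; for that we'll have to wait for the usage documents.

    import java.util.List;

    /** Hypothetical checked exception, just for this sketch. */
    class ReplicaException extends Exception {
        ReplicaException(String msg) { super(msg); }
    }

    /**
     * Sketch of an RMS-style single entry point.  Files are named
     * by logical file name (LFN), copies by physical URL.
     */
    public interface ReplicaManagementSketch {

        /** Copy a file to a storage element and record the new
            copy in the replica catalogue. */
        String copyAndRegisterFile(String sourceURL, String destSE)
            throws ReplicaException;

        /** Record an already-existing copy under a logical name. */
        void registerReplica(String lfn, String physicalURL)
            throws ReplicaException;

        /** All physical locations currently known for a logical file. */
        List listReplicas(String lfn) throws ReplicaException;

        /** Remove one physical copy; the master-copy semantics would
            decide what happens to the last remaining copy. */
        void deleteReplica(String lfn, String physicalURL)
            throws ReplicaException;
    }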

RMS appears to deal with the core functions of replica creation, deletion,
and cataloguing.  There's a separate task concerned with optimising
access to replicas, the Replica Optimization Service (ROS), Optor,
<http://edg-wp2.web.cern.ch/edg-wp2/optimization/ros.html>.  That handles
short-term optimisation only at present (finding the `best' replica at
the time it's requested), but aims eventually to do long-term,
predictive optimisation based on file access patterns.  I believe this
is working, but currently less developed than RM.
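
Here's a crude sketch of that short-term decision: given the catalogued
copies of a file and an invented cost table (replica URL to estimated
transfer seconds, standing in for whatever network and storage-element
measurements Optor actually consults), pick the cheapest.

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    public class ReplicaSelectionSketch {

        /** Return the replica URL with the smallest estimated cost. */
        public static String bestReplica(Map costByUrl) {
            String best = null;
            double bestCost = Double.MAX_VALUE;
            for (Iterator it = costByUrl.entrySet().iterator();
                 it.hasNext();) {
                Map.Entry e = (Map.Entry) it.next();
                double cost = ((Double) e.getValue()).doubleValue();
                if (cost < bestCost) {
                    bestCost = cost;
                    best = (String) e.getKey();
                }
            }
            return best;
        }

        public static void main(String[] args) {
            Map costs = new HashMap();
            costs.put("gsiftp://se1.example.org/data/run42.dat",
                      new Double(4.2));
            costs.put("gsiftp://se2.example.org/data/run42.dat",
                      new Double(1.7));
            // Picks the se2 copy, the cheaper of the two.
            System.out.println(bestReplica(costs));
        }
    }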

RM 1.2 was a thin wrapper round Globus Replica Management, which was
heavily based on GridFTP, and turned out to be too simple-minded.

RM 1.3 is a complete rewrite, independent of Globus Replica Management,
but still able to use GridFTP amongst other protocols.  This is
approaching a final version, though there are still some decisions to be
made about fat versus thin clients, whether proxies are involved, and
that class of stuff.



-- 
---------------------------------------------------------------------------
Norman Gray                        http://www.astro.gla.ac.uk/users/norman/
Physics and Astronomy, University of Glasgow, UK     norman at astro.gla.ac.uk


