Multi-conference report: VO and SW
Reagan Moore
moore at sdsc.edu
Mon Dec 12 08:31:48 PST 2005
Norman:
The scale of the sky survey collections drives the preferred
choice of data management technology. At SDSC, we have been
replicating large sky surveys onto the Teragrid to support
large-scale analyses. We use the following mechanisms to do this:
- install data grid servers at the sites that want to replicate their
data. A data grid server is software that runs at the application
level, and corresponds to the VOSpace/VOStore access software.
VOStore is intended to provide access to the storage, while VOSpace
is intended to manage the name spaces.
- register the existing images into the data grid (VOSpace). VOSpace
manages a logical name space for each image, independently of the
naming convention used at the original site. The logical name space
can be organized as a collection hierarchy, and metadata can be
associated with each image to support browsing and discovery.
- separately replicate images onto remote systems. In our case, we
replicate data onto disk that is directly accessible by the Teragrid.
- port NVO services on top of the data grid. The same access
mechanisms will work on either the original site or the Teragrid.
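The four steps above can be sketched as a logical catalog that maps site-independent names to physical replicas. This is only an illustrative sketch, not the actual SRB or VOSpace API; all class and method names here are hypothetical.

```python
# Hypothetical sketch of the register/replicate/resolve workflow.
# Not the SRB/VOSpace interface; names are invented for illustration.

class DataGrid:
    """Maps logical names to one or more physical replica locations."""

    def __init__(self):
        self.catalog = {}  # logical name -> list of physical locations

    def register(self, logical_name, physical_location):
        # Register an existing image under a site-independent
        # logical name (step 2 above).
        self.catalog.setdefault(logical_name, []).append(physical_location)

    def replicate(self, logical_name, target_location):
        # Record a new replica (step 3); the actual byte transfer would
        # be done by the data grid server installed at the target site.
        self.catalog[logical_name].append(target_location)

    def locate(self, logical_name):
        # Services resolve the logical name (step 4) and can use any
        # replica -- the original site's copy or the Teragrid copy.
        return self.catalog[logical_name]

# Usage: register an image at its original site, then add a
# Teragrid replica under the same logical name.
grid = DataGrid()
grid.register("/nvo/survey1/img001.fits", "srb://origsite/data/img001.fits")
grid.replicate("/nvo/survey1/img001.fits", "srb://teragrid/disk/img001.fits")
```

The point of the sketch is that services written against the logical name never need to know which physical copy they are reading.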
This brings up the issue of a naming convention that is uniform
across all storage systems. Images within a data grid can be labeled
with a standard URI, a Globally Unique ID (GUID), a handle
(Object ID), descriptive metadata, or a logical name under which
the data grid organizes images. We usually manage four name spaces for
data within the SRB data grid (physical name, logical name, GUID,
descriptive metadata), but can easily add other naming conventions.
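The four name spaces can be pictured as independent indexes over the same image records. The sketch below assumes nothing about the SRB schema; the field names and the use of a random UUID for the GUID are illustrative choices.

```python
# Illustrative sketch of four name spaces (physical name, logical
# name, GUID, descriptive metadata) resolving to one image record.
# Field names are hypothetical, not the SRB catalog schema.
import uuid

class ImageRecord:
    def __init__(self, logical_name, physical_name, metadata):
        self.logical_name = logical_name    # data-grid-organized name
        self.physical_name = physical_name  # site's own naming convention
        self.guid = str(uuid.uuid4())       # globally unique ID
        self.metadata = metadata            # descriptive key/value pairs

class NameResolver:
    def __init__(self):
        self.by_logical = {}
        self.by_physical = {}
        self.by_guid = {}

    def add(self, record):
        # The same record is reachable through every name space.
        self.by_logical[record.logical_name] = record
        self.by_physical[record.physical_name] = record
        self.by_guid[record.guid] = record

    def find_by_metadata(self, **query):
        # Discovery by descriptive metadata supports browsing.
        return [r for r in self.by_logical.values()
                if all(r.metadata.get(k) == v for k, v in query.items())]
```

Adding another naming convention amounts to adding one more index over the same records.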
The data grid resolves which replica to use based on expected access
performance. The transport mechanisms are designed to support
parallel I/O streams for large files, bulk transport of small files,
and interactions with firewalls. Single images could be moved with
any desired protocol. One of the purposes of the data grid is to map
from the protocols used by storage systems to the protocols used by
user-preferred clients. You can choose to move data through a web
interface using HTTP, or through bulk transfer mechanisms using
parallel I/O streams.
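Replica and transport selection as described above might look like the following. The cost model and size thresholds are invented for illustration; the real data grid's performance estimation is more involved.

```python
# Hypothetical sketch of replica selection by expected access
# performance, plus transport choice by transfer shape. Thresholds
# and cost values are invented for illustration only.

def choose_replica(replicas, expected_latency):
    # expected_latency: location -> estimated seconds to first byte;
    # unknown replicas are treated as infinitely slow.
    return min(replicas,
               key=lambda loc: expected_latency.get(loc, float("inf")))

def choose_transport(file_size_bytes, n_files):
    if n_files > 1 and file_size_bytes < 1_000_000:
        return "bulk"           # many small files: bulk transport
    if file_size_bytes > 1_000_000_000:
        return "parallel-io"    # large file: parallel I/O streams
    return "http"               # single modest file: plain HTTP

# Usage: a Teragrid replica with lower expected latency wins.
replicas = ["srb://origsite/img.fits", "srb://teragrid/img.fits"]
latency = {"srb://teragrid/img.fits": 0.02,
           "srb://origsite/img.fits": 0.4}
best = choose_replica(replicas, latency)
```

The client asks for data by logical name; the grid makes both decisions, so the storage system's protocol never leaks into the client.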
An example is the data grid technology installed by NOAO to manage
data movement from telescopes in Chile to Tucson, and the
organization of images in a persistent archive for long term storage.
Reagan Moore
SDSC
>Doug,
>
>On 2005 Dec 8, at 21.42, Doug Tody wrote:
>
>>A given URI may resolve into multiple URLs pointing to multiple instances.
>
>That's the difference! I had completely forgotten about the
>one-to-many resolution.
>
>I'm working this through out loud here, Doug, for my benefit rather
>than yours, as I imagine you've been through this already, and
>because it might be useful (to me if no one else) to have the whole
>argument in one place.
>
>The underlying reason is that the resources in question are biggish.
>This breaks the assumptions of the best practice/architecture
>analysis in two independent ways:
>
>1. The resources are replicated, and large enough that the client's
>location on the network matters.
>
>2. The size means that HTTP is probably not the best transport
>mechanism, but instead GridFTP, or BitTorrent, or something else.
>
>In both cases, the client can't be expected to make a good decision
>about which source to use (because that will depend on details of
>the national and intercontinental network, which will moreover
>change in time), nor which protocol to use (which will also depend
>on network environment and time). A local resolver can be expected
>to know these things, either by discovery or configuration.
>
>The assumption that's broken is the single, almost hidden,
>assumption that the transport issue is solved -- `use HTTP'. Even
>if that were sorted out, and everyone decided that GridFTP (say) was
>the single best transport, the analysis also assumes that there is a
>single source -- a single DNS host -- for the resource; the
>replication in (1) means that we're not assuming that. That can
>also be got around, by having a DNS name be handled by multiple
>geographically dispersed IP addresses (Google is well known to do
>this), but this is technically complicated and therefore fragile,
>and also centralised.
>
>Even if they acknowledge the first HTTP point, the response to this
>second point on the part of the TAG (the W3C Technical Architecture
>Group, authors of the Web Architecture document) would be to point
>at the replication implicit in (1). One of the good features of
>HTTP is that it is stateless, which means that it is very friendly
>to caches and proxies, so you _can_ have a simple single source, and
>just rely on caches to speed things up -- don't try to outsmart the
>network!
>
>But the sizes undermine that argument, too: few places have the
>resources to cache lots of multi-GB files, and if regional or
>national centres were set up which could handle that, it would
>require configuration cleverness to use them. Thus the replication
>is essentially a type of preemptive caching.
>
>On the other hand: I suppose there is still one case for using HTTP
>with a (nominally) single source, along with a smart local proxy,
>which spots when you're requesting a resource/source it knows about,
>and satisfies those requests using (transparently) a separate
>network of replicas and protocols. That way, the client gets all
>the simplicity, predictability and API advantages of using HTTP
>naively (because that would work fine over a local network). The
>proxy is effectively acting as a resolver, but the client is
>interacting with it using an extremely simple and possibly built-in
>protocol/API, and so doesn't have to care. Is there mileage in that?
>
>...but I think I'm going on at too much length now, so I'll shut up!
>
>All the best,
>
>Norman
>
>
>--
>----------------------------------------------------------------------------
>Norman Gray / http://nxg.me.uk
>eurovotech.org / University of Leicester, UK