Asynchronous querying and tabular data

Tue May 1 08:18:00 PDT 2007

Dear all,

Copied below is a useful discussion from a colleague of why access 
protocols like SIAP and SSAP don't extend so gracefully to large 
tabular data queries, and why therefore we shouldn't try to make 
TAP exactly conform to the model assumed by these protocols.

Cheers,
Kona
-----------------------------------------------------------------

If I understand the DAL model correctly, an evolved DAL service,
e.g. SIAP 2, has generically three parts:

  - a synchronous HTTP query that returns a VOTable;

  - an asynchronous data-staging operation;

  - a streaming source of events describing changes in the service
    state.

The synchronous query can return, depending on the arguments and the
type of service, a data-set (as in Cone Search); or a table listing
available data-sets (SIAP, SSAP); or metadata describing the service
(all service types, but the metadata content varies between them).

The data-staging operation can be applied separately to the virtual
data-set described in each row of the query results. Each such
staging is a separate, asynchronous job. My understanding is that the
data staging operation controls the production of data; the data
staging URL doesn't supply the data stream. During data staging, the
requested data accumulate on the server and have to be downloaded
later.

The event stream is a way of monitoring the data staging without
polling. The data staging itself is supposed to support polling for
progress, so the event stream is an optional feature for the client.
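
As a rough illustration of the polling alternative, here is a minimal
Python sketch. The status names ("EXECUTING", "COMPLETED", "ERROR") and
the get_status callable are assumptions made for the example; they are
not taken from any DAL specification.

```python
import time
from typing import Callable

# Assumed status names; the discussion above does not define them.
TERMINAL_STATES = {"COMPLETED", "ERROR"}


def wait_for_staging(
    get_status: Callable[[], str],
    interval: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a (hypothetical) staging job until it reaches a terminal state.

    get_status stands in for whatever call reads the job's progress,
    e.g. an HTTP GET on the staging job's status resource.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        # Not finished yet: wait before asking again.
        sleep(interval)
    raise TimeoutError(f"staging not finished after {max_polls} polls")
```

A client using the event stream would replace the loop with a callback;
the polling form is shown only because it needs no extra machinery.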

This disposition assumes that the query is quick, because the
catalogue queried is a simple, short list of data sets. Any lengthy
work is done in the data-staging operation. That's a reasonable
assumption for SIAP and SSAP, where the image/spectrum catalogue is
usually short. It's less safe an assumption for cone search, unless
the service restricts the search area. It's an extremely poor
assumption for TAP, where queries are expensive and data staging less
of a problem. (For applications in general, it's a poor assumption
since the "query" may not have any scientific meaning. E.g. in
extracting a catalogue from an image, where is the "query"?)

I note that, for any service protocol, it's possible to artificially
divide it into a query and data staging, and to do the actual
computation in the data staging. For a data-processing application,
this might mean that the query returns a list of locations of results
files but those files are not computed until the data staging; but
the results of a computation are not usually independent, so staging
one implies staging all of them. For TAP, the "query" might return a
table listing a single data set with an access URI of the form
http://whatever/tap/stageData?ADQL=...; i.e. the real query is done during
the "data staging"; but this is forced and rather silly, and it
requires an extra HTTP operation plus the parsing of an extra VOTable
to do something rather simple.
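
To make the forced arrangement concrete, here is a small Python sketch
of what the "query" step would amount to: it does no real work and only
wraps the ADQL text into an access URI for later staging. The function
names, the row layout and the sample ADQL are hypothetical; only the
URI form comes from the paragraph above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def forced_query_row(base_url: str, adql: str) -> dict:
    """Build the single 'result' row of the forced query step.

    No query is executed here; the ADQL is merely URL-encoded into the
    access URI, and the real work is deferred to the staging operation.
    """
    return {
        "title": "query result",  # hypothetical column
        "access_uri": base_url + "?" + urlencode({"ADQL": adql}),
    }


def adql_from_access_uri(uri: str) -> str:
    """Recover the ADQL text that the staging step would have to run."""
    return parse_qs(urlsplit(uri).query)["ADQL"][0]
```

The client thus sends the ADQL once, parses a one-row table, and then
sends the same ADQL back inside the URI, which is the extra round trip
complained about above.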

In any of these DAL-like services, the results can be read with or
without data staging; their access references exist as soon as the
query completes. There doesn't seem to be a way for the client to
tell whether the results need staging. Presumably the client gets an
HTTP 404 when it tries to read, without staging, a data set that should
have been staged first.

The results of the synchronous query are assumed to be small enough
to return to the client via the control connection; cf. OGSA-DAI and
DSA/Catalogue. This is a reasonable assumption for SSAP and SIAP; a
weak assumption for Cone; and a broken assumption for TAP. Delivery of
results to third parties is possible if those results are linked from
the query results but not possible when the ultimate results ARE the
immediate results of the query. Streaming delivery to a third party -
where the results are not cached in the originating service - is
possible in principle, but only by means of the recipient reading the
results URI and waiting; results cannot be pushed. This approach
fails if the results do not flow steadily; that would risk a time-out
in reading the URL. Therefore, streaming delivery done this way does
not fit well with the data staging.

I suggest that in TAP, and in applications in general, we usually
have an initial, atomic unit of work that takes an arbitrary amount
of time. Once this work is completed, there exist results data-sets
that can be immediately downloaded. The initial work needs to be
controlled asynchronously, and the downloading of results can be
synchronous, as the data then flow continuously. A synchronous query with
asynchronous data-staging is exactly the opposite of what is needed.
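
The shape suggested above (asynchronous control of one atomic unit of
work, followed by a plain synchronous read of the results) can be
sketched in Python as follows. This is an in-memory toy, not a UWS
implementation: the class name, the phase names and the use of a
thread are all assumptions made for illustration.

```python
import threading
from typing import Callable, Iterator, List, Optional


class AsyncQueryJob:
    """Toy job: the work is controlled asynchronously, results read synchronously."""

    def __init__(self, work: Callable[[], List[tuple]]) -> None:
        self._work = work              # stands in for the expensive query
        self._lock = threading.Lock()
        self._done = threading.Event()
        self._rows: Optional[List[tuple]] = None
        self.error: Optional[BaseException] = None
        self.phase = "PENDING"         # assumed phase names

    def start(self) -> None:
        """Begin the work and return immediately."""
        with self._lock:
            if self.phase != "PENDING":
                raise RuntimeError("job already started")
            self.phase = "EXECUTING"
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self) -> None:
        try:
            rows = self._work()
            with self._lock:
                self._rows = rows
                self.phase = "COMPLETED"
        except Exception as exc:
            with self._lock:
                self.error = exc
                self.phase = "ERROR"
        finally:
            self._done.set()

    def wait(self, timeout: Optional[float] = None) -> str:
        """Block until the job finishes (or the timeout expires); return the phase."""
        self._done.wait(timeout)
        return self.phase

    def results(self) -> Iterator[tuple]:
        """Synchronous download: valid only once the work has completed."""
        if self.phase != "COMPLETED":
            raise RuntimeError(f"no results in phase {self.phase}")
        return iter(self._rows or [])
```

A client would call start(), then poll phase or call wait(), and only
then iterate over results(); by that point the rows already exist, so
the read cannot stall part-way.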

In summary, the DAL model of services is well fitted to the special
cases of image and spectrum servers where:

  - there is a small catalogue on which all queries are quick; and

  - there are multiple, independent files of results that may be
    computed separately, on demand.

The DAL model is a poor fit to any other kind of service. It can be
patched up to serve other cases, but it is fundamentally wrong for
TAP and for general applications. For TAP, I suggest that we apply
the UWS pattern directly to produce asynchronous controls for the
query.

-- 
Kona Andrews        kea at roe.ac.uk
AstroGrid Project   http://www.astrogrid.org
IfA, Royal Observatory, Blackford Hill, Edinburgh EH9 3HJ