Asynchronous querying and tabular data

Wed May 2 07:35:39 PDT 2007

On Tue, 1 May 2007, Patrick Dowler wrote:

> On Tuesday 01 May 2007 21:02, Doug Tody wrote:
>> The result, if a large query is attempted synchronously, is truncation
>> or an error response; alternatively, for seriously large queries we
>> have a two-stage operation involving estimation and job submission.
>> This is basically what the queryData/stageData concept already provides.
> 
> I am afraid you have lost me here. I see no reason to infer that
> queryData is some sort of estimate on the work required to do the
> real thing. In SIA it is a query and returns the query result. It
> happens that the query result itself describes something else and
> one column (hopefully) contains a URL to the something else. It is
> not an estimate.

In SIA/SSA etc., when used to access virtual data, queryData represents
a contract between the client and server: each row of the output table
specifies a data product which could be produced.  A row can then be
referenced back to the service, e.g., with stageData, to have it
go off and do the computation.

Currently the query response only supports synchronous data access,
so there is no indication of the computational cost or time-to-run
of a job (although the output dataset size is estimated).  However,
as we add support for asynchronous operations to the DAL services,
we just need to add this information to the query response, as an
optional capability for services which support async operations.
The response can then tell whether or not a dataset can be computed
synchronously, and if not, estimate the size of the computation
required to produce it.
The client can then either repeat the query to refine the job
specification (e.g., ask for something smaller), or stage the request.
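
To make the client-side decision concrete, here is a rough sketch in
Python.  None of this is in any DAL specification: the field names
(sync_ok, est_cost_seconds, est_size_bytes) and the cost budget are
assumptions made up for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryRow:
    """One row of a query response (all field names are hypothetical)."""
    product_id: str
    est_size_bytes: int                       # size estimate, already provided today
    sync_ok: bool                             # assumed new flag: computable synchronously?
    est_cost_seconds: Optional[float] = None  # assumed new field: estimated job cost


def next_step(row: QueryRow, max_cost_seconds: float) -> str:
    """Decide what the client does with one virtual data product."""
    if row.sync_ok:
        return "getData"      # fetch it directly, synchronously
    if row.est_cost_seconds is not None and row.est_cost_seconds <= max_cost_seconds:
        return "stageData"    # submit it as an asynchronous job
    return "refine"           # repeat the query, asking for something smaller
```

A client would call next_step() on each row and batch up the rows that
come back as "stageData" into a single staging request.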

The concept with stageData is that it references one or more of
these virtual data products (tasks?), and initiates a single batch
job to compute all of them.  The job might compute only a single
computationally intensive dataset, or it might compute thousands of
smaller datasets in parallel.  For each data product, the stageData
request will also need to specify disposition, e.g., is the data
to be staged locally, or delivered to a remote VOSpace.  If data is
staged locally, a streaming GET (normal synchronous getData) can be
used for retrieval, even of very large datasets.
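
As a sketch of what such a batch request might carry, assuming a
per-product disposition of either "local" or "vospace" (these values,
the class names, and the example destination URI are all invented
here, not part of any defined stageData interface):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StageItem:
    """One virtual data product to compute (hypothetical structure)."""
    product_id: str
    disposition: str = "local"          # "local" or "vospace" (assumed values)
    destination: Optional[str] = None   # where to deliver, for remote disposition

    def __post_init__(self) -> None:
        if self.disposition not in ("local", "vospace"):
            raise ValueError(f"unknown disposition: {self.disposition}")
        if self.disposition == "vospace" and not self.destination:
            raise ValueError("remote delivery needs a destination")


@dataclass
class StageRequest:
    """A single batch job covering one or many data products."""
    items: List[StageItem] = field(default_factory=list)

    def add(self, product_id: str, disposition: str = "local",
            destination: Optional[str] = None) -> None:
        self.items.append(StageItem(product_id, disposition, destination))

    def retrievable_by_get(self) -> List[str]:
        # Locally staged products can be fetched later with a streaming GET.
        return [i.product_id for i in self.items if i.disposition == "local"]


req = StageRequest()
req.add("cutout-1")                                          # staged locally
req.add("cutout-2", "vospace", "vos://example/out/cutout-2")  # made-up URI
```

The point of carrying disposition per item is that one job can mix
products the client will pull back itself with products delivered
elsewhere.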

> TAP queries may contain a column with a URL to something, but the
> standard case is that the query result is something in its own right
> and not generally the first of two stages of work. In this light,
> I think it is a perfectly reasonable interpretation of typical DAL
> style to say that queryData is a synchronous method that returns a
> query result.

Right; for a simple TAP query against a data table, the operation
should probably be synchronous and return the table data directly
(and this will be enough for many queries).  If this mode is used
for a large query, about all we can do is truncate the result
or return an error.  In that case there is probably no alternative
to a two-step process of estimation followed by a staging request.
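
The two synchronous outcomes for an oversized query can be sketched as
follows.  This is only an illustration: the row limit, the truncate
switch, and the QueryTooLarge exception are assumptions, not anything
defined for TAP.

```python
from typing import List, Sequence, Tuple


class QueryTooLarge(Exception):
    """Raised when a synchronous query exceeds the row limit (hypothetical)."""


def run_sync(rows: Sequence[tuple], max_rows: int,
             truncate: bool = True) -> Tuple[List[tuple], bool]:
    """Return (rows, truncated_flag), or raise if truncation is not allowed."""
    if len(rows) <= max_rows:
        return list(rows), False           # small query: return everything
    if truncate:
        return list(rows[:max_rows]), True  # large query: truncated result
    raise QueryTooLarge(
        f"{len(rows)} rows exceeds synchronous limit of {max_rows}")
```

Either way the client learns the query was too big for synchronous
mode, which is its cue to fall back to estimation plus staging.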

 	- Doug