TAP RFC [VOSI]

Wed Sep 30 03:33:26 PDT 2009

Sorry for the lengthy email, but I find this quite important for the
overall VO architecture, especially since it affects TAP immediately
and will later be extended to all other DAL interfaces.

On 29 Sep 2009, at 19:19, Patrick Dowler wrote:

> On Tuesday 29 September 2009 08:00:49 Alberto Micol wrote:
>> My point is that a client (TAP, SIA, SSA, etc) cannot know in advance
>> if its request
>> is too heavy for a given server. Even more so, if the same query is  
>> to
>> be sent to many different servers.
>
> You are forgetting that the service also cannot in general know that  
> the query
> is a heavy, time-consuming request that will exceed the http  
> timeouts of using
> the sync endpoint.

This is exactly what I meant by "too heavy": time-consuming and
exceeding the HTTP timeout.

But I also meant that the service receiving a question (be it
queryData, getData, doQuery, etc.) is the only one that knows what to
do with that question.

I never had in mind that the service should use a fabulously smart
system to estimate how long it will take to answer any specific query.
I know that this is not feasible.
What I meant is that a service is implemented around whatever local
infrastructure exists, and it is the service provider who decides up
front, for a given data collection and knowing her own architecture,
whether a getData is served using SYNC or ASYNC, whether a doQuery is
ASYNC or SYNC, and so on.

The decision whether to serve SYNC or ASYNC is usually taken a priori
by the service, based only on the type of operation (getData, doQuery,
etc.) for a given data collection, without looking at the actual query
content.

In this respect I think it makes no sense to ask for SYNC treatment
from a service that is not set up to provide immediate answers. (The
answer would be negative; what a waste of time.)

Nor does it make sense to require ASYNC, because this will force all
data providers to put together a complex machine. Even those providers
that WANT to offer a SIMPLE and quick service, which could easily be
implemented with a SYNC mechanism, will have to implement something
much more complex.
This is an unnecessary burden, very much against what the takeup
committee wants to achieve.

I would much prefer to see questions posed without any SYNC or ASYNC
request; the service can then take the question and decide what to do:
- send back the answer to the question if it can (SYNC by default), or
  otherwise,
- send back a formal answer to inform the client that ASYNC (and UWS)
  is to be used.
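
A minimal sketch, in Python, of how such a dispatch could look on the
service side. The policy table, the status strings "OK" and
"USE_ASYNC", and all function names are assumptions made for
illustration only; none of them comes from the TAP or UWS drafts.

```python
from dataclasses import dataclass

# Hypothetical policy table: the provider decides up front, per data
# collection and operation type, how requests are served. The query
# content itself is never inspected.
POLICY = {
    ("small_catalog", "doQuery"): "SYNC",
    ("huge_catalog", "doQuery"): "ASYNC",
}


@dataclass
class Response:
    status: str  # "OK" or "USE_ASYNC" (illustrative values)
    body: str    # result payload, or a hint on where to submit a job


def run_query(collection: str, query: str) -> str:
    # Placeholder for the provider's real query engine.
    return f"results of {query!r} on {collection}"


def handle(collection: str, operation: str, query: str) -> Response:
    # SYNC by default: a simple provider needs no entry in POLICY.
    mode = POLICY.get((collection, operation), "SYNC")
    if mode == "SYNC":
        return Response("OK", run_query(collection, query))
    # Formal answer telling the client to use ASYNC (and UWS) instead.
    return Response("USE_ASYNC", f"/{collection}/async")
```

A provider offering only small catalogs would never fill in the table
and would never have to implement the ASYNC branch at all.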

No extra burdens to data providers, please!

And no extra burdens to the users either:
> In reality, users will try to do a query using sync and if it fails  
> they can
> either change the query or use async instead. If the user thought  
> the query
> was simple and fast they will likely examine it more closely for  
> bugs. If they
> know it is complex, they will maybe assume it is correct and try  
> async, or
> they may set MAXREC to something small and try sync again to test  
> it. I don't
> think the service can really make these decisions.

All that going back and forth is unnecessary, unless of course there
is a real bug in the question.
- If the service decides up front to use ASYNC (because it offers a
  huge catalog), the user simply sends the query, and the answer is to
  please use ASYNC.
- If the service decides up front to use SYNC (because it offers
  access to a small catalog), the user simply sends the query and
  receives the answer shortly.
Of course, a problem arises if a huge catalog is served only in SYNC
mode, or if a small catalog is served over a poor network connection.
Timeouts will likely happen often in those two cases. In such cases,
yes, the user will have to limit the result size using MAXREC, if the
provider has not already done so. Some handling of the kind proposed
by Pat will always be needed, but we should limit it to the strictly
necessary cases, balancing it against the burden otherwise imposed on
data providers.
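
The client side of the same idea could be sketched as follows, again
in Python. The `service` and `submit_async` callables, the status
strings, and the fallback MAXREC value of 100 are hypothetical
stand-ins, not part of any specification.

```python
def ask(service, submit_async, query, fallback_maxrec=100):
    """Send a plain query; follow the service's lead on SYNC vs ASYNC."""
    try:
        status, body = service(query, None)
    except TimeoutError:
        # The rare case: limit the result size and try once more.
        status, body = service(query, fallback_maxrec)
    if status == "USE_ASYNC":
        # The service decided up front that this collection is ASYNC.
        return "async", submit_async(query)
    return "sync", body


# Stand-in services used only to exercise the sketch.
def small_catalog(query, maxrec):
    return "OK", f"rows for {query}"


def huge_catalog(query, maxrec):
    return "USE_ASYNC", ""


def slow_catalog(query, maxrec):
    if maxrec is None:
        raise TimeoutError("sync request timed out")
    return "OK", f"first {maxrec} rows for {query}"


def submit_async(query):
    return "job-1"  # hypothetical job identifier
```

With these stand-ins, the user writes the same call for every service;
only the slow catalog triggers the MAXREC retry.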

In one sentence: why complicate things at both ends?

Alberto

>
> select * from someTable
> where INTERSECTS(spatial_bounds, circle('ICRS', 10, 10, 0.1)) = 1
>
> This is a typical spatial query (cone search) in ADQL. If the table  
> is small,
> it will probably be fast. If the table has a spatial indexing scheme  
> on the
> spatial_bounds column, it will probably be faster than if it does  
> not. If the
> content is spread out and the actual condition is very selective, it  
> will be
> faster than if all the content is inside the circle.... can anyone  
> really
> plausibly determine ahead of time that this will be fast? probably  
> fast?
> probably slow? slow? Not plausibly, in my opinion. I can look at a  
> query and
> make a good guess about whether it will be heavy or not, but I cannot  
> write
> software to make that guess for me :-)
>
>> To me, a query is always a SYNC query. If the service cannot answer
>> right away, the
>> service will politely inform the client that the request will take a
>> bit longer,
>> and will turn to ASYNC.
>