TAPRegExt: sync/async preference
Stelios Voutsinas
SVoutsinas at lsst.org
Thu Jul 17 20:27:43 CEST 2025
Dear DAL,
Some thoughts on this interesting discussion:
I do like the idea of Pat's "option 2" and Paul's 504-with-job-URL approach.
For services with resource constraints, I think this addresses a real operational problem.
In our case for example, we run sync queries asynchronously underneath in an event-based architecture, but supporting sync forces us to keep HTTP connections open, tying up connection pools and memory for what are fundamentally async operations. The continue-as-async pattern would let us be honest about our architecture by starting the query immediately and then releasing the connection if it doesn't complete quickly versus maintaining expensive synchronous operations that don't scale with user load.
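To make the continue-as-async idea concrete, here is a minimal sketch of what the server side might look like. It is not our actual implementation: `handle_sync`, `run_query`, `SYNC_WAIT_SECONDS`, the status strings and the `/tap/async/...` URL shape are all hypothetical names chosen for illustration, and the query is simulated with a sleep.

```python
import asyncio

# Hypothetical names throughout; this is not any real TAP service's API.
SYNC_WAIT_SECONDS = 0.05  # assumed sync budget, kept tiny for the demo


async def run_query(duration, rows):
    """Stand-in for the real query: sleeps, then returns the rows."""
    await asyncio.sleep(duration)
    return rows


async def handle_sync(jobs, job_id, duration, rows):
    """Start the job at once, wait briefly, then release the connection."""
    task = asyncio.ensure_future(run_query(duration, rows))
    jobs[job_id] = task  # the job stays reachable through the async endpoint
    try:
        # shield() keeps the job running when the wait itself times out
        result = await asyncio.wait_for(asyncio.shield(task), SYNC_WAIT_SECONDS)
        return {"status": "COMPLETED", "rows": result}
    except asyncio.TimeoutError:
        return {"status": "EXECUTING", "job_url": f"/tap/async/{job_id}"}


async def demo():
    jobs = {}
    fast = await handle_sync(jobs, "job1", 0.0, [1, 2])
    slow = await handle_sync(jobs, "job2", 0.2, [3])
    later = await jobs["job2"]  # what a client polling the job URL would get
    return fast, slow, later


fast, slow, later = asyncio.run(demo())
```

The point of the sketch is only that the query is started once and never restarted: the slow request returns early with a job link, and the same job later yields the result.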
On the question of why not just use async?
I think there are good reasons clients may choose to default to sync beyond just simplicity, since sync is stateless and perhaps slightly more resource-efficient for clients, i.e. one HTTP request, one response, done, versus creating jobs, polling loops, cleanup logic and maintaining state across multiple requests.
If the majority of queries complete within the sync timeout limits for most services, sync's resource efficiency makes it a reasonable default.
In fact as far as I can see pyvo (search method) and Topcat default to sync, but then users hit service timeouts on queries that could complete if given more time.
Also I'm not sure if I would classify this as a breaking change. This actually improves the failure mode rather than breaking it from my perspective.
Currently: sync timeout -> cryptic error -> user frustration (?).
With continue-as-async: sync timeout -> clear "query running, check here" message.
For "dumb" clients, the user would just get a VOTable back with a message indicating that the job is still executing, and a link to the async job instead of a 504 or cryptic message, whereas smart clients can auto-switch to polling. The success cases remain identical, so I'm not sure if this would be breaking anything.
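As a rough illustration of the "smart client" side, something like the following would do. The response shape (a dict with `status`, `rows` and `job_url`) and the `poll_job` callable are assumptions made for this sketch, not an existing pyvo or TOPCAT interface.

```python
import time


def fetch_results(sync_response, poll_job, interval=0.0, max_polls=10):
    """Return rows from a sync response, following the job link if needed.

    sync_response and poll_job are hypothetical stand-ins: a dict with a
    'status' key, and a callable returning the job's current state.
    """
    if sync_response["status"] == "COMPLETED":
        return sync_response["rows"]  # the success case is unchanged
    job_url = sync_response["job_url"]
    for _ in range(max_polls):
        state = poll_job(job_url)
        if state["status"] == "COMPLETED":
            return state["rows"]
        time.sleep(interval)
    raise TimeoutError(f"job at {job_url} still running")


# Fake job that completes on the third poll, for illustration only.
_states = iter([
    {"status": "EXECUTING"},
    {"status": "EXECUTING"},
    {"status": "COMPLETED", "rows": ["a", "b"]},
])
rows = fetch_results(
    {"status": "EXECUTING", "job_url": "/tap/async/42"},
    lambda url: next(_states),
)
```

A "dumb" client would simply stop at the first response and show the message and link to the user.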
In terms of the implementation complexity, I do think this would mainly benefit services that run sync jobs as async in the background, and for those I think the implementation seems straightforward.
For other services, which would require architectural changes to support this, I think the optional nature of the feature allows them to choose whether to spend the effort implementing it. If there is concern about the computational load of keeping the query running, services could also check whether the user has polled the async endpoint when deciding whether to stop it.
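A possible shape for that "has anyone polled?" check, purely as a sketch: the `Job` fields, the `should_abort` helper and the 60-second grace period are invented for illustration and would be a policy choice for each service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    handed_off_at: float                    # when the sync connection was released
    last_polled_at: Optional[float] = None  # None until a client polls the job


def should_abort(job: Job, now: float, grace_seconds: float = 60.0) -> bool:
    """Abort if nobody has shown interest in the job within the grace period."""
    if job.last_polled_at is not None:
        last_interest = job.last_polled_at
    else:
        last_interest = job.handed_off_at
    return now - last_interest > grace_seconds


never_polled = Job(handed_off_at=0.0)
polled = Job(handed_off_at=0.0, last_polled_at=50.0)
```

With these values, a job nobody has polled is dropped once the grace period has passed, while each poll pushes the deadline back.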
The fundamental issue, I think, is that sync mode is structurally limited by HTTP timeouts and connection pools, but can potentially be seen as a good starting point for both users and client libraries. At the same time there is always the question of how visible the TAP execution mode (async/sync) should be to the user versus simplifying the process and the number of steps they need to go through to get the results they are interested in.
I do see the argument for just improving the error messages to point users to async. But I still see a couple of limitations that are not addressed if we go that route. First, the user experience still isn't great for users of a service that has very low sync limits. They'd have to remember to switch to async mode each time they try to use the client, and in some cases, if they are unaware of the protocols, it may not be obvious what they have to do. Second, we double the computational load, since each query that times out is essentially run twice. Perhaps this is where Markus' idea comes in of advertising limits and a preference for run mode. Where this gets tricky from my point of view is for TAP_SCHEMA queries as mentioned, where we know they will complete quickly and thus they follow a different execution path where sync makes sense.
In general, for our service both "continue-as-async" and the "preference-in-the-capabilities" approach would help, assuming clients like pyvo/topcat would implement changes to either follow the redirect or switch the execution mode accordingly. I lean slightly towards Option 2 because of the flexibility and the TAP_SCHEMA limitations of the metadata approach. But at the same time Markus' original proposal is probably the more frictionless approach, so I think that would also be a good way forward if the alternative is too contentious.
Cheers,
Stelios Voutsinas
________________________________
From: dal <dal-bounces at ivoa.net> on behalf of Grégory Mantelet via dal <dal at ivoa.net>
Sent: Thursday, July 17, 2025 2:02 AM
To: Paul Harrison
Cc: dal at ivoa.net
Subject: Re: TAPRegExt: sync/async preference
Thank you Paul for this suggestion. I just commented on the original idea.
Anyway, with the suggestion you've proposed, it means that a service able to
provide such an advanced/smart sync response would have to:
- either always run the query asynchronously
(even though the sync layer said it timed out, it would keep running in the
potential hope someone may click on the "wait for the response" link and
switch on the async endpoint)
- or restart the query if the user clicks on the "wait for the response"
link
Maybe there is another technical server solution. If not, none of these solutions
convinces me that this server complexity is worth the effort. Let's keep things
simple.
I still think that such behavior should be applied on the client side. And that would
be even more feasible thanks to what Stelios has just proposed with a
standardized error type in DALI answers. If a sync query fails with a timeout
error, the client can propose to try again in async mode. Then, as I said, a client
may also decide to run everything in async mode and show the result immediately
if it succeeded quickly (as TAPHandle does).
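A minimal sketch of the client-side fallback described above, under stated assumptions: `SyncTimeout` stands in for the proposed standardized timeout error type, and `run_sync` / `run_async` are hypothetical callables rather than functions of any existing client library.

```python
class SyncTimeout(Exception):
    """Hypothetical stand-in for a standardized timeout error type."""


def query_with_fallback(run_sync, run_async, adql):
    """Try sync first; on a timeout error, rerun the query in async mode."""
    try:
        return run_sync(adql)
    except SyncTimeout:
        # The query starts again from scratch here.
        return run_async(adql)


calls = []


def slow_sync(adql):
    calls.append("sync")
    raise SyncTimeout("sync limit exceeded")


def patient_async(adql):
    calls.append("async")
    return ["row"]


result = query_with_fallback(slow_sync, patient_async, "SELECT 1 FROM t")
```

In this sketch the query is submitted twice whenever the sync attempt times out, once per mode.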
Cheers,
Grégory
On 15/07/2025 18:22, Paul Harrison wrote:
On 15 Jul 2025, at 14:41, Gregory MANTELET via dal <dal at ivoa.net> wrote:
To be honest, I don't like this false-sync endpoint either. Changing
the behavior of this endpoint that way is clearly a breaking change.
What I suggested is a backwards compatible extension for "dumb" clients, assuming that they always treat a non-2xx HTTP response as an error. It is still a sync endpoint that can time out (which could happen under the current UWS/TAP standard definitions) - it is just that the timeout *could* contain information that would allow a slightly less dumb client to see that it might still be possible to obtain the result if they are prepared to wait longer and do the full async protocol. Simple clients do not have to do any more work than before.
Paul.