Thoughts on standardizing TAP timeout/error handling

Stelios Voutsinas SVoutsinas at lsst.org
Wed Jul 16 18:52:46 CEST 2025


Hi everyone,

I wanted to bring up an issue we've been discussing over in the pyvo repo (astropy/pyvo#686) about TAP clients not being able to reliably distinguish query timeouts from other failures.

The basic problem is that different services handle timeouts in different ways: some return specific HTTP codes, others just drop the connection, and it does not currently seem possible for clients to recognize a timeout from the error message alone.
This makes it really hard for clients to know what actually went wrong and whether they should suggest something useful to the user (like "hey try async mode instead").

An initial thought was to check whether an existing HTTP status code such as 408 would fit, but as Mark Taylor pointed out, 408 isn't appropriate for processing timeouts (it signals that the server timed out waiting for the client's request), and beyond that there doesn't seem to be a good standard HTTP code for "query took too long to process." So we've been thinking about other approaches.


One idea that was brought up is to extend how we handle errors in DALI in a way that doesn't break existing clients: keep QUERY_STATUS simple (still just ERROR, OK, OVERFLOW) but add some structured info alongside it.

Something like this:


<VOTABLE>
  <RESOURCE type="results">
    <INFO name="QUERY_STATUS" value="ERROR">Query exceeded processing time limit</INFO>
    <INFO name="ERROR_TYPE" value="resource-limit"/>
    <INFO name="ERROR_SUBTYPE" value="time-limit-exceeded"/>
  </RESOURCE>
</VOTABLE>


The nice thing about this is that old clients just see ERROR and handle it however they already do, but newer clients can check the ERROR_TYPE and do something smarter like suggesting async for timeouts, or asking the user to reduce their result size for storage limits.
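
A minimal sketch of how a newer client might read these INFO elements, using only Python's standard library. The function name read_error_info is hypothetical, the ERROR_TYPE and ERROR_SUBTYPE names are the ones proposed here (not part of any current standard), and the sketch simply scans every INFO in the document rather than only those in the results RESOURCE:

```python
import xml.etree.ElementTree as ET

# Example response copied from the proposal (no VOTable namespace, to match
# the snippet as written; real responses usually declare one).
RESPONSE = """<VOTABLE>
  <RESOURCE type="results">
    <INFO name="QUERY_STATUS" value="ERROR">Query exceeded processing time limit</INFO>
    <INFO name="ERROR_TYPE" value="resource-limit"/>
    <INFO name="ERROR_SUBTYPE" value="time-limit-exceeded"/>
  </RESOURCE>
</VOTABLE>"""


def _local(tag):
    """Strip an XML namespace prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def read_error_info(votable_xml):
    """Return (dict of INFO name -> value, QUERY_STATUS message text)."""
    root = ET.fromstring(votable_xml)
    info = {}
    message = None
    for elem in root.iter():
        if _local(elem.tag) != "INFO":
            continue
        name = elem.get("name")
        info[name] = elem.get("value")
        if name == "QUERY_STATUS":
            message = (elem.text or "").strip()
    return info, message


info, message = read_error_info(RESPONSE)
status = info.get("QUERY_STATUS")
error_type = info.get("ERROR_TYPE")        # None if the service lacks the extension
error_subtype = info.get("ERROR_SUBTYPE")  # None if the service lacks the extension
```

A client that finds error_type set to None would fall back to whatever it already does for a plain ERROR status, which is the backwards-compatible behaviour described above.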

I'm thinking we could have error types like:
- resource-limit (with subtypes like time-limit-exceeded, storage-limit-exceeded)
- authorization (authentication-expired, insufficient-permissions)
- query-syntax (invalid-adql, table-not-found, etc.)
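
To make the list above concrete, here is a rough sketch of how a client could map type/subtype pairs to user hints. The table and the function name suggest are hypothetical, and the hint texts are purely illustrative; only the type and subtype strings come from the list above:

```python
# Hypothetical client-side lookup table. Keys are (type, subtype) pairs from
# the proposed list; the hint wording is illustrative only.
SUBTYPE_HINTS = {
    ("resource-limit", "time-limit-exceeded"): "Query timed out; try async mode instead.",
    ("resource-limit", "storage-limit-exceeded"): "Result too large; try reducing the result size.",
    ("authorization", "authentication-expired"): "Credentials expired; please authenticate again.",
    ("authorization", "insufficient-permissions"): "You lack permission for this query.",
    ("query-syntax", "invalid-adql"): "The ADQL could not be parsed; check the query.",
    ("query-syntax", "table-not-found"): "A table named in the query does not exist.",
}

# Coarser fallback for subtypes the client does not know about yet.
TYPE_HINTS = {
    "resource-limit": "The query hit a service resource limit.",
    "authorization": "The query was rejected for authorization reasons.",
    "query-syntax": "The query was rejected as invalid.",
}


def suggest(error_type, error_subtype=None):
    """Return a hint, falling back from subtype to type to None."""
    if error_type is None:
        return None  # service does not send the extension
    hint = SUBTYPE_HINTS.get((error_type, error_subtype))
    if hint is None:
        hint = TYPE_HINTS.get(error_type)
    return hint
```

Falling back from subtype to type means a client written today would still do something sensible if new subtypes are added later.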

We could even include some additional context when it's helpful:

<INFO name="ERROR_TYPE" value="resource-limit"/>
<INFO name="ERROR_SUBTYPE" value="time-limit-exceeded"/>
<INFO name="SYNC_LIMIT_SECONDS" value="30"/>
<INFO name="ASYNC_LIMIT_SECONDS" value="3600"/>
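
As an illustration of what a client could do with this extra context, here is a small sketch that turns the limits into a user-facing message. It assumes the INFO elements have already been collected into a dict of name to value strings, and that the failed query was run in sync mode; the function name timeout_message is hypothetical:

```python
def timeout_message(info):
    """Build a hint from the optional limit INFOs (all values are strings)."""
    if info.get("ERROR_SUBTYPE") != "time-limit-exceeded":
        return None
    msg = "Query exceeded the processing time limit"
    sync_limit = info.get("SYNC_LIMIT_SECONDS")
    async_limit = info.get("ASYNC_LIMIT_SECONDS")
    if sync_limit is not None:
        # Assumes the failed query ran in sync mode.
        msg += f" of {int(sync_limit)} s"
    msg += "."
    if async_limit is not None:
        msg += f" Async queries may run for up to {int(async_limit)} s."
    return msg


# INFO values as they would appear in the example above.
example = {
    "ERROR_TYPE": "resource-limit",
    "ERROR_SUBTYPE": "time-limit-exceeded",
    "SYNC_LIMIT_SECONDS": "30",
    "ASYNC_LIMIT_SECONDS": "3600",
}
hint = timeout_message(example)
```

Because both limits are optional in this sketch, a service that only sends ERROR_TYPE and ERROR_SUBTYPE still produces a usable, if less specific, message.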


Though I can see the argument that this descriptive metadata over-complicates things; whether it is worth including depends on whether clients would actually use it.
I think the main thing is agreeing on the error types and subtypes first.


This would give clients enough info to make better decisions about retries, or about which remedy to recommend to the user, without being too prescriptive about what they should do.
Markus suggested this might be a good candidate for a standard vocabulary, which makes sense to me. And the approach seems extensible enough that we could add new error types later without breaking anything.
I think this could be something we try out in a few implementations (DaCHS, Rubin, pyvo, maybe TOPCAT) and see how it works in practice before making it official in DALI.


What do you all think? Does this seem like a reasonable direction? Are there other error scenarios we should be thinking about? Any obvious problems with this approach?

The timeout issue is relevant for Rubin because we've decided to use different timeouts for async and sync queries and want to push users towards async, so this would be one step towards making that easier.

So it would be great to find a way forward on this if it seems reasonable. (The full discussion is in the GitHub issue, astropy/pyvo#686, if anyone wants more context.)

Thanks,

Stelios Voutsinas

