[p3t] On structured errors

Tue Mar 12 20:12:11 CET 2024

Apologies for how long it's taken me to write up some initial thoughts on
this.

I think there was general consensus in our last meeting that we would like
to define a protocol for structured errors for IVOA protocols.  This
message lays out some initial thoughts to start that discussion.  It is in
three parts: a discussion of HTTP error codes, a discussion of some
features for structured errors that would be appealing, and a more
detailed look at an existing structured error protocol that we could use
for inspiration, namely that used by FastAPI.

HTTP error codes
================

I believe existing IVOA standards already say that appropriate HTTP error
codes should be used when returning errors, and we will want to stick with
that.  HTTP divides the world into two classes of errors: 4xx errors,
which indicate a problem with the client's request, and 5xx errors, which
indicate a server-side problem.

4xx errors are the more interesting and varied.  The obvious errors are
401 (authentication required but not provided or incorrect), 403
(permission denied), and 404 (resource not found).  I don't think there's
much controversy over when these should be used, so I'll pass over them,
except to note that 401 and 403 errors already define a structured error
mechanism in the WWW-Authenticate header, which should be used when
returning those responses.  Also returning a structured error body is
allowed but I think shouldn't be required, since it should be possible to
include all of the required information in WWW-Authenticate and clients
should expect to find it there anyway.

Most other 4xx errors are for errors at the HTTP protocol layer, below the
scope of the IVOA standards.  The remaining interesting error codes are
400, 422, and 429.

429 is a rate limit error.  Here, we should ask services to include a
Retry-After header where possible, in addition to providing a structured
error body with information about the rate limit if possible.  Between the
two, Retry-After is probably more important since it's an HTTP standard
for rate limited responses.  A structured error body may not be possible
depending on the implementation (for example, the rate limiting may be
done by an upstream hardware load balancer that doesn't understand IVOA
protocols).
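
To make the Retry-After point concrete, here is a minimal sketch of how a client might interpret that header.  HTTP allows the value to be either a number of seconds or an HTTP-date.  The function name and the policy of returning None for unusable values are my own assumptions, not part of any IVOA standard.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(
    value: Optional[str], now: Optional[datetime] = None
) -> Optional[float]:
    """Return how long to wait in seconds, or None if the header is unusable.

    Hypothetical helper: Retry-After may hold either a non-negative integer
    number of seconds or an HTTP-date.
    """
    if value is None:
        return None
    value = value.strip()
    # Delay-seconds form, e.g. "120".
    if value.isascii() and value.isdigit():
        return float(int(value))
    # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without a usable zone as UTC (an assumption).
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the client may retry immediately.
    return max(0.0, (when - now).total_seconds())
```

A client that gets None back would fall back to its own backoff policy, which also covers the case where an upstream load balancer sent the 429 without any header.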

For the remaining two, 400 is the catch-all error code for any error in
the client request not covered by other error codes.  422 is a less-used
error code, originally introduced for WebDAV, that indicates that the
request was a valid HTTP request but couldn't be processed by the server
due to semantic errors.

FastAPI uses the 422 error code to represent an input validation error, as
distinct from an error in the semantics of the underlying high-level
protocol.  (Note that both of these are, from an *HTTP* perspective, valid
requests with semantic errors.)  In other words, if one passes a string in
a numeric field or does not include a required parameter, FastAPI
generates a 422 error.  I think we should consider embracing this
distinction since it's useful for client debugging to be able to see at a
glance that a request was malformed as opposed to having some other
problem that would pass input validation (requesting too large of a search
radius for a cone search, for example).
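
As a rough sketch of that distinction, here is how a service might choose between the two status codes for a cone search request.  The radius limit, the function name, and the messages are hypothetical; only the split between "failed input validation" (422) and "valid input that the service rejects" (400) comes from the discussion above.

```python
from typing import Dict, Mapping, Tuple

# Hypothetical service-specific limit, in degrees.
MAX_RADIUS_DEG = 10.0


def check_cone_search(params: Mapping[str, str]) -> Tuple[int, str]:
    """Return (HTTP status, short message) for cone search parameters."""
    values: Dict[str, float] = {}
    for name in ("RA", "DEC", "SR"):
        # Missing or non-numeric parameters fail input validation: 422.
        if name not in params:
            return 422, f"{name} is required"
        try:
            values[name] = float(params[name])
        except ValueError:
            return 422, f"{name} must be a number"
    # Well-formed input that the service still refuses: 400.
    if values["SR"] > MAX_RADIUS_DEG:
        return 400, f"SR must not exceed {MAX_RADIUS_DEG} degrees"
    return 200, "OK"
```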

I think every 400 and 422 error returned by a protocol implementation
should use a structured error body, but note that clients should probably
not *require* a structured error body since it's always possible in HTTP
service implementations that some upstream intermediary will return some
error.  This will more commonly be a 5xx error, but 4xx errors are
possible (431 errors indicating the request headers were too large, for
example).
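
On the client side, that tolerance could look something like the following sketch.  It assumes the structured body is a JSON list of objects and uses a hypothetical "message" field name for the fallback; neither has been decided.

```python
import json
from typing import Any, Dict, List


def parse_error_body(body: str, content_type: str = "") -> List[Dict[str, Any]]:
    """Return a list of error objects, falling back to one generic error.

    Hypothetical helper: never raises, so an HTML error page from an
    upstream proxy is handled the same way as a structured error.
    """
    fallback = [{"message": body.strip() or "Unknown error"}]
    if "json" not in content_type.lower():
        return fallback
    try:
        data = json.loads(body)
    except ValueError:
        return fallback
    # Only accept the shape we expect: a non-empty list of JSON objects.
    if isinstance(data, list) and data and all(isinstance(e, dict) for e in data):
        return data
    return fallback
```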

There is less to say about 5xx errors, and in a lot of cases the body of
the error will be out of our control, so clients can't assume much.  I
would encourage implementations to return structured errors for 500 errors
where possible, but best effort is all we can do.

Features for structured errors
==============================

Some useful things to include in structured errors:

* A human-readable error description.  We may want to consider supporting
  two fields, one for a short error and one for extended error details,
  since that can aid GUI clients that want to display the human-readable
  error to a user.  For example, if one is displaying the error in red
  text in an input screen, knowing there is a field that contains a short
  error and won't contain, for example, a 100 line traceback is very
  helpful.  The error details should be optional, since not all errors
  will have extended details.

* An error code intended for software consumption.  It's a lost cause to
  attempt to catalog all possible errors and assign codes to all of them,
  but there are certain types of errors that we can anticipate for a given
  protocol and that are useful for software to be able to reliably parse,
  regardless of how the system chooses to explain them to users.  (Again,
  for example, too large of a search radius is a predictable cone search
  protocol error that we could assign an error code to.)  Error codes
  allow error messages for humans to be localized in an appropriate
  language while still allowing common software implementations to
  recognize certain types of errors.

  Since we can't provide a comprehensive list of possible error codes,
  there are two possible approaches for errors that don't match an
  existing code: fall back to something generic for all of those errors
  (a generic code like "error", or no code at all), or allow the
  implementation to make up its own error codes.  I personally prefer the
  latter, since it leaves open some useful collaboration between
  locally-written clients and servers for codes that are specific to a
  given implementation.  If we take that approach, non-standard error
  codes should probably use some sort of prefix or a different structured
  field to distinguish them from standard errors.  I think I prefer a
  different structured field.

* For errors that are specific to a particular input parameter, a
  designation of which input parameter was in error.  This is important
  for GUI clients, since it allows mapping the error to a specific input
  field and showing field-specific errors to the user.  To reuse the
  search radius example, if the server's structured error says that the
  error is with the search radius, the client can map that to the search
  radius input field and show the error next to that field, instead of
  showing a more generic error at the top or bottom of the input area.

* Some protocols (GitHub's API, for instance) provide a URL in each error
  response that points to a page with more details about that error,
  possible causes, etc.

* It's sometimes useful to echo the specific field value that triggered
  the error back to the user, particularly if it's in a deeply nested part
  of a complex input.

Note that a given request may have multiple errors (this is particularly
common for requests that fail input validation), and therefore the
structured error body should be a list of errors.  Clients can choose to
only process the first error to save on client complexity, and we should
explicitly bless that and indicate that services should attempt to put the
most important error first if they are returning multiple errors.
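
To make the list of features more tangible, here is one possible shape for an error object and its serialization as a list.  Every field name here is a placeholder I made up for illustration, not a proposal for the final names.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, List, Optional


@dataclass
class StructuredError:
    """Hypothetical error object covering the features listed above."""

    message: str                     # short human-readable error
    code: Optional[str] = None       # machine-readable error code
    detail: Optional[str] = None     # optional extended details
    parameter: Optional[str] = None  # input parameter in error, if any
    value: Any = None                # offending input value, if useful
    url: Optional[str] = None        # link to more information


def serialize_errors(errors: List[StructuredError]) -> str:
    """Serialize errors as a JSON list, omitting fields that are unset."""
    return json.dumps(
        [{k: v for k, v in asdict(e).items() if v is not None} for e in errors]
    )


def first_error(body: str) -> dict:
    """What a simple client might do: look only at the first error."""
    return json.loads(body)[0]
```

The service is responsible for ordering the list, so a client that calls only first_error still sees the most important problem.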

Another issue worth considering is localization, namely how to return
errors in multiple languages and/or how to indicate the language of the
error response.  I'm not sure if we want to tackle this; localization is a
huge topic that deserves its own expertise and careful design.  It's
something that could be deferred to a later day, as long as we use an
extensible structured error protocol.  I think the important thing we can
do for the first round is to add error codes where possible, since they
allow subsequent localization of the human-readable error without breaking
software that needs to understand the type of error.
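
As a small illustration of why error codes help here, a client could keep its own message catalog keyed by code and use the server's human-readable text only as a fallback.  The catalog, the code "radius_too_large", and the field names "code" and "message" are all invented for this sketch.

```python
from typing import Dict

# Hypothetical client-side catalog: language -> error code -> message.
MESSAGES: Dict[str, Dict[str, str]] = {
    "en": {"radius_too_large": "The search radius is too large."},
    "fr": {"radius_too_large": "Le rayon de recherche est trop grand."},
}


def localized_message(error: dict, language: str) -> str:
    """Prefer a local translation for a known code, else the server's text."""
    catalog = MESSAGES.get(language, {})
    code = error.get("code")
    return catalog.get(code, error.get("message", "Unknown error"))
```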

One possible protocol
=====================

The FastAPI structured error protocol is just a serialization of the
Pydantic error structure.  FastAPI wraps it in a JSON object under a
"detail" key; the errors themselves are a JSON list of JSON objects with
the following fields of interest:

input
    The input value that failed validation.

loc
    The input parameter that triggered the error.  This is a list that
    represents a path in the request.  The first element indicates whether
    the error is in a query parameter, header, body field, etc.  The
    subsequent elements indicate the name of the parameter, header, or
    body field.  For structured bodies, this is a list of keys that
    essentially form a JSON path in the body.

msg
    The human-readable error.

type
    The error code.

url
    A URL that contains more information about this error.

This maps to the features described above.  These field names aren't
great, and I don't think they're the ones we'd choose, but I think this
is an interesting set of data to consider.
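
For a feel of how a client would consume these fields, here is a sketch that groups messages by the field named in "loc".  The payload is made up for illustration using the field names listed above; it is not captured from a real FastAPI response, and the helper names are hypothetical.

```python
import json
from typing import Dict, List, Tuple

# Illustrative list of errors only; the values are invented.
EXAMPLE_BODY = """
[
  {
    "type": "float_parsing",
    "loc": ["query", "sr"],
    "msg": "Input should be a valid number",
    "input": "large"
  },
  {
    "type": "missing",
    "loc": ["body", "target", "position", "ra"],
    "msg": "Field required",
    "input": {"dec": 12.5}
  }
]
"""


def split_loc(loc: list) -> Tuple[str, str]:
    """Split loc into (part of the request, dotted path to the field)."""
    source, *path = loc
    return str(source), ".".join(str(part) for part in path)


def errors_by_field(body: str) -> Dict[Tuple[str, str], List[str]]:
    """Group human-readable messages by the field they apply to."""
    result: Dict[Tuple[str, str], List[str]] = {}
    for error in json.loads(body):
        result.setdefault(split_loc(error["loc"]), []).append(error["msg"])
    return result
```

A GUI client could then look up ("query", "sr") to decide which input field should display the message.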

As discussed above, I would change "type" to two separate error code
fields, one that holds the standardized error code for this error if one
is available, and the other of which holds an ad hoc local error code that
will vary from implementation to implementation but will be consistent for
a given implementation.  The second should only be used if there is no
appropriate standardized error code, I think.  In that case, I'm not sure
if the first error code should be omitted or if it should be set to a
generic code like "error".  I think the latter probably is better; it
creates fewer edge cases.
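
A minimal sketch of that two-field scheme, assuming hypothetical field names "code" and "local_code" and an invented set of standardized codes:

```python
from typing import Dict

# Hypothetical standardized codes; a real list would come from each
# protocol's specification.
STANDARD_CODES = {"radius_too_large", "invalid_parameter"}
GENERIC_CODE = "error"


def error_codes(code: str) -> Dict[str, str]:
    """Return the code fields for an error.

    Standardized codes go in "code".  Anything else is reported as the
    generic code, with the implementation-specific code in "local_code".
    """
    if code in STANDARD_CODES:
        return {"code": code}
    return {"code": GENERIC_CODE, "local_code": code}
```

With this approach a generic client only ever has to inspect "code", while a locally-written client can additionally switch on "local_code".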

-- 
Russ Allbery (eagle at eyrie.org)             <https://www.eyrie.org/~eagle/>