[p3t] On structured errors

Mon Mar 25 15:03:40 CET 2024

Hi Russ,

I'd like to suggest a couple of options that could enable us to 
categorise the different 422 errors by adding some metadata links to 
each error message.

# -----------------------------------------------------

The first option is a `namespace` field that contains a URL that points 
to a human readable document that describes the error codes in that 
namespace.

In the case of our IVOA standards, the `namespace` field would contain 
the `standardID` of the relevant specification. In turn, each IVOA 
standard document should define a list of the error codes that apply for 
that standard.

This allows us to link the error messages to the relevant layer in the 
IVOA architecture e.g. UWS, DALI, TAP, ADQL etc.

     httperror: 429
     namespace: "ivo://ivoa.net/std/UWS#rest-1.1"
     errorcode: 12
     message: "Service unavailable, retry after [120] seconds"

     httperror: 422
     namespace: "ivo://ivoa.net/std/TAP#1.1"
     errorcode: 10
     message: "Unknown output format [TSV]"

     httperror: 422
     namespace: "ivo://ivoa.net/std/ADQL#2.1"
     errorcode: 1012
     message: "Invalid column reference [cas] line [5]"

Note that the `namespace` URL is not intended to be de-referenced at 
runtime by the client software. It has two separate functions. The 
first, as the field name suggests, is to provide a namespace that lets 
us distinguish between identical error codes from different 
specifications, e.g. UWS error code 12 and TAP error code 12.

     httperror: 429
     namespace: "ivo://ivoa.net/std/UWS#rest-1.1"
     errorcode: 12
     message: "Service unavailable, retry after [120] seconds"

     httperror: 422
     namespace: "ivo://ivoa.net/std/TAP#1.1"
     errorcode: 12
     message: "Missing upload table [table-three]"

The `errorcode` alone is not guaranteed to be unique, but the 
combination of `namespace` and `errorcode` is, which means we don't have 
to have a central authority checking that none of the standards use the 
same error codes.

Second, it provides an off-line method for developers to look up the 
details of the error message and learn how to handle it. In the case of 
an application or implementation specific message, including 3rd party 
services that are not part of an IVOA standard, the `namespace` field 
should contain a URL that points to a human readable resource that 
describes the error codes for that service or application.

     httperror: 422
     namespace: "http://github/my-project/error-codes.md"
     errorcode: 34
     message: "Algorithm not supported [algo-21]"
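
The pairing of `namespace` and `errorcode` described above could be 
used by a client roughly as follows. This is only a sketch: the 
`ACTIONS` table and the function names are hypothetical, and the 
records carry just the fields needed for the lookup.

```python
# Sketch only: the action table and function names are hypothetical.
UWS = "ivo://ivoa.net/std/UWS#rest-1.1"
TAP = "ivo://ivoa.net/std/TAP#1.1"

# Client-side actions keyed on (namespace, errorcode), not on errorcode alone.
ACTIONS = {
    (UWS, 12): "wait, then retry the request",
    (TAP, 12): "upload the missing table",
}


def error_key(error: dict) -> tuple:
    """Return the (namespace, errorcode) pair that identifies an error type."""
    return (error["namespace"], error["errorcode"])


def action_for(error: dict) -> str:
    """Look up the action for an error, falling back to showing the message."""
    return ACTIONS.get(error_key(error), "show the message to the user")


uws_error = {"namespace": UWS, "errorcode": 12,
             "message": "Service unavailable, retry after [120] seconds"}
tap_error = {"namespace": TAP, "errorcode": 12,
             "message": "Missing upload table [table-three]"}

print(action_for(uws_error))
print(action_for(tap_error))
```

Both records use error code 12, but the tuple key keeps them apart 
without any coordination between the two standards.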

# -----------------------------------------------------

The second option builds on the `namespace` idea to define a schema for 
machine readable metadata describing the error messages.

Each structured error message would include an `errormeta` field that 
contains a URL that points to machine readable metadata about that type 
of error.

In the case of our IVOA standards, the `errormeta` field would point to 
a JSON document defined as part of the standard and published alongside 
the standard document, which provides machine readable details about 
the message, its severity, how to interpret the error message, how to 
display it, and language translations of the message text.

     httperror: 429
     errorcode: 12
     errormeta: "https://www.ivoa.net/static/errorcodes/uws-rest-1.1.json"
     message: "Service unavailable, retry after [120] seconds"

     httperror: 422
     errorcode: 10
     errormeta: "https://www.ivoa.net/static/errorcodes/tap-1.1.json"
     message: "Invalid format [special]"

     httperror: 422
     errorcode: 1012
     errormeta: "https://www.ivoa.net/static/errorcodes/adql-2.1.json"
     message: "Invalid column reference [albert] line [5]"

Similarly, 3rd party service providers and implementations could provide 
a URL pointing to a JSON document using the same schema to describe 
their error messages.

     httperror: 422
     errorcode: 34
     errormeta: "http://github/my-project/error-codes.json"
     message: "Algorithm not supported [algo-21]"
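
To make the `errormeta` idea more concrete, here is a sketch of how a 
client might use such a document. The schema is not defined yet, so 
everything about the JSON below (the `errors` map, `severity`, the 
per-language `templates`) is an assumption for illustration, as is the 
convention that values in square brackets are the message parameters. 
The metadata is embedded as a string rather than fetched.

```python
import json
import re

# Hypothetical content of the TAP metadata document. The schema (an "errors"
# map with "severity" and per-language "templates") is an assumption made
# for illustration only.
TAP_METADATA = json.loads("""
{
  "errors": {
    "10": {
      "severity": "error",
      "templates": {
        "en": "Invalid format [{0}]",
        "fr": "Format invalide [{0}]"
      }
    }
  }
}
""")


def localise(error: dict, metadata: dict, language: str = "en") -> dict:
    """Return the severity and display text for an error in one language."""
    entry = metadata.get("errors", {}).get(str(error["errorcode"]))
    if entry is None:
        # Unknown code: fall back to the message the service sent.
        return {"severity": "unknown", "text": error["message"]}
    # Assumption: values in square brackets are the message parameters.
    params = re.findall(r"\[([^\]]*)\]", error["message"])
    templates = entry.get("templates", {})
    template = templates.get(language, templates.get("en"))
    try:
        text = template.format(*params) if template else error["message"]
    except IndexError:
        # Template expects more parameters than the message carries.
        text = error["message"]
    return {"severity": entry.get("severity", "unknown"), "text": text}


error = {
    "httperror": 422,
    "errorcode": 10,
    "errormeta": "https://www.ivoa.net/static/errorcodes/tap-1.1.json",
    "message": "Invalid format [special]",
}
print(localise(error, TAP_METADATA, "fr"))
```

A code that is missing from the metadata falls back to the original 
message, so the client never ends up with less than the service sent.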

# -----------------------------------------------------

Both methods can be used alongside each other, allowing the client to 
select which one it uses: the simple `namespace` option, which points to 
human readable off-line documentation, or the on-line `errormeta` 
option, which provides machine readable metadata for the client to use 
at runtime.

     httperror: 429
     namespace: "ivo://ivoa.net/std/UWS#rest-1.1"
     errorcode: 12
     errormeta: "https://www.ivoa.net/static/errorcodes/uws-rest-1.1.json"
     message: "Service unavailable, retry after [120] seconds"

     httperror: 422
     namespace: "ivo://ivoa.net/std/TAP#1.1"
     errorcode: 10
     errormeta: "https://www.ivoa.net/static/errorcodes/tap-1.1.json"
     message: "Invalid format [special]"

     httperror: 422
     namespace: "ivo://ivoa.net/std/ADQL#2.1"
     errorcode: 1012
     errormeta: "https://www.ivoa.net/static/errorcodes/adql-2.1.json"
     message: "Invalid column reference [albert] line [5]"

     httperror: 422
     namespace: "http://github/my-project/error-codes"
     errorcode: 34
     errormeta: "http://github/my-project/error-codes.json"
     message: "Algorithm not supported [algo-21]"
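
A client that supports both fields could choose between them along 
these lines. This is a sketch under assumptions: `describe`, 
`offline_docs` and the injected `fetch` callable are hypothetical names, 
and the metadata layout (an `errors` map keyed by code) is invented for 
the example. The fetch is passed in so the sketch runs without a 
network connection.

```python
from typing import Callable, Optional


def describe(error: dict,
             offline_docs: dict,
             fetch: Optional[Callable[[str], dict]] = None) -> tuple:
    """Return (source, detail) using the richest information available."""
    code = error.get("errorcode")
    # Prefer the on-line metadata when the client is able to fetch it.
    if fetch is not None and "errormeta" in error:
        try:
            entry = fetch(error["errormeta"]).get("errors", {}).get(str(code))
        except OSError:
            entry = None  # network failure: fall back to off-line data
        if entry is not None:
            return ("errormeta", entry)
    # Otherwise use documentation the developer looked up ahead of time.
    key = (error.get("namespace"), code)
    if key in offline_docs:
        return ("namespace", offline_docs[key])
    # Last resort: the human readable message from the service.
    return ("message", error["message"])


tap_error = {
    "httperror": 422,
    "namespace": "ivo://ivoa.net/std/TAP#1.1",
    "errorcode": 10,
    "errormeta": "https://www.ivoa.net/static/errorcodes/tap-1.1.json",
    "message": "Invalid format [special]",
}
docs = {("ivo://ivoa.net/std/TAP#1.1", 10): "unsupported output format"}


def fake_fetch(url: str) -> dict:
    """Stand-in for an HTTP GET of the metadata document."""
    return {"errors": {"10": {"severity": "error"}}}


print(describe(tap_error, docs, fake_fetch))
print(describe(tap_error, docs))
```

A client with no network access simply omits `fetch` and relies on the 
`namespace` lookup, which matches the off-line use described above.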

Hope this is useful,
-- Dave

--------
Dave Morris
Research Software Engineer
UK SKA Regional Centre
Department of Physics and Astronomy
University of Manchester
--------
AIMetrics: []
--------

On 2024-03-12 19:12, Russ Allbery via p3t wrote:
> Apologies for how long it's taken me to write up some initial thoughts
> on this.
> 
> I think there was general consensus in our last meeting that we would
> like to define a protocol for structured errors for IVOA protocols.
> This message lays out some initial thoughts to start that discussion.
> It is in three parts: a discussion of HTTP error codes, a discussion of
> some features for structured errors that would be appealing, and a more
> detailed look at an existing structured error protocol that we could
> use for inspiration, namely that used by FastAPI.
> 
> HTTP error codes
> ================
> 
> I believe existing IVOA standards already say that appropriate HTTP
> error codes should be used when returning errors, and we will want to
> stick with that.  HTTP divides the world into two classes of errors:
> 4xx errors, which indicate a problem with the client's request, and 5xx
> errors, which indicate a server-side problem.
> 
> 4xx errors are the more interesting and varied.  The obvious errors are
> 401 (authentication required but not provided or incorrect), 403
> (permission denied), and 404 (resource not found).  I don't think
> there's much controversy over when these should be used, so I'll pass
> over them, except to note that 401 and 403 errors already define a
> structured error mechanism in the WWW-Authenticate header, which should
> be used when returning those responses.  Also returning a structured
> error body is allowed but I think shouldn't be required, since it
> should be possible to include all of the required information in
> WWW-Authenticate and clients should expect to find it there anyway.
> 
> Most other 4xx errors are for errors at the HTTP protocol layer, below
> the scope of the IVOA standards.  The remaining interesting error codes
> are 400, 422, and 429.
> 
> 429 is a rate limit error.  Here, we should ask services to include a
> Retry-After header where possible, in addition to providing a
> structured error body with information about the rate limit if
> possible.  Between the two, Retry-After is probably more important
> since it's an HTTP standard for rate limited responses.  A structured
> error body may not be possible depending on the implementation (for
> example, the rate limiting may be done by an upstream hardware load
> balancer that doesn't understand IVOA protocols).
> 
> For the remaining two, 400 is the catch-all error code for any error in
> the client request not covered by other error codes.  422 is a
> less-used error code that was originally introduced for WebDAV that
> indicates that the request was a valid HTTP request but couldn't be
> processed by the server due to semantic errors.
> 
> FastAPI uses the 422 error code to represent an input validation error,
> as distinct from an error in the semantics of the underlying high-level
> protocol.  (Note that both of these are, from an *HTTP* perspective,
> valid requests with semantic errors.)  In other words, if one passes a
> string in a numeric field or does not include a required parameter,
> FastAPI generates a 422 error.  I think we should consider embracing
> this distinction since it's useful for client debugging to be able to
> see at a glance that a request was malformed as opposed to having some
> other problem that would pass input validation (requesting too large of
> a search radius for a cone search, for example).
> 
> I think every 400 and 422 error returned by a protocol implementation
> should use a structured error body, but note that clients should
> probably not *require* a structured error body since it's always
> possible in HTTP service implementations that some upstream
> intermediary will return some error.  This will more commonly be a 5xx
> error, but 4xx errors are possible (431 errors indicating the request
> headers were too large, for example).
> 
> There is less to say about 5xx errors, and in a lot of cases the body
> of the error will be out of our control, so clients can't assume much.
> I would encourage implementations to return structured errors for 500
> errors where possible, but best effort is all we can do.
> 
> Features for structured errors
> ==============================
> 
> Some useful things to include in structured errors:
> 
> * A human-readable error description.  We may want to consider
>   supporting two fields, one for a short error and one for extended
>   error details, since that can aid GUI clients that want to display the
>   human-readable error to a user.  For example, if one is displaying the
>   error in red text in an input screen, knowing there is a field that
>   contains a short error and won't contain, for example, a 100 line
>   traceback is very helpful.  The error details should be optional,
>   since not all errors will have extended details.
> 
> * An error code intended for software consumption.  It's a lost cause
>   to attempt to catalog all possible errors and assign codes to all of
>   them, but there are certain types of errors that we can anticipate for
>   a given protocol and that are useful for software to be able to
>   reliably parse, regardless of how the system chooses to explain them
>   to users.  (Again, for example, too large of a search radius is a
>   predictable cone search protocol error that we could assign an error
>   code to.)  Error codes allow error messages for humans to be localized
>   in an appropriate language while still allowing common software
>   implementations to recognize certain types of errors.
> 
>   Since we can't provide a comprehensive list of possible error codes,
>   there are two possible approaches for errors that don't match an
>   existing code: use a generic code for all of those errors (like
>   "error") or omit the code entirely, or allow the implementation to
>   make up its own error codes.  I personally prefer the latter, since it
>   leaves open some useful collaboration between locally-written clients
>   and servers for codes that are specific to a given implementation.  If
>   we take that approach, non-standard error codes should probably use
>   some sort of prefix or a different structured field to distinguish
>   them from standard errors.  I think I prefer a different structured
>   field.
> 
> * For errors that are specific to a particular input parameter, a
>   designation of which input parameter was in error.  This is important
>   for GUI clients, since it allows mapping the error to a specific input
>   field and showing field-specific errors to the user.  To reuse the
>   search radius example, if the server's structured error says that the
>   error is with the search radius, the client can map that to the search
>   radius input field and show the error next to that field, instead of
>   showing a more generic field at the top or bottom of the input area.
> 
> * Some protocols (GitHub's API, for instance) provide URLs in error
>   responses that go to a page that provides more details about that
>   error, possible causes, etc.
> 
> * It's sometimes useful to echo the specific field value that triggered
>   the error back to the user, particularly if it's in a deeply nested
>   part of a complex input.
> 
> Note that a given request may have multiple errors (this is
> particularly common for requests that fail input validation), and
> therefore the structured error body should be a list of errors.
> Clients can choose to only process the first error to save on client
> complexity, and we should explicitly bless that and indicate that
> services should attempt to put the most important error first if they
> are returning multiple errors.
> 
> Another issue worth considering is localization, namely how to return
> errors in multiple languages and/or how to indicate the language of the
> error response.  I'm not sure if we want to tackle this; localization
> is a huge topic that deserves its own expertise and careful design.
> It's something that could be deferred to a later day, as long as we use
> an extensible structured error protocol.  I think the important thing
> we can do for the first round is to add error codes where possible,
> since they allow subsequent localization of the human-readable error
> without breaking software that needs to understand the type of error.
> 
> One possible protocol
> =====================
> 
> The FastAPI structured error protocol is just a serialization of the
> Pydantic error structure.  It is a JSON list of JSON objects with the
> following fields of interest:
> 
> input
>     The input value that failed validation.
> 
> loc
>     The input parameter that triggered the error.  This is a list that
>     represents a path in the request.  The first element indicates
>     whether the error is in a query parameter, header, body field, etc.
>     The subsequent parameters indicate the name of the parameter,
>     header, or body field.  For structured bodies, this is a list of
>     keys that essentially form a JSON path in the body.
> 
> msg
>     The human-readable error.
> 
> type
>     The error code.
> 
> url
>     A URL that contains more information about this error.
> 
> This maps to the features described above.  These field names aren't
> great, and I don't think are the ones we'd choose, but I think this is
> an interesting set of data to consider.
> 
> As discussed above, I would change "type" to two separate error code
> fields, one that holds the standardized error code for this error if
> one is available, and the other of which holds an ad hoc local error
> code that will vary from implementation to implementation but will be
> consistent for a given implementation.  The second should only be used
> if there is no appropriate standardized error code, I think.  In that
> case, I'm not sure if the first error code should be omitted or if it
> should be set to a generic code like "error."  I think the latter
> probably is better; it creates fewer edge cases.
> 
> --
> Russ Allbery (eagle at eyrie.org)             <https://www.eyrie.org/~eagle/>