Problem details for HTTP APIs (rfc7807)
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Eevans
	Feb 24 2022, 10:30 PM

Description

RESTBase/HyperSwitch (allegedly )uses RFC 7807 HTTP problems, JSON-encoded response objects that describe errors. We should determine/decide what attributes and values (if any) are considered a part of the AQS contract. Additionally, in any instances where the current production system deviates from RFC 7807 (intentional or otherwise), do we treat this as a bug (read: fix it), or reproduce this in the new system?

For example: The RFC describes the detail attribute as a string, but see the following:

$ curl -X GET -H 'accept: application/json' 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/EN.WIKIPEDIA.ORG/all-access/all-agents/Banana/daily/20190101/201901a2'
{
  "type": "https://mediawiki.org/wiki/HyperSwitch/errors/invalid_request",
  "method": "get",
  "detail": [
    "end timestamp is invalid, must be a valid date in YYYYMMDD format"
  ],
  "uri": "/analytics.wikimedia.org/v1/pageviews/per-article/EN.WIKIPEDIA.ORG/all-access/all-agents/Banana/daily/20190101/201901a2"
}
$

In the returned error object, detail is an array.

Related Objects
Search...

Status	Assigned	Task
Stalled	None	T324931 Clean up open RESTBase related tickets
In Progress	None	T262315 <CORE TECHNOLOGY> API Migration & RESTbase Sunset
In Progress	• DAbad	T263489 AQS 2.0
Resolved	BPirkle	T302536 Problem details for HTTP APIs (rfc7807)

Event Timeline

Eevans created this task.Feb 24 2022, 10:30 PM

Eevans added a subscriber: lbowmaker.Feb 25 2022, 7:48 PM

odimitrijevic edited projects, added Data-Engineering; removed Analytics.Mar 3 2022, 7:37 PM

odimitrijevic moved this task from Incoming (new tickets) to Serve on the Data-Engineering board.

• Clarakosi added subscribers: • nnikkhoui, BPirkle, codebug.Mar 15 2022, 2:09 PM

New problem type proposal below.
Changes include:

detail is now a string, not an array (per the RFC)
Add title field, and put what was previously in detail in title field
Remove type, as it was more or less synonymous with the HTTP status code message

{
  "title": "Invalid granularity", 
  "detail": "The granularity provided is not recognized for this endpoint, valid values for 
        granularity are x, y, and z."
  "method": "get"
}

Other notes:

Do not include an instance field (which would be the current value of the uri field). The current uri field points to the analytics.wikimedia.org uri, whereas ideally we would have uri be the uri of the proxy (e.g. the API Gateway URI). Since the request won't have that uri, omit the field.

In T302536#7796646, @nnikkhoui wrote:
New problem type proposal below.
Changes include:

detail is now a string, not an array (per the RFC)

Add title field, and put what was previously in detail in title field

Remove type, as it was more or less synonymous with the HTTP status code message
{
  "title": "Invalid granularity", 
  "detail": "The granularity provided is not recognized for this endpoint, valid values for 
        granularity are x, y, and z."
  "method": "get"
}
Other notes:

Do not include an instance field (which would be the current value of the uri field). The current uri field points to the analytics.wikimedia.org uri, whereas ideally we would have uri be the uri of the proxy (e.g. the API Gateway URI). Since the request won't have that uri, omit the field.

If we're including method, we could include status as well.

@Milimetric @BTullis Wondering if you had any opinion or oppositions to the proposed new Problem object?

This is very interesting, thanks for reaching out.
I can see the discrepancy between what AQS returns and RFC 7807 and I can see how the new Problem object would address that.

A few things stand out for me, however.

Our currently defined spec for the Problem object appears to be RFC7807 compliant, so we aren't compliant with our existing spec. From this page:

Therefore I'd be tempted to consider our current behaviour a bug to be fixed.

The test case that I ran was also interesting in that it returned two distinct errors within the array of the detail field.

With this query:

curl -X 'GET' \
  'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/Unicycle%2520Hockey/monthly/20220310/2022031a' \
  -H 'accept: application/json'

I get this response:

So therefore any new implementation is going to have to work out what to do with these multiple errors: i.e. concatenate, use first, use last, rank in severity etc.

You mentioned removing the type field in the new Problem object

Remove type, as it was more or less synonymous with the HTTP status code message

As far as I can see from reading the RFC, the resulting Problem object wouldn't be valid: ref: https://datatracker.ietf.org/doc/html/rfc7807#section-3.1

Consumers MUST use the "type" string as the primary identifier for
the problem type; the "title" string is advisory and included only
for users who are not aware of the semantics of the URI and do not
have the ability to discover them (e.g., offline log analysis).

...and also https://datatracker.ietf.org/doc/html/rfc7807#section-4

New problem type definitions MUST document:

a type URI (typically, with the "http" or "https" scheme),

a title that appropriately describes it (think short), and

the HTTP status code for it to be used with.

Given those requirements, I'd be tempted to leave our existing Problem object as it stands, but try to correct the implementation subtly so that it matches the spec that we have already defined.

I might well have mis-read or failed to appreciate some detail with which you're more familiar, so please do feel free to tell me if I'm barking up the wrong tree.

I agree with making the spec match the implementation. I also agree with making the spec compliant with the RFC 7807 standard. It sounds like Ben's thinking is along the same lines, but I'm happy to actually read the spec and give a more thorough review if there's a need for it.

I thought a bit about how this might affect users that are relying on the current implementation, and whether we should keep serving the current endpoint and make the new one available on a new /v2/ path. I am leaning towards not doing that, even if it breaks some clients, because it's more of a bug fix than a new implementation. Worst case, if this causes people to be very upset with us, we can roll it back and version it at that point. I think the probability of that is very low since this is error handling and unlikely to be mission critical for anyone downstream.

In T302536#7817585, @BTullis wrote:

[ ... ]

You mentioned removing the type field in the new Problem object

Remove type, as it was more or less synonymous with the HTTP status code message

As far as I can see from reading the RFC, the resulting Problem object wouldn't be valid: ref: https://datatracker.ietf.org/doc/html/rfc7807#section-3.1

Consumers MUST use the "type" string as the primary identifier for
the problem type...

I keyed in on the part where it says "When this member is not present, its value is assumed to be about:blank.". I would also be fine with including it, and explicitly assigning it about:blank.

The reason I suggested omitting it is that the HyperSwitch errors are very generic (which makes sense because it's a framework), they map 1:1 to one with HTTP statuses, and don't seem (to me) to provide any value (we already know that status 400 is a bad request).

I see the following options here:

Omit type
Include type "about:blank" (same as above, but explicit)
Include the HyperSwitch types that correspond to HTTP statuses
Include new types that also only correspond to HTTP statues
Include new types that correspond to meaningful error types (think: bad granularity argument, invalid timestamp, or unrecognized agent string argument)

(4) is basically the same as (3), but without the (now inaccurate) HyperSwitch namespacing; Again, I don't think they provide any value. (1) and (2) land as being semantically identical to me -YMMV- but it's easiest, and honest (if you also don't believe that types corresponding 1:1 with HTTP statuses provide value). (5) is the Right Way™, it follows both the spirit and the letter of the RFC, but it could also be considered feature creep. :)

Eevans renamed this task from [DISCUSS]: Problem details for HTTP APIs (rfc7807) to Problem details for HTTP APIs (rfc7807).Apr 5 2022, 8:54 PM

Eevans triaged this task as Medium priority.

In T302536#7818374, @Milimetric wrote:

[ ... ]
I thought a bit about how this might affect users that are relying on the current implementation, and whether we should keep serving the current endpoint and make the new one available on a new /v2/ path. I am leaning towards not doing that, even if it breaks some clients, because it's more of a bug fix than a new implementation. Worst case, if this causes people to be very upset with us, we can roll it back and version it at that point. I think the probability of that is very low since this is error handling and unlikely to be mission critical for anyone downstream.

That was my thinking (speculation, really): That any downstream client reliant on this behavior would be unlikely.

I'm going to boldly attempt to summarize the discussion, and see if we've reached a point where we can move forward. If I've failed to capture current consensus, or overstepped in someway, please let me know!

For the new service, we will include the following attributes:

type
title
detail
method
status (because if we're including method...)

All of the values for type in HyperSwitch align 1:1 with the meaning of their corresponding HTTP status code. This doesn't provide any value (it's entirely redundant). The RFC says that in such cases type can be assigned a value of about:blank, and advises that the title then correspond to the status as well ("not found", "bad request", "unauthorized", etc). For what it's worth, HyperSwitch Problem titles also correspond to the HTTP statuses. Thus, the new services should use a type of about:blank, and title attributes that are named for the HTTP statuses they correspond to.

We are going to err on the side of RFC 7807 compatibility with respect to detail attributes (i.e. they will be a string, and not an array of strings), on the assumption that a change is not likely to be disruptive to downstream clients.

The one thing the discussion doesn't seem to have touched on (yet), is the uri attribute; The uri attribute seems to be a misguided attempt at providing what the RFC calls instance. I'm not sure we can provide a meaningful value for this in our environment where routing will result in a value incongruent with what the caller used in their request. Do we continue to include this information? And if so, do we include it as uri (not compliant), instance (compliant use of attribute, but perhaps wrong if it doesn't accurately reflect instance), or both?

In T302536#7833771, @Eevans wrote:

For the new service, we will include the following attributes:

We are going to err on the side of RFC 7807 compatibility with respect to detail attributes (i.e. they will be a string, and not an array of strings), on the assumption that a change is not likely to be disruptive to downstream clients.

Do we continue to include this information? And if so, do we include it as uri (not compliant), instance (compliant use of attribute, but perhaps wrong if it doesn't accurately reflect instance), or both?

This seems useful to me. The first part of the path is different, but what small confusion that causes seems less important than seeing the params that the service understood. It could be that the user would find a difference between what they pass and what is seen here, meaning some parsing went wrong somewhere. I think we should continue to include it, but I agree it's tricky to name. How about instance_part?

In T302536#7834045, @Milimetric wrote:

In T302536#7833771, @Eevans wrote:

For the new service, we will include the following attributes:

+1

We are going to err on the side of RFC 7807 compatibility with respect to detail attributes (i.e. they will be a string, and not an array of strings), on the assumption that a change is not likely to be disruptive to downstream clients.

+1

Do we continue to include this information? And if so, do we include it as uri (not compliant), instance (compliant use of attribute, but perhaps wrong if it doesn't accurately reflect instance), or both?

This seems useful to me. The first part of the path is different, but what small confusion that causes seems less important than seeing the params that the service understood. It could be that the user would find a difference between what they pass and what is seen here, meaning some parsing went wrong somewhere. I think we should continue to include it, but I agree it's tricky to name. How about instance_part?