
Custom OAuth 2 error from Wikimedia infrastructure breaks automatic retry of requests
Open, Needs Triage · Public · BUG REPORT

Description

Summary: A custom error structure returned by the Wikimedia infrastructure for requests with an expired OAuth 2 access token prevents m3api-oauth2, and possibly other MediaWiki OAuth 2 libraries, from automatically refreshing the token and retrying the request.

Steps to replicate the issue (include links if applicable):

  • git clone https://gitlab.wikimedia.org/repos/m3api/m3api-examples.git (m3api-examples)
  • cd m3api-examples/webapp-clientside-vite-guestbook OR cd m3api-examples/webapp-serverside-express-guestbook (both are affected but will display the error in different ways; the client-side example offers more visibility because you can see all the relevant requests in the browser’s dev tools)
  • fresh-node -- npm install
  • fresh-node -net -- npm run start
  • go to http://localhost:8080/
  • log in via OAuth
  • optionally, click the “sign guestbook” button right now, just to confirm that it works; this should make a successful edit to the m3api-examples guestbook page
  • wait four hours for the access token to expire
    • if you’re feeling very adventurous, I suppose you could configure a lower $wgOAuth2GrantExpirationInterval (example) on mw-experimental and hack m3api (e.g. in node_modules/m3api/fetch.js) to include a request header like X-Wikimedia-Debug: backend=mw-experimental, then wait for a shorter expiration interval (you might need to do this before logging in, I’m not sure)
  • click the “sign guestbook” button

What happens?:
Envoy(?) returns a response with the HTTP status code 401 (Unauthorized) and the JSON body {"httpCode":401,"httpReason":"Jwt is expired"} (probably Envoy::JwtVerify::Status::JwtExpired). If I understand correctly, the request never even makes it to MediaWiki. (A similar error was previously seen in T417839.)

In the client-side example, this will result in a CORS error because the response does not have an Access-Control-Allow-Origin header. m3api’s code can’t even see the response.

In the server-side example, m3api detects this as an unexpected, non-MediaWiki error (because the status code is not 200 and there is no MediaWiki-API-Error response header) and aborts. m3api-oauth2 can’t register a handler for this kind of error.
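The detection logic described above can be sketched roughly like this (an illustrative simplification, not m3api’s actual code; `classifyResponse` is a hypothetical helper):

```javascript
// Rough sketch of the detection described above (illustrative only,
// not m3api's real implementation). A response counts as a MediaWiki-level
// error only if MediaWiki itself flagged it via the MediaWiki-API-Error
// header; any other non-200 response is an unexpected transport-level error
// that no library-level error handler gets a chance to recover from.
function classifyResponse(status, headers) {
  if (headers['mediawiki-api-error'] !== undefined) {
    return 'mediawiki-error'; // MediaWiki reported an API error
  }
  if (status !== 200) {
    return 'unexpected-error'; // e.g. the Envoy 401 described above
  }
  return 'ok';
}
```

Under this sketch, the Envoy 401 lands in the `'unexpected-error'` branch, which is exactly the path that aborts.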

What should have happened instead?:
The guestbook should be signed successfully.

The expected MediaWiki response to the request with the expired access token looks something like this:

{ "errors": [ { "code": "mwoauth-invalid-authorization" } ] }

(Though the exact format will vary depending on the request parameters, mainly the errorformat=.) m3api-oauth2 registers an error handler for this error code which automatically refreshes the access token and then retries the request, transparently to the application.
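The refresh-and-retry flow can be sketched generically like this (hypothetical helper names; m3api-oauth2’s real handler registration looks different):

```javascript
// Generic sketch of the refresh-and-retry flow (hypothetical helpers,
// not m3api-oauth2's actual API). When the MediaWiki response carries the
// mwoauth-invalid-authorization error code, refresh the access token and
// retry once, transparently to the caller.
async function requestWithRefresh(doRequest, refreshToken) {
  let response = await doRequest();
  const isExpired = response.errors !== undefined &&
    response.errors.some((e) => e.code === 'mwoauth-invalid-authorization');
  if (isExpired) {
    await refreshToken(); // obtain a fresh access token via the refresh token
    response = await doRequest(); // retry, invisible to the application
  }
  return response;
}
```

This only works if the error is recognizable in the first place, which is what the custom Envoy response breaks.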

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):
N/A, I think this problem is only present in Wikimedia infrastructure (I test the automatic refresh against a local wiki in CI and it still works there).

Other information (browser name/version, screenshots, etc.):
In the server-side example, the error looks like this:

API request returned non-200 HTTP status code: 401

Error: API request returned non-200 HTTP status code: 401
    at NodeSession.request (file:///webapp-serverside-express-guestbook/node_modules/m3api/core.js:569:10)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async NodeSession.requestAndContinue (file:///webapp-serverside-express-guestbook/node_modules/m3api/core.js:625:21)
    at async NodeSession.getToken (file:///webapp-serverside-express-guestbook/node_modules/m3api/core.js:706:22)
    at async NodeSession.request (file:///webapp-serverside-express-guestbook/node_modules/m3api/combine.js:28:4)
    at async file:///webapp-serverside-express-guestbook/routes/index.js:51:19

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper.

I'm not sure anything outside MediaWiki does JSON responses? Also the JWT is validated in three layers (Varnish, API Gateway, MediaWiki) but I think the first two would just apply anonymous-request-level rate limiting rather than reject the request, even if the JWT is expired. Not really sure about it though.

On the MediaWiki level, we did some big upgrades recently (T261462: Migrate OAuth extension back from wikimedia/oauth2-server fork to upstream) so that could also have caused a regression.

wait four hours for the access token to expire

Quick way is:

  • get into a production MediaWiki shell (mwscript shell xxwiki, doesn't matter which wiki)
  • decrypt the access token with MWOA::decryptAccessToken( '<token>' )
  • change the exp field
  • re-crypt using MW::srv()->getJwtCodec()->create( <claims array> )

I'm not sure anything outside MediaWiki does JSON responses? Also the JWT is validated in three layers (Varnish, API Gateway, MediaWiki) but I think the first two would just apply anonymous-request-level rate limiting rather than reject the request, even if the JWT is expired. Not really sure about it though.

On the MediaWiki level, we did some big upgrades recently (T261462: Migrate OAuth extension back from wikimedia/oauth2-server fork to upstream) so that could also have caused a regression.

I can’t find the string Jwt is expired anywhere in Codesearch (if it was in our PHP/composer dependencies, it should show up there), but it does show up in Envoy’s sources (kudos to @matmarex for finding another “Jwt” string in T417839#11630340). The response (I added some code to the server side to log it) also has the following headers:

server: envoy
server-timing: cache;desc="pass", host;desc="cp3068"
www-authenticate: Bearer realm="https://test.wikipedia.org/w/api.php?action=query&assert=user&crossorigin=&errorformat=html&format=json&formatversion=2&meta=tokens&type=csrf", error="invalid_token"
# (and others)
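A minimal parser for the `error` attribute of such a `www-authenticate` value might look like this (illustrative sketch; a full RFC 7235 parser handles more edge cases, such as unquoted or escaped parameter values):

```javascript
// Minimal sketch: pull the auth-scheme and the key="value" parameters out
// of a WWW-Authenticate header like the one above (illustrative only).
function parseWwwAuthenticate(value) {
  const [scheme, rest = ''] = value.split(/ (.*)/s);
  const params = {};
  for (const match of rest.matchAll(/(\w+)="([^"]*)"/g)) {
    params[match[1]] = match[2]; // e.g. params.error = 'invalid_token'
  }
  return { scheme, params };
}
```

A client could use `params.error === 'invalid_token'` as the refresh trigger, independently of the response body.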

So basically, Envoy just needs to pass through requests with expired access tokens to MediaWiki? The Access-Control-Allow-Origin issue is specific to Envoy, wouldn't be an issue with the MediaWiki error response, right?

m3api-oauth2 can’t register a handler for this kind of error.

That might be a problem for 429 errors from Envoy in the future.

So basically, Envoy just needs to pass through requests with expired access tokens to MediaWiki? The Access-Control-Allow-Origin issue is specific to Envoy, wouldn't be an issue with the MediaWiki error response, right?

I think so, though I’m not 100% sure. But yeah, Envoy passing through the requests is probably the best option. (If ops are worried about passing these requests through to the app layer, maybe they could be rate limited more strictly than requests with valid tokens. But I don’t think trying to make Envoy return responses that sufficiently resemble MediaWiki’s would be a good idea ^^)

The other option would be to do what we are doing for session cookie JWTs already and differentiate between hard expiry (the exp field, which is understood by any JWT parser) and a custom soft expiry field. So MediaWiki would reject tokens over 4 hours old, but the edge would only reject them after 24 hours, or something like that.
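The hard/soft expiry split could be sketched like this (the soft-expiry claim name `softExp` is made up here; the real session-cookie mechanism may use a different claim):

```javascript
// Sketch of the hard/soft expiry idea (hypothetical claim name `softExp`).
// The edge only enforces the standard `exp` claim; MediaWiki itself
// additionally enforces a shorter, application-level soft expiry, whose
// rejection is what triggers the client's token refresh.
function isTokenAcceptable(claims, nowSeconds, atEdge) {
  if (claims.exp !== undefined && nowSeconds >= claims.exp) {
    return false; // hard expiry: rejected everywhere
  }
  if (!atEdge && claims.softExp !== undefined && nowSeconds >= claims.softExp) {
    return false; // soft expiry: only MediaWiki rejects it
  }
  return true;
}
```

With, say, `softExp` at 4 hours and `exp` at 24 hours, the edge keeps passing the token through during the window where MediaWiki’s error can still drive a refresh.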

m3api-oauth2 can’t register a handler for this kind of error.

That might be a problem for 429 errors from Envoy in the future.

I should probably swap these two blocks, making sure Retry-After handling happens before checking the HTTP status code?
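The proposed reordering can be sketched as follows (illustrative pseudologic, not the actual m3api code):

```javascript
// Sketch of the proposed ordering (illustrative, not m3api's code):
// honour Retry-After first, and only afterwards treat a non-200 status
// as a fatal, unexpected error.
function decideNextStep(status, headers) {
  if (headers['retry-after'] !== undefined) {
    return { action: 'retry', afterSeconds: Number(headers['retry-after']) };
  }
  if (status !== 200) {
    return { action: 'throw' }; // unexpected non-MediaWiki error
  }
  return { action: 'process' };
}
```

With this order, a hypothetical Envoy 429 carrying a Retry-After header would be retried instead of aborting on the status code.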

The other option would be to do what we are doing for session cookie JWTs already and differentiate between hard expiry (the exp field, which is understood by any JWT parser) and a custom soft expiry field. So MediaWiki would reject tokens over 4 hours old, but the edge would only reject them after 24 hours, or something like that.

That would still break any application or tool where you might want to be logged in for more than 24 hours, if I understand correctly.

Yeah it would break tools with more than 24 hours between requests.

So basically, Envoy just needs to pass through requests with expired access tokens to MediaWiki? The Access-Control-Allow-Origin issue is specific to Envoy, wouldn't be an issue with the MediaWiki error response, right?

That's a problem - one stated goal of the API ratelimiting endeavour is to protect non-MediaWiki backends. At least for these, it's Envoy's job to reject the request that contains the expired token.

We could, for now, allow it, as a stop-gap measure. But it's not great in the long run.

If ops are worried about passing these requests through to the app layer, maybe they could be rate limited more strictly than requests with valid tokens.

The token payload would be ignored, so the request would be treated as unauthenticated.

In the client-side example, this will result in a CORS error because the response does not have an Access-Control-Allow-Origin header. m3api’s code can’t even see the response.

Yeah, this is rough. I just filed T418969 about the same problem for the rate-limited requests. The solution I suggested there would probably work here as well (let through the preflight OPTIONS request, and add some headers to the GET/POST request's response)… it should be fine to let the preflight OPTIONS request happen, right?


For the rest of the problem, I think it's something you'll have to solve on the side of the API client. Handling the successful API error responses from MediaWiki with HTTP 200 status is one thing, handling the error responses from something other than MediaWiki is another thing (and this could be our Envoy proxy, but also some Apache error page due to misconfiguration, a captive WiFi portal [if your library supports non-HTTPS sites], etc.). You're probably already handling other kinds of errors too, like the network connection being down.

I should probably swap these two blocks, making sure Retry-After handling happens before checking the HTTP status code?

That seems right at a glance.

In the client-side example, this will result in a CORS error because the response does not have an Access-Control-Allow-Origin header. m3api’s code can’t even see the response.

Yeah, this is rough. I just filed T418969 about the same problem for the rate-limited requests. The solution I suggested there would probably work here as well (let through the preflight OPTIONS request, and add some headers to the GET/POST request's response)… it should be fine to let the preflight OPTIONS request happen, right?

IIRC the OPTIONS request already went through fine (since that doesn’t actually include the Authorization header value, so the edge can’t tell it’s expired), only the GET’s response didn’t have the right header.

For the rest of the problem, I think it's something you'll have to solve on the side of the API client. Handling the successful API error responses from MediaWiki with HTTP 200 status is one thing, handling the error responses from something other than MediaWiki is another thing (and this could be our Envoy proxy, but also some Apache error page due to misconfiguration, a captive WiFi portal [if your library supports non-HTTPS sites], etc.). You're probably already handling other kinds of errors too, like the network connection being down.

Not really, since I don’t need to treat most errors as recoverable. If there’s a network error, I just let it bubble up. The problem in this task is that “expired token” is an error with a very clear recovery path, but WMF infra has made it harder to detect the error condition and distinguish it from other problems.

More generally speaking – I’m writing a MediaWiki Action API library that sends requests to the MediaWiki Action API and expects to get MediaWiki responses. It’s difficult when administrators somewhere along the request path decide to inject other software with its own arbitrary non-MediaWiki responses… I’m not sure how I’m expected to know about possible errors or how to correctly react to them.

I should probably swap these two blocks, making sure Retry-After handling happens before checking the HTTP status code?

That seems right at a glance.

Thanks, done (though not released yet).

Side note: I tried to see if there’s a standard way to signal expired access tokens in a response, but it doesn’t sound like it; RFC 6749 intentionally omits this:

1.5. Refresh Token

(F)  Since the access token is invalid, the resource server returns
     an invalid token error.

Steps (C), (D), (E), and (F) are outside the scope of this
specification, as described in Section 7.

7. Accessing Protected Resources

The methods used by the resource
server to validate the access token (as well as any error responses)
are beyond the scope of this specification

So as far as I can tell, expired access tokens are in the realm of application-specific errors; m3api-oauth2 has a handler for MediaWiki’s error in this case, but not for that now generated by (IIUC) Envoy. But if there’s a standard response structure that I missed, let me know – in that case it would probably make a lot of sense to make the WMF infra emit that and add support for it to libraries.

I think the relevant spec is RFC 6750:

If the protected resource request does not include authentication
credentials or does not contain an access token that enables access
to the protected resource, the resource server MUST include the HTTP
"WWW-Authenticate" response header field; it MAY include it in
response to other conditions as well.
[...]
 All challenges defined by this specification MUST use the auth-scheme
value "Bearer".
[...]
If the protected resource request included an access token and failed
authentication, the resource server SHOULD include the "error"
attribute to provide the client with the reason why the access
request was declined.  The parameter value is described in
Section 3.1.  In addition, the resource server MAY include the
"error_description" attribute to provide developers a human-readable
explanation that is not meant to be displayed to end-users.  It also
MAY include the "error_uri" attribute with an absolute URI
identifying a human-readable web page explaining the error.
[...]
For example [...] in response to a protected resource request with an
authentication attempt using an expired access token:

  HTTP/1.1 401 Unauthorized
  WWW-Authenticate: Bearer realm="example",
                    error="invalid_token",
                    error_description="The access token expired"

No idea whether we actually implement that (either in Envoy or in MediaWiki).

BTW @daniel should we exempt the /oauth2/access_token endpoint (which is used to get a new token) somehow? IIRC at least the Python standard OAuth implementation (oauthlib) will send an access token in the request in which it is using a refresh token to get a new access token. Maybe it's smart enough to not do that when the refresh is triggered by an expired access token, but I doubt it.

I think the relevant spec is RFC 6750:

No idea whether we actually implement that (either in Envoy or in MediaWiki).

Interesting – I think Envoy does follow it? There’s a www-authenticate response header in T419034#11674523. (error="invalid_token" doesn’t exactly scream “you must refresh” to me, but I suppose it’s no less informative than mwoauth-invalid-authorization.)

BTW @daniel should we exempt the /oauth2/access_token endpoint (which is used to get a new token) somehow? IIRC at least the Python standard OAuth implementation (oauthlib) will send an access token in the request in which it is using a refresh token to get a new access token. Maybe it's smart enough to not do that when the refresh is triggered by an expired access token, but I doubt it.

FWIW, when I was recently refactoring m3api-oauth2, I had to make sure it didn’t send an access token with the refresh request (here), otherwise the refresh request would fail. So if oauthlib does what you say, I suspect it’s already broken. (But I didn’t save any details on how the refresh request with expired access token failed. I think it was a MediaWiki-level error – this was probably before Envoy started checking these tokens at all – but that’s all I’ve got.)

BTW @daniel should we exempt the /oauth2/access_token endpoint (which is used to get a new token) somehow? IIRC at least the Python standard OAuth implementation (oauthlib) will send an access token in the request in which it is using a refresh token to get a new access token. Maybe it's smart enough to not do that when the refresh is triggered by an expired access token, but I doubt it.

Exempt it from rate limiting, or from jwt_authn? The former would be doable, but I have no idea how to do the latter...

I’m writing a MediaWiki Action API library that sends requests to the MediaWiki Action API and expects to get MediaWiki responses.

You are also implementing an OAuth2 client...

Interesting – I think Envoy does follow it? There’s a www-authenticate response header in T419034#11674523. (error="invalid_token" doesn’t exactly scream “you must refresh” to me, but I suppose it’s no less informative than mwoauth-invalid-authorization.)

Yes, and I'm now adding e2e tests to make sure :)

The spec says about it:

invalid_token
       The access token provided is expired, revoked, malformed, or
       invalid for other reasons.  The resource SHOULD respond with
       the HTTP 401 (Unauthorized) status code.  The client MAY
       request a new access token and retry the protected resource
       request.

Since the token comes from the server, the client can always just ask for a new one when it sees that code. I suppose there is no need to distinguish the possible reasons for the token to be invalid. There should be a limit on the number of retries of course.
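That retry-with-a-limit behaviour might look roughly like this (generic sketch, not any particular library’s implementation):

```javascript
// Generic sketch: on an invalid_token challenge, request a new access
// token and retry the protected-resource request, but give up after a
// bounded number of attempts so a persistent failure can't loop forever.
async function fetchWithTokenRetry(doRequest, getNewToken, maxRetries = 2) {
  let token = await getNewToken();
  for (let attempt = 0; ; attempt++) {
    const response = await doRequest(token);
    if (response.authError !== 'invalid_token') {
      return response; // success, or some unrelated error for the caller
    }
    if (attempt >= maxRetries) {
      throw new Error('invalid_token persisted after retries');
    }
    token = await getNewToken(); // ask for a fresh access token and retry
  }
}
```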

Change #1248398 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/deployment-charts@master] rest-gateway: test: check for www-authenticate error

https://gerrit.wikimedia.org/r/1248398

Exempt it from rate limiting, or from jwt_authn? The former would be doable, but I have no idea how to do the latter...

If it's easier to just exempt it wholesale, I think it's fine to do that - the API gateway is responsible for discouraging abuse of our resources, not DDoS protection (which happens at the edge), and you can't really use the access token endpoint in an abusive way (and most scrapers aren't going to use it at all).

Change #1248398 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: test: check for www-authenticate error

https://gerrit.wikimedia.org/r/1248398

By the way, I discovered that MediaWiki without a gateway in front of it has basically the same bug: T421778#11767754. You won't encounter this on Wikimedia wikis due to the gateway producing its own errors, so fixing that doesn't block this task, but I submitted a patch.

(This task isn’t related to confidential vs. non-confidential or browser-side vs. non-CORS-limited clients though, it affects everyone as far as I’m aware.)

I’ve now updated m3api-oauth2 to handle non-MediaWiki HTTP-level errors with WWW-Authenticate response headers, in m3api v1.1.0 + m3api-oauth2 v1.0.4 (also v1.0.5). As far as I can tell, this was successful – this edit automatically refreshed the access token (I added some console.log()s in node_modules/ to make it visible).

That said, if this task isn’t going to be fixed by restoring the previous behavior, IMHO the OAuth extension should be updated to emit a similar error to the one Wikimedia infrastructure produces (i.e. HTTP 401 + WWW-Authenticate header; it’s probably not necessary to exactly replicate the JSON response body that Envoy produces), so that the behavior is somewhat more consistent between Wikimedia and other OAuth-enabled wikis.

That said, if this task isn’t going to be fixed by restoring the previous behavior, IMHO the OAuth extension should be updated to emit a similar error to the one Wikimedia infrastructure produces (i.e. HTTP 401 + WWW-Authenticate header; it’s probably not necessary to exactly replicate the JSON response body that Envoy produces), so that the behavior is somewhat more consistent between Wikimedia and other OAuth-enabled wikis.

I'm not sure if that would be helpful. It's bad enough that we broke compatibility on Wikimedia wikis; I don't see benefit in breaking it on all other wikis also. If you're a library author, then realistically you're going to want to support both kinds of errors anyway…

Apart from that, can we consider this resolved?

I'm not sure if that would be helpful. It's bad enough that we broke compatibility on Wikimedia wikis; I don't see benefit in breaking it on all other wikis also.

This needn’t be a breaking change. API client libraries should already be prepared for non-200 response statuses from the API (it’s uncommon but has been possible for a long time, see e.g. T150344; WikiLambda especially does it quite a bit), and the added header shouldn’t break anything either. (The body should stay the same, so detecting the MediaWiki-level mwoauth-invalid-authorization would remain possible.)

If you're a library author, then realistically you're going to want to support both kinds of errors anyway…

And how would you find out that both kinds of errors exist if Wikimedia only ever responds with one kind? Not everyone is going to test their code against other wikis, I think.