Page MenuHomePhabricator

Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag
Closed, ResolvedPublicBUG REPORT

Description

Summary:
This problem was related to encoded versions of the title parameter in the ATS DC routing logic. Some clients URL-encode : and / characters. Those requests weren't forced to the primary DC, which caused race conditions in reading & writing the token cache (main stash). It's unclear why this only started happening after the DC switchover.

The patch added variants, such as Special%3AOAuth%2Ftoken (for Special:OAuth/token). The patch also added a missing /authenticate OAuth 1.0 endpoint


Original task description:
Several external tools appear to be having issues with logging in with OAuth. All of the following will eventually work, but it might require N tries before you're actually logged in:

I only have access to CopyPatrol and was able to deduce it's OAuth that's failing. The JSON response we get back is:

{
  "error": "mwoauthdatastore-bad-token",
  "message": "No token was found matching your request."}
}

Make enough attempts and you will get back a token. This matches the reports for The Wikipedia Library (T332349#8707537) and other OAuth clients (T332429). These seem very similar and hint at a broader issue.

Interestingly, other tools like Global Search and XTools seem to have no issue with logging in, and they're using the same package as CopyPatrol (mediawiki/oauthclient). All are using OAuth 1.0a as far as I can tell.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Can confirm the Wikipedia Library's error is the same: Error: No token was found matching your request.

When I tested this I got "This request has already been completed and cannot be resubmitted" (mwoauthdatastore-request-token-already-used) instead.

mwoauthdatastore-bad-token means the request token failed to verify, mwoauthdatastore-request-token-already-used means the user visited the same URL (with the same request token reference in it) which are in theory strictly client-side errors. We have some logging for these, although it's not very informative. We should figure out how to log the consumer ID at least.

I see these errors in the logs for the last 24h:

  1. The authorization headers in your request are for a user that does not exist here x2,879
  2. No approved grant was found for that authorization token. x726
  3. No token was found matching your request. x247
  4. Sorry, something went wrong connecting this application. Go back and try to connect your account again, or contact the application author. x22
  5. This request has already been completed and cannot be resubmitted. Go back to the application and try to connect your account again, or contact the application author. x21
  6. Unknown OAuth key x20
  7. Invalid consumer x6
  8. XXX is an owner-only Connected App. x11
  9. OAuth callback URL not found in cache. This is probably an error in how the application makes requests to the server. x5
  10. Invalid consumer key x1
  11. The verification code provided was not valid. x1

(What's the difference between "Unknown OAuth key", "Invalid consumer" and "Invalid consumer key"? Beats me.)

#1 is a possible cause for some "only works on the second try" errors: the local user does not get autocreated in time (T208443: User cannot log in with OAuth on a wiki before visiting that wiki directly), somehow in the process of getting an error the account gets autocreated, next time everything is OK. (That would imply that a user can only ever get the "only works on second try" behavior once per wiki) But most tools link to meta, everyone should have an account there, and the error message doesn't match anyway.

#2 seems like something that could be cause by replag (user approves the app and that creates a grant record, but it's not seen when the app does the final handshake step) so again an "only works the second time" approach would make sense. (That would imply that a user can only ever get the "only works on second try" behavior once per app.) But we do primary fallbacks, there is some special-casing to force these requests to go to the primary DC (although maybe that could have gotten confused by the DC switch), and in any case this error would happen after the handshake, when the app tries to sign an API request.

#3 is the issue in the task description. There was a massive jump in its frequency on the 14th around 11h:

Screenshot Capture - 2023-03-20 - 20-47-12.png (568×1 px, 50 KB)
which pretty cleanly correlates with T331541: 14 March 2023 eqiad Service repooling (and nothing else that I can see). I don't immediately see how this would result in a "works on the second try" behavior (but could just be something stochastic with a low probabiliy that is unlikely to happen twice in a row), nor how it would be caused by changes to services - the only service involved is the cache where the request token is stored, but if that fails, that should result in mwoauthdatastore-request-token-not-found as far as I can see.

The stack trace is

from /srv/mediawiki/php-1.40.0-wmf.27/extensions/OAuth/src/Backend/MWOAuthDataStore.php(244)
#0 /srv/mediawiki/php-1.40.0-wmf.27/extensions/OAuth/src/Backend/MWOAuthServer.php(266): MediaWiki\Extension\OAuth\Backend\MWOAuthDataStore->new_access_token(MediaWiki\Extension\OAuth\Backend\MWOAuthToken, MediaWiki\Extension\OAuth\Backend\OAuth1Consumer, string)
#1 /srv/mediawiki/php-1.40.0-wmf.27/extensions/OAuth/src/Frontend/SpecialPages/SpecialMWOAuth.php(195): MediaWiki\Extension\OAuth\Backend\MWOAuthServer->fetch_access_token(MediaWiki\Extension\OAuth\Backend\MWOAuthRequest)
#2 /srv/mediawiki/php-1.40.0-wmf.27/includes/specialpage/SpecialPage.php(701): MediaWiki\Extension\OAuth\Frontend\SpecialPages\SpecialMWOAuth->execute(string)
...

which is not what I would have expected - mwoauthdatastore-bad-token is thrown in two different places, getting to this one means looking up the request token in the cache was successful; but then either MWOAuthToken::getVerifyCode() or MWOAuthToken::getAccessKey() returns empty.

I believe the workflow is something like this: leg 1 (an app -> MediaWiki direct request) creates a request token with empty code and access key and stores it in the cache; leg 2 (the user being sent to Special:OAuth for authorizing the app) updates the stored value with the code and access key; and leg 3 (another app -> MediaWiki direct request) looks up the request token and verifies those fields. So it seems the BagOStuff::set call in leg 2 is not seen by the BagOStuff::get call in leg 3. The bag uses $wgMWOAuthSessionCacheType which is db-mainstash in production.

Tgr renamed this task from OAuth sometimes doesn't give back the token and secret to Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag.Mar 21 2023, 5:50 AM
Tgr triaged this task as High priority.

All three legs are special-cased to go to the primary DC, but only if the correct format is used:

  • /w/index.php for Special:OAuth/initiate
  • /wiki/ for Special:OAuth/authorize
  • /w/index.php for Special:OAuth/token

(See T59500: Impossible to use https://www.mediawiki.org/wiki/Special:OAuth/initiate?format=&oauth_callback= style URL and T74186: Varnish: Mobile site redirect interferes with OAuth authorization process for background.)

So one thing to check is whether any applications get that wrong. Quarry and TWL do for authorize, which is the easier leg to check. (In theory initiate/token don't work at all with a pretty URL, but then no one actually tested that for a very long time, so who knows.) That means that they might end up in the secondary DC for leg 2 in which they update the cached value, but then read the cache in leg 3 in the primary DC, wich maybe happens before the update is propagated there, especially if the user is geographically closer to the primary DC. Although that doesn't explain why this bug wasn't happening before the DC switch.

Change 901333 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[operations/puppet@production] multi-dc: Use primary for OAuth for both URL forms

https://gerrit.wikimedia.org/r/901333

All three legs are special-cased to go to the primary DC, but only if the correct format is used:

  • /w/index.php for Special:OAuth/initiate
  • /wiki/ for Special:OAuth/authorize
  • /w/index.php for Special:OAuth/token

(See T59500: Impossible to use https://www.mediawiki.org/wiki/Special:OAuth/initiate?format=&oauth_callback= style URL and T74186: Varnish: Mobile site redirect interferes with OAuth authorization process for background.)

So one thing to check is whether any applications get that wrong. Quarry and TWL do for authorize, which is the easier leg to check. (In theory initiate/token don't work at all with a pretty URL, but then no one actually tested that for a very long time, so who knows.) That means that they might end up in the secondary DC for leg 2 in which they update the cached value, but then read the cache in leg 3 in the primary DC, wich maybe happens before the update is propagated there, especially if the user is geographically closer to the primary DC. Although that doesn't explain why this bug wasn't happening before the DC switch.

If I'm reading this right, it looks like that should be a straightforward change to mwoauth.initiate() so that separate uri formats are used for initiate and and authorize. Testing now.

@Tgr
It looks like we're not directly requesting authorize in our app, we're getting redirected there on wiki:
when I login to my local library dev environment, I request:
http://localhost/oauth/login/?next=/users/my_library/&from_homepage=true
and then go through the following redirects:

https://meta.wikimedia.org/w/index.php?title=Special:UserLogin&uselang=en&returnto=Special:OAuth/authenticate&returntoquery=&oauth_consumer_key=<xxx>&oauth_token=<xxx>

https://meta.m.wikimedia.org/w/index.php?returnto=Special:OAuth/authorize&returntoquery=%26oauth_consumer_key%3D<xxx>%26oauth_token%3D<xxx>&title=Special:UserLogin&uselang=en

https://login.wikimedia.org/wiki/Special:CentralAutoLogin/checkLoggedIn?type=script&wikiid=metawiki&mobile=1&proto=https&mobile=1&return=1&returnto=Special:OAuth/authorize&returntoquery=%26oauth_consumer_key%3D<xxx>%26oauth_token%3D<xxx>

https://meta.m.wikimedia.org/wiki/Special:CentralAutoLogin/createSession?token=<xxx>&type=script&return=1&returnto=Special%3AOAuth%2Fauthorize&returntoquery=%26oauth_consumer_key%3D<xxx>%26oauth_token%3D<xxx>&proto=https&mobile=1

https://login.wikimedia.org/wiki/Special:CentralAutoLogin/validateSession?token=<xxx>&wikiid=metawiki&type=script&return=1&returnto=Special%3AOAuth%2Fauthorize&returntoquery=%26oauth_consumer_key%3D<xxx>%26oauth_token%3D<xxx>&proto=https&mobile=1

https://meta.m.wikimedia.org/wiki/Special:CentralAutoLogin/setCookies?type=script&return=1&returnto=Special%3AOAuth%2Fauthorize&returntoquery=%26oauth_consumer_key%3D<xxx>%26oauth_token%3D<xxx>&proto=https&mobile=1

https://meta.wikimedia.org/w/index.php?title=Special:OAuth/authorize&oauth_consumer_key=<xxx>&oauth_token=<xxx>

I'm not sure how we could do anything different on our end.

@jsn.sherman I guess that could explain why the failure is only during the first attempt: if the app is set up correctly, the initial redirect will be to https://meta.wikimedia.org/wiki/Special:OAuth/authorize?oauth_consumer_key=<xxx>&oauth_token=<xxx>, but then, if the user is not logged in on that wiki, after the login redirect chain they will end up https://meta.wikimedia.org/w/index.php?title=Special:OAuth/authorize&oauth_consumer_key=<xxx>&oauth_token=<xxx> and trigger the mainstash race issue. On the next attempt, since they are already logged in, there is no redirection.

For the mwoauthdatastore-bad-token issue, the patch above should fix things (or at least if it doesn't, then these errors aren't actually related to pretty vs non-pretty URLs). But mobile users will still be affected by T74186: Varnish: Mobile site redirect interferes with OAuth authorization process. I'll follow up in that task.

@Tgr I should have checked through the user agents immediately once we had better logging in place: it looks like mostly desktop users are impacted.

Considering that all of the impacted clients are oauth 1.0a, I wonder if moving to oauth 2.0 might resolve the issue for us.

Considering that all of the impacted clients are oauth 1.0a, I wonder if moving to oauth 2.0 might resolve the issue for us.

This specific one, yes. In general, I think since most tools use OAuth 1, it gets more attention (OAuth 2 was fully broken for a day just last week), even though in theory OAuth 2 is simpler and so easier to maintain on both sides.

Considering that all of the impacted clients are oauth 1.0a, I wonder if moving to oauth 2.0 might resolve the issue for us.

This specific one, yes. In general, I think since most tools use OAuth 1, it gets more attention (OAuth 2 was fully broken for a day just last week), even though in theory OAuth 2 is simpler and so easier to maintain on both sides.

I wondered this as well, but I saw OAuth 2.0 has some grant options when registering a consumer, and I wasn't sure what I needed. I read over the official OAuth 2.0 documentation and am still confused :/ If some clarity is shed on this, I vow to update mw:OAuth/For Developers and mw:Oauthclient-php with the relevant info :)

This also made me worry my client (PHP-based mediawiki/oauthclient) may not support OAuth 2.0?

I wondered this as well, but I saw OAuth 2.0 has some grant options when registering a consumer, and I wasn't sure what I needed. I read over the official OAuth 2.0 documentation and am still confused :/ If some clarity is shed on this, I vow to update mw:OAuth/For Developers and mw:Oauthclient-php with the relevant info :)

There is some terminology confusion that we sadly do not explain either on the wiki page or on the app registration page. Grants in OAuth 1 are called scopes in OAuth 2 (although our OAuth 2 registration page unhelpfully still says grant). OAuth 2 has a "grant type" concept which is entirely unrelated. Grant types determine what OAuth 2 flow you can use. In practice the default is almost always fine, the use cases for client credentials are mostly covered by the owner-only flag.

This also made me worry my client (PHP-based mediawiki/oauthclient) may not support OAuth 2.0?

Yeah, you'd need a different client. There are quite a few, OAuth 2 has much better library support in general, although probably not in Wikimedia-specific libraries.

Change 901333 merged by Ssingh:

[operations/puppet@production] multi-dc: Use primary for OAuth for both URL forms

https://gerrit.wikimedia.org/r/901333

@Tgr
I just wanted to followup on this in case either, a) we should be doing something different, or b) there's room for improving the docs:

@jsn.sherman I guess that could explain why the failure is only during the first attempt: if the app is set up correctly, the initial redirect will be to https://meta.wikimedia.org/wiki/Special:OAuth/authorize?oauth_consumer_key=<xxx>&oauth_token=<xxx>, but then, if the user is not logged in on that wiki, after the login redirect chain they will end up https://meta.wikimedia.org/w/index.php?title=Special:OAuth/authorize&oauth_consumer_key=<xxx>&oauth_token=<xxx> and trigger the mainstash race issue. On the next attempt, since they are already logged in, there is no redirection.
...

We're using the "identify only" oauth 1.0a flow described here:
https://www.mediawiki.org/wiki/OAuth/For_Developers

Applications which only need minimal privileges (have been registered as User identity verification only) can use /authenticate instead of /authorize in step 2. This works the same way, but the user will only see the authorization dialog if they have not authorized this application before; otherwise the authorization will silently succeed.

which is why we're following a different path than what you've described as correct. Are there more details/requirements for this flow that we missed, or something like that? Or maybe the oauth server extension should be using a different redirect, or this path was just overlooked for the cache configuration, etc?

I've just encountered the said problem: My first attempt to log in to the Library failed. Access token generation failed, please try logging in again. When I tried again, it worked.

The patch did not work: https://logstash.wikimedia.org/goto/c8c557ced11daa1919864c6b2f91e69a
Either it's not doing what it's supposed to do, or some of the requests incorrectly going to the secondary DC wasn't the cause of this issue.

which is why we're following a different path than what you've described as correct.

The point here is that the first and third leg of the OAuth handshake should use a raw URL (with index.php in it), and the second one shouldn't. Whether it's /authorize or /authenticate doesn't matter. But in any case it seems this problem has a different cause (or there is something wrong with the multi-DC routing logic, but it's as simple as you can get).

The patch did not work: https://logstash.wikimedia.org/goto/c8c557ced11daa1919864c6b2f91e69a
Either it's not doing what it's supposed to do, or some of the requests incorrectly going to the secondary DC wasn't the cause of this issue.

The linked Logstash query, shows most have these two things in common that stand out (and are not not a limit of your query):

  1. The hosts are all Codfw hosts, makes sense, but confirms that it is indeed only happening when requests are routed there.
  2. The URLs contain non-canonical and unneeded percent encoding, with the exact URL: /w/index.php?title=Special%3AOAuth%2Ftoken.

In addition, the volume of errors has not changed since March 20th, and I see ~0 matches in Logstash for the URL patterns that the above multi-dc routing patch adds. As such, I suggest reverting it, restoring the comments, and instead adding Special%3AOAuth%2Ftoken as a routed title= value.

I have verified that at least for CopyPatrol, the request does not contain percent-encoding. https://logstash.wikimedia.org/goto/160f11fe7172efc486434c7548489d8a is the most recent test, which as you can see uses /w/index.php?format=json&title=Special:OAuth/token. This is using Oauthclient-php 2.0 (the latest version released four days ago).

@Tgr Trying to login to Wikipedia Library I and a fellow Wikipedian get "Access token generation failed, please try logging in again." meaning to "See T332650 for more information". Can you help?

  1. The URLs contain non-canonical and unneeded percent encoding, with the exact URL: /w/index.php?title=Special%3AOAuth%2Ftoken.

In the last 30 days, there are 4600 occurrences of the encoded variant and 1000 occurrences of the unencoded variant. I'm not sure that reflects anything beyond the distribution of clients which do / do not encode (encodeURIComponent encodes / and : so it's not surprising the encoded form is common).

I see ~0 matches in Logstash for the URL patterns that the above multi-dc routing patch adds.

Without filtering for that specific error (storage race conditions at different steps result in different errors):

  • /wiki/Special:OAuth/initiate: 13
  • /wiki/Special:OAuth/authorize: 26k
  • /wiki/Special:OAuth/authenticate: 3k
  • /wiki/Special:OAuth/token: 66
  • title=Special:OAuth/initiate: 38
  • title=Special:OAuth/authorize: 73
  • title=Special:OAuth/authenticate: 725
  • title=Special:OAuth/token: 1.3k
  • title=Special%3AOAuth%2Finitiate: 375
  • title=Special%3AOAuth%2Fauthorize: 0
  • title=Special%3AOAuth%2Fauthenticate: 53
  • title=Special%3AOAuth%2Ftoken: 5k

(I added the /authenticate endpoint which we forgot about.)

Note that we only log errors, so this isn't necessarily indicative of total OAuth traffic.

Change 907937 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[operations/puppet@production] multi-dc: Improve OAuth URL patterns for routing to primary

https://gerrit.wikimedia.org/r/907937

Would be nice to rely on the Lua plugin for URL decoding instead of checking for all parameter variations, but it appears unmaintained.

We have 3M total OAuth errors in the last 30 days but the most frequent normalized message only has 20K instances, so some log needs to be normalized. Probably not related to this issue, but should be improved, otherwise error volume is hard to track.

Change 907976 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/OAuth@master] Normalize logging of nonce error

https://gerrit.wikimedia.org/r/907976

Change 907976 merged by jenkins-bot:

[mediawiki/extensions/OAuth@master] Normalize logging of nonce error

https://gerrit.wikimedia.org/r/907976

I wanted to followup on the library side; when we try to complete the handshake here:
https://github.com/WikipediaLibrary/TWLight/blob/master/TWLight/users/oauth.py#L484

We're seeing two different kinds of exceptions raised in the client library:

  1. is the Error: No token was found matching your request. idp error response that we've previously mentioned
  2. is when query string doesn't reference the correct temp resource owner key and the client library bails out here

I hope that turns out to be useful in some way.

  1. is when query string doesn't reference the correct temp resource owner key and the client library bails out here

In theory that looks like a client-side problem, but then given that we have ongoing token store failures, it probably isn't. That would mean that the OAuth server is returning a success response after the user accepting the approval popup, but using a different request token from what the handshake was initiated with. As far as I can see that would be a bug somewhere in the server logic - if the token store lookup fails, it should return an error, not a success with a new token.

How frequent is this? If it's rare, maybe it's some sort of race condition where the user tried to log in multiple times in quick succession, or in multiple browser tabs.

How frequent is this? If it's rare, maybe it's some sort of race condition where the user tried to log in multiple times in quick succession, or in multiple browser tabs.

I'm not sure how this relates to the overall number of library logins but this has happened at least 3300 times since March 20th - it's definitely not rare.

How frequent is this? If it's rare, maybe it's some sort of race condition where the user tried to log in multiple times in quick succession, or in multiple browser tabs.

I'm not sure how this relates to the overall number of library logins but this has happened at least 3300 times since March 20th - it's definitely not rare.

That's true for the 'token not found' error; the key mismatch has only occurred 21 times since March 18.

I do wonder if we can identify some of the conditions under which this is occurring: eg, when this happens to people, are they always logged out from the wiki to begin with? I've still not been able to reproduce the issue, but I've usually already signed into meta before logging into the library. Are the impacted service providers all using the /authenticate path that we use, or is it also happening on the /authorize path?

I do wonder if we can identify some of the conditions under which this is occurring: eg, when this happens to people, are they always logged out from the wiki to begin with? I've still not been able to reproduce the issue, but I've usually already signed into meta before logging into the library. Are the impacted service providers all using the /authenticate path that we use, or is it also happening on the /authorize path?

Answered one of my own questions: oauthclient-php uses /authorize and that's also one of the paths Tgr reported errors for earlier, so it's independent of /authorize or /authenticate being used.

@Tgr Trying to login to Wikipedia Library I and a fellow Wikipedian get "Access token generation failed, please try logging in again." meaning to "See T332650 for more information". Can you help?

@jsn.sherman - I got this problem when I already was logged in to Wikipedia before logging in into the library. So I thought it is an OAuth/token problem.

@jsn.sherman - I got this problem when I already was logged in to Wikipedia before logging in into the library. So I thought it is an OAuth/token problem.

Well, that eliminates that as a variable too, then. Thanks!

Change 907937 merged by Ssingh:

[operations/puppet@production] multi-dc: Improve OAuth URL patterns for routing to primary

https://gerrit.wikimedia.org/r/907937

Mentioned in SAL (#wikimedia-operations) [2023-04-13T16:31:36Z] <sukhe> sudo cumin -b1 -s30 'A:cp-text' 'ats-backend-restart': T332650

Tgr claimed this task.
Tgr added a subscriber: ssingh.

Last 24h:

Screenshot Capture - 2023-04-14 - 10-59-26.png (560×1 px, 47 KB)

Thanks @Krinkle for figuring out the problem with the first version of the patch and @ssingh for handling the deployment!

Thanks a lot! Can we make a user notice in the TechNews about it?

  1. Hi, Re: User Notice - please could someone suggest Draft Wording? I.e. Is this just an informational announcement, or is there any action-item for tool-owners? IIUC, it is just informational, and a summary like this would be accurate - Please confirm or correct [Now added, and frozen for translation. I hope it's accurate enough.] :
  • For the last few weeks, some external tools had inconsistent problems with logging-in with OAuth. This has now been fixed.
  1. IIUC, It might be helpful to add a summary of the fix, into the top of the Description here (for folks visiting via Tech News). E.g. Something like:

This problem was related to encoded versions of the title parameter. Some clients URL-encode : and / characters. The patch added variants, such as Special%3AOAuth%2Ftoken (for Special:OAuth/token). The patch also added a missing /authenticate OAuth 1.0 endpoint

  1. @doctaxon You added the tag for (1.41.0-wmf.11; 2023-05-30) but I think that should be (1.41.0-wmf.5; 2023-04-17) ?

Woohoo! Thanks so much to Tgr and everyone else who helped get this fixed!