Error when logging-in on auth.wikimedia.org: "The provided authentication token is either expired or invalid."
Closed, Resolved · Public · BUG REPORT

Description

What I did

  • Visited a page on mediawiki.org
  • Clicked the link to sign in
  • Entered my login credentials (on a sub-URL of auth.wikimedia.org)
  • Entered my 2FA code

What happened?
I was shown the message "The provided authentication token is either expired or invalid.", with no clear way to rectify the issue and no steps on what to do next if an error had occurred. (Confusingly, the top-right of the screen shows me as logged in.)

screenshot.png (1×2 px, 343 KB)

What should have happened instead?
Either the login completes successfully and I'm redirected back to the page I was originally on, or the end user is given steps on how they can resolve whatever issue has taken place/advice on what they should do next.

Other information
This happened to me around 20:10 UTC on 2025-04-01, in case that's helpful in finding any logs. I'm happy to privately share the auth.wikimedia.org URL that I was taken to (after entering my 2FA code) which displayed this message - let me know.

Event Timeline

Did you spend a lot of time on the login page? (I think "a lot of time" would be 60+ minutes, but maybe I'm wrong and the limit is just 5 or 10. Would have to check the sessionstore expiration spec.)

(And yeah the error message should be improved, even if this is just a timeout.)

Nope - my browser history has me viewing a page on MW.org at 20:08, on the auth.wikimedia.org login page at 20:08, and at the error message page at 20:10.

Thanks, that's very helpful information.

The fact that you are logged in is expected. This message is shown if you log in successfully on auth.wikimedia.org but when you return to the wiki you came from, it doesn't remember you started a login.

Could maybe happen if you start another login process in another browser tab (since the "remembering" part is done via the local session).

AFAIK, I did the entire login flow in the same tab.

For what it's worth, I couldn't immediately reproduce this again in a private window just now. Maybe the logs might show something interesting about what could have happened here, but I'll leave that to someone else (with log access) to determine!

Did you spend a lot of time on the login page? (I think "a lot of time" would be 60+ minutes, but maybe I'm wrong and the limit is just 5 or 10. Would have to check the sessionstore expiration spec.)

I just tested this waiting 20+ minutes on the login page in production, and the login and redirect both succeeded. I did the same thing locally, with the same results. I can only cause the error by editing the 'centralauthLoginToken' in the URL or purging caches locally.

We log a warning when this error is shown: https://gerrit.wikimedia.org/g/mediawiki/extensions/CentralAuth/+/c1db3aa9bdf65e6b7afc35050e3d724cd46fbc35/includes/Hooks/Handlers/RedirectingLoginHookHandler.php#96
…and these are the logs: https://logstash.wikimedia.org/goto/fbe038b384ab7130908b190b2929e80c

image.png (266×1 px, 34 KB)

…so this does happen to people fairly regularly, it wasn't just you.

There doesn't seem to be anything specific about the requests, they're for different wikis, login and signup, desktop and mobile. The tokens aren't obviously fake.

So there are two things we could do:

  • Investigate why the tokens are missing. They're stored in MicroStash and they're not supposed to just disappear. The expiry is set to 1 day (see here and here) – I checked a few log entries at random, and the same token did not appear in any logs older than 1 day. Is there a backend problem, or are we not writing them in the first place? Or is someone sending requests with fake/expired tokens?
  • Since the login succeeded, and we know which wiki the user tried to log into… maybe we could just initiate a top-level autologin when this happens?
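The second option can be sketched roughly as follows. This is a minimal illustration, not CentralAuth code: `TokenStore`, `complete_login`, `drop`, the payload shape and the return strings are all hypothetical names invented for the example, and single-use tokens are an assumption.

```python
import secrets
from typing import Dict, Optional


class TokenStore:
    """Hypothetical stand-in for a token store that may lose entries."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def tokenize(self, payload: dict) -> str:
        """Store the payload and return a random token that refers to it."""
        token = secrets.token_hex(16)
        self._data[token] = payload
        return token

    def detokenize(self, token: str) -> Optional[dict]:
        """Return the payload once; missing or already-used tokens yield None."""
        return self._data.pop(token, None)

    def drop(self, token: str) -> None:
        """Simulate the backend losing an entry before it is read."""
        self._data.pop(token, None)


def complete_login(store: TokenStore, token: str, logged_in_centrally: bool) -> str:
    payload = store.detokenize(token)
    if payload is not None:
        return "redirect:" + payload["return_to"]
    if logged_in_centrally:
        # The central login succeeded, so restart the local login
        # instead of showing the "expired or invalid" error.
        return "retry-local-login"
    return "error:token-expired-or-invalid"


store = TokenStore()
lost = store.tokenize({"return_to": "Main_Page"})
store.drop(lost)
print(complete_login(store, lost, logged_in_centrally=True))
```

The point of the sketch is only the ordering of the checks: a missing token becomes a retry when the user is known to be logged in centrally, and an error only otherwise.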

Change #1135484 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/CentralAuth@master] CentralAuthTokenManager: Log failures for write operations

https://gerrit.wikimedia.org/r/1135484

That should answer at least one of these questions.

Right, the session timeout used to be 60 minutes but got bumped to a day during the Redis -> Cassandra migration as Cassandra couldn't handle multiple expiries and central sessions expire in a day.

If the increase of the errors correlates with the WaitConditionLoop removal, that would indicate a race condition, but IIRC that happened later?

Could be some cross-DC thing maybe - I don't know if there's a good way to test that, the request where this is logged is POST and always goes to the primary. If the local Special:UserLogin request that redirects to the central domain always happens on the passive DC, that would be indicative (if the user is logged in centrally, the central domain will immediately redirect back, so the tokenstore write and read can happen quickly after each other), but I don't think there's an easy way to correlate that.

Microstash is cross-DC memcached, right? So in theory not the most reliable - could be evicting data because the slab is full. It does seem unlikely though.

Yeah. MicroStash in production is backed by memcached today. It's not very reliable for the centralauthLoginToken use case: since memcached uses the LRU algorithm, things persist based on popularity, and retention until the expiry time is not guaranteed.

That means tokens can be evicted before their expiry is reached, which makes them disappear and leads to this kind of issue. I made a patch to potentially resolve the issue. @Tgr / @matmarex, you can have a look.
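To illustrate how an entry can vanish long before its expiry, here is a toy LRU cache with a TTL. It is a deliberate simplification (real memcached manages memory in slab classes, and `LruTtlCache` is a made-up name), but the effect is the same: once the cache is full, newer writes push out the oldest entry regardless of its remaining lifetime.

```python
from collections import OrderedDict
from typing import Optional, Tuple


class LruTtlCache:
    """Toy cache: the TTL is an upper bound, capacity decides what stays."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, Tuple[str, float]]" = OrderedDict()

    def set(self, key: str, value: str, ttl: float, now: float) -> None:
        self._items.pop(key, None)
        self._items[key] = (value, now + ttl)
        while len(self._items) > self.capacity:
            # Evict the least recently used entry, ignoring its expiry time.
            self._items.popitem(last=False)

    def get(self, key: str, now: float) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._items[key]
            return None
        self._items.move_to_end(key)  # mark as recently used
        return value


DAY = 86_400
cache = LruTtlCache(capacity=2)
cache.set("login-token", "payload", ttl=DAY, now=0)
cache.set("other-1", "x", ttl=DAY, now=1)
cache.set("other-2", "y", ttl=DAY, now=2)
# Three seconds in, the login token is already gone despite its one-day TTL.
print(cache.get("login-token", now=3))
```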

Change #1135511 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/extensions/CentralAuth@master] SUL3: Instruct user in error message what to do next on failed login

https://gerrit.wikimedia.org/r/1135511

Change #1135484 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] CentralAuthTokenManager: Log failures for write operations

https://gerrit.wikimedia.org/r/1135484

Microstash is cross-DC memcached, right? So in theory not the most reliable - could be evicting data because the slab is full. It does seem unlikely though.

Yeah. MicroStash in production is backed by memcached today. It's not very reliable for the centralauthLoginToken use case: since memcached uses the LRU algorithm, things persist based on popularity, and retention until the expiry time is not guaranteed.

I have no idea what level of reliability I should expect from it. But… it looks like we have ~2k of these errors per day (https://logstash.wikimedia.org/goto/4ad6a332961e38bb57eb2c768ad48ac4), and ~60k successful non-API logins per day (https://logstash.wikimedia.org/goto/8e984237a7be09c3a56868375992bb49). Assuming that I'm pulling these numbers from the right place, and that they're not incorrect, we would have some 3% of logins failing with this error, caused by 3% of writes to MicroStash/Memcached magically disappearing, which would seem pretty bad.
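For reference, the arithmetic behind the "some 3%" figure, using the rounded daily counts quoted above. Whether the ~60k successful logins already include the failing ones is not stated, so both readings are shown; the variable names are just for this example.

```python
errors_per_day = 2_000
successful_logins_per_day = 60_000

# Reading 1: errors relative to successful logins.
ratio_to_successes = errors_per_day / successful_logins_per_day

# Reading 2: errors as a share of all attempts (successes plus errors).
share_of_attempts = errors_per_day / (errors_per_day + successful_logins_per_day)

print(f"{ratio_to_successes:.1%}")  # 3.3%
print(f"{share_of_attempts:.1%}")   # 3.2%
```

Either way the result rounds to about 3%.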

MicroStash in production is backed by memcached today. It's not very reliable for the centralauthLoginToken use case: since memcached uses the LRU algorithm, things persist based on popularity, and retention until the expiry time is not guaranteed.

In theory, since tokens are meant to be used almost immediately, LRU eviction is a good fit. But maybe the store gets overwhelmed so evictions happen very quickly, and the 3% slowest-to-use-the-token workflows fail? Not sure if there's a good way to check at scale how long it took between the tokenize and detokenize calls for the failing requests, but maybe checking a few random samples is worth it.

Change #1135993 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/CentralAuth@wmf/1.44.0-wmf.24] CentralAuthTokenManager: Log failures for write operations

https://gerrit.wikimedia.org/r/1135993

Change #1135993 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@wmf/1.44.0-wmf.24] CentralAuthTokenManager: Log failures for write operations

https://gerrit.wikimedia.org/r/1135993

Mentioned in SAL (#wikimedia-operations) [2025-04-14T13:14:52Z] <samtar@deploy1003> Started scap sync-world: Backport for [[gerrit:1135993|CentralAuthTokenManager: Log failures for write operations (T390784)]]

Mentioned in SAL (#wikimedia-operations) [2025-04-14T13:19:47Z] <samtar@deploy1003> samtar, matmarex: Backport for [[gerrit:1135993|CentralAuthTokenManager: Log failures for write operations (T390784)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-04-14T13:26:32Z] <samtar@deploy1003> Finished scap sync-world: Backport for [[gerrit:1135993|CentralAuthTokenManager: Log failures for write operations (T390784)]] (duration: 11m 39s)

@Krinkle suggested in code review that the new logging might be already covered by existing logging for Memcached failures, and so far they indeed seem the same:
https://logstash.wikimedia.org/goto/695a43b57676d790d9de06997b36715e
https://logstash.wikimedia.org/goto/4b1e56ddb8021b2e55ebd9797e27ea28

The good news is that we now know where to look up the historical data for these failures, which show an interesting pattern:
https://logstash.wikimedia.org/goto/fb749ac2b62ccc328ab252ff8a6495b0

image.png (255×1 px, 38 KB)

There were very few errors between 19 and 26 March, which lines up suspiciously well with T385155: 🧭 Northward Datacentre Switchover (March 2025). Apart from that, the growth in error rate matches the SUL3 deployments (https://www.mediawiki.org/wiki/MediaWiki_Platform_Team/SUL3#Phased_rollout).

…but why does the shape of that chart not match the one from T390784#10727428?

image.png (266×1 px, 46 KB)

Tgr added a subscriber: Huji.

I would argue that the phrase "The provided authentication token is either expired or invalid." does not mean "something went wrong with the redirect back to the local wiki" and so a separate message should be created and displayed in such circumstances.

We now get about 1000 of these on a good day, plus the occasional spike:

Screenshot Capture - 2025-04-26 - 16-32-45.png (448×1 px, 52 KB)

logstash

Log message, so Phatality can find this: Failed to set {keyPrefix} token {token}

@Krinkle says we should not store anything in the Microstash for more than a few seconds, and we should read this data out and move it to the session. I'll have a look at that.

The errors here are write errors (seems like we aren't logging reads at all). And they almost all happen on the local domain - there are e.g. 7000 errors for centralauth-sul3-start but only 3 for centralauth-sul3-complete in the last 7 days. Which is pretty surprising since there should be roughly equal volume of both, unless 99.95% of the traffic to the login page bounces - I'm sure a lot does due to scrapers, but that seems extreme.

We have direct data for that, from T377261: Track the number of interrupted SUL3 logins / signups, and unfortunately this is roughly accurate. 8M and 6M login and signup attempts started in the last 24 hours, respectively; 30K and 5K finished. So about 99.6% of login page visits and 99.9% of signup page visits are from scrapers.

(The login dashboard shows about 30K successful and 30K failed login attempts in the last 24 hours, not counting bots still using the local domains. So it's more like 99.3%.)
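The percentages above follow from the quoted counts. A quick check, using the rounded figures from the comment (the helper name `unfinished_share` is made up for this example):

```python
def unfinished_share(started: int, finished: int) -> float:
    """Fraction of started attempts that never finished."""
    return 1 - finished / started


# 8M logins started, 30K finished; 6M signups started, 5K finished.
login = unfinished_share(8_000_000, 30_000)    # about 0.996
signup = unfinished_share(6_000_000, 5_000)    # about 0.999

# Counting the ~30K failed login attempts as real visits too:
login_incl_failed = unfinished_share(8_000_000, 60_000)  # 0.9925, the "more like 99.3%" figure

for label, value in [("login", login), ("signup", signup), ("login incl. failed", login_incl_failed)]:
    print(label, round(value * 100, 2))
```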

Change #1139572 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/CentralAuth@master] SUL3: After redir to shared domain, transfer tokenized data to the session

https://gerrit.wikimedia.org/r/1139572

Change #1147877 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/CentralAuth@master] Revert "CentralAuthTokenManager: Log failures for write operations"

https://gerrit.wikimedia.org/r/1147877

Change #1147905 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/CentralAuth@master] SUL3: Retry local login on failure… (follow-ups)

https://gerrit.wikimedia.org/r/1147905

Change #1139572 abandoned by Bartosz Dziewoński:

[mediawiki/extensions/CentralAuth@master] SUL3: After redir to shared domain, transfer tokenized data to the session

Reason:

This turns out to be more difficult to do than it seemed, so Derick's patch is clearly the better approach.

https://gerrit.wikimedia.org/r/1139572

Change #1147877 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] Revert "CentralAuthTokenManager: Log failures for write operations"

https://gerrit.wikimedia.org/r/1147877

Change #1135511 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] SUL3: Retry local login on failure due to invalid/expired login token

https://gerrit.wikimedia.org/r/1135511

Change #1147905 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] SUL3: Retry local login on failure… (follow-ups)

https://gerrit.wikimedia.org/r/1147905

We should backport the last two patches to both active branches, at the same time, so that we don't get issues where part of the login flow supports the new parameters, and another part doesn't.

Change #1153689 had a related patch set uploaded (by Bartosz Dziewoński; author: Derick Alangi):

[mediawiki/extensions/CentralAuth@wmf/1.45.0-wmf.3] SUL3: Retry local login on failure due to invalid/expired login token

https://gerrit.wikimedia.org/r/1153689

Change #1153690 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/CentralAuth@wmf/1.45.0-wmf.3] SUL3: Retry local login on failure… (follow-ups)

https://gerrit.wikimedia.org/r/1153690

Change #1153691 had a related patch set uploaded (by Bartosz Dziewoński; author: Derick Alangi):

[mediawiki/extensions/CentralAuth@wmf/1.45.0-wmf.4] SUL3: Retry local login on failure due to invalid/expired login token

https://gerrit.wikimedia.org/r/1153691

Change #1153692 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/CentralAuth@wmf/1.45.0-wmf.4] SUL3: Retry local login on failure… (follow-ups)

https://gerrit.wikimedia.org/r/1153692

Change #1153689 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@wmf/1.45.0-wmf.3] SUL3: Retry local login on failure due to invalid/expired login token

https://gerrit.wikimedia.org/r/1153689

Change #1153690 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@wmf/1.45.0-wmf.3] SUL3: Retry local login on failure… (follow-ups)

https://gerrit.wikimedia.org/r/1153690

Change #1153691 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@wmf/1.45.0-wmf.4] SUL3: Retry local login on failure due to invalid/expired login token

https://gerrit.wikimedia.org/r/1153691

Change #1153692 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@wmf/1.45.0-wmf.4] SUL3: Retry local login on failure… (follow-ups)

https://gerrit.wikimedia.org/r/1153692

Mentioned in SAL (#wikimedia-operations) [2025-06-04T20:51:58Z] <cjming@deploy1003> Started scap sync-world: Backport for [[gerrit:rGERRIT11536892efba|SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690|SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691|SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692|SUL3: Retry local login on failure… (follow-ups) (T390784)]]

Mentioned in SAL (#wikimedia-operations) [2025-06-04T20:54:09Z] <cjming@deploy1003> matmarex, cjming: Backport for [[gerrit:rGERRIT11536892efba|SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690|SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691|SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692|SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to

Mentioned in SAL (#wikimedia-operations) [2025-06-04T21:02:41Z] <cjming@deploy1003> Finished scap sync-world: Backport for [[gerrit:rGERRIT11536892efba|SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690|SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691|SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692|SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d

We had a discussion in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1139572 about using the session rather than the microstash to store data during authentication. That wouldn't work: a shared login domain means you can easily have login pages from different wikis open in different tabs, those will need different data (e.g. the anti-session-fixation secret), and session writes will overwrite each other. As it happens, in Q1 we'll be looking at two tasks that can help with that:

So maybe this should be revisited then. The user-impacting part of the problem is mostly solved with the recent patches, but it would still be nice to be able to rely on logins (as seen by local MediaWiki) actually finishing.
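The overwrite problem with session storage can be sketched as follows. This is an illustration only: the `pending_login` key, both function names and the wiki names are hypothetical, and plain dictionaries stand in for the session and the token store.

```python
import secrets
from typing import Dict


def start_login_in_session(session: Dict[str, dict], wiki: str) -> str:
    """Store login state under one fixed session key (hypothetical layout)."""
    secret = secrets.token_hex(8)
    session["pending_login"] = {"wiki": wiki, "secret": secret}
    return secret


def start_login_with_token(store: Dict[str, dict], wiki: str) -> str:
    """Store login state under a fresh per-attempt token instead."""
    token = secrets.token_hex(8)
    store[token] = {"wiki": wiki, "secret": secrets.token_hex(8)}
    return token


# Two tabs in the same browser share one session on the login domain.
session: Dict[str, dict] = {}
secret_tab_a = start_login_in_session(session, "mediawiki.org")
secret_tab_b = start_login_in_session(session, "en.wikipedia.org")
# Tab A's state has been overwritten: only tab B's entry survives.
tab_a_survives = session["pending_login"]["secret"] == secret_tab_a

# With per-attempt tokens, both attempts keep their own state.
store: Dict[str, dict] = {}
token_a = start_login_with_token(store, "mediawiki.org")
token_b = start_login_with_token(store, "en.wikipedia.org")
both_survive = (
    store[token_a]["wiki"] == "mediawiki.org"
    and store[token_b]["wiki"] == "en.wikipedia.org"
)
print(tab_a_survives, both_survive)
```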