
Neverending edge login
Closed, ResolvedPublic

Description

Steps to reproduce:

  • log in on en.wikipedia.org (or another domain) with SUL2 login.
  • refresh the page a couple times, or navigate around

Every request will invoke another edge login (which is like 70 requests, mostly uncacheable).

Didn't see the effect after SUL3 login.

Noticed after deploying rECAU2178b5c46d52: Do not trigger edge login on the shared domain and rECAUe59ee728aca7: Do not initiate central login on the passive central domain, but it can't possibly be related to those two patches since they only reduce the number of edge logins. The authentication dashboard shows no changes, and if this affected all users there would be a massive increase in traffic... no idea what I'm triggering it with.

The final step of a successful autologin (/setCookies) schedules an edge login, so the logical assumption would be that that causes a loop, but edge login doesn't make any /setCookies calls to the same domain it was started from (I verified that in the network console) so I don't see how it would happen.
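
One way to double-check the network-console observation is to filter the recorded request URLs for /setCookies calls whose host matches the page the edge login was started from. This is a minimal sketch, assuming the URLs have been exported as plain strings; the helper name and the sample URLs are illustrative assumptions, not taken from CentralAuth or from the actual capture.

```python
from urllib.parse import urlsplit

def self_targeting_set_cookies(origin_url, request_urls):
    """Return the requests that hit /setCookies on the origin's own host."""
    origin_host = urlsplit(origin_url).hostname
    hits = []
    for url in request_urls:
        parts = urlsplit(url)
        # Match on the host and on the /setCookies path segment.
        if parts.hostname == origin_host and "/setCookies" in parts.path:
            hits.append(url)
    return hits

# Illustrative URLs only (assumed shapes); real edge login URLs differ.
requests = [
    "https://login.wikimedia.org/wiki/Special:CentralAutoLogin/checkLoggedIn?wikiid=enwiki",
    "https://en.wikipedia.org/wiki/Special:CentralAutoLogin/setCookies?type=icon",
    "https://de.wikipedia.org/wiki/Special:CentralAutoLogin/setCookies?type=icon",
]
loop_suspects = self_targeting_set_cookies(
    "https://en.wikipedia.org/wiki/Main_Page", requests
)
```

A non-empty `loop_suspects` list would mean that the edge login did call /setCookies on the domain it started from, which is the condition for the suspected loop.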

Event Timeline

Restricted Application added a subscriber: Aklapper.
Tgr triaged this task as High priority. Mar 17 2025, 10:10 PM
Tgr raised the priority of this task from High to Unbreak Now!.

If this happens often that will result in a catastrophic amount of traffic so we need to figure out quickly what's going on.

Logstash (which I trust more for numbers than Graphite/Prometheus) says there have been 100M edge login attempts in the last 90 days, about 80% of them successful. So about a million per day. Seems quite high, we have a few 100K active editors. Maybe this has actually been going on for a long time? But then I think I would have noticed this during earlier tests...
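
For reference, the back-of-the-envelope rate can be worked out as below. The inputs are the rounded figures quoted above (100M attempts, 90 days, about 80% success), so the results are rough estimates only.

```python
attempts_90_days = 100_000_000  # edge login attempts, rounded figure from Logstash
days = 90
success_ratio = 0.80            # "about 80%" successful

attempts_per_day = attempts_90_days / days
successes_per_day = attempts_per_day * success_ratio

print(f"{attempts_per_day:,.0f} attempts/day, {successes_per_day:,.0f} successful/day")
```

That is a little over a million attempts per day, which matches the estimate above.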

(Emergency action if this is about to happen to more users: set $wgCentralAuthAutoLoginWikis to an empty array to disable edge login entirely. That doesn't affect other kinds of autologin, so the user impact is not that large.)

Change #1128550 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/CentralAuth@master] Do not schedule edge login recursively

https://gerrit.wikimedia.org/r/1128550

Change #1128550 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] Do not schedule edge login recursively

https://gerrit.wikimedia.org/r/1128550

Change #1128560 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/CentralAuth@wmf/1.44.0-wmf.20] Do not schedule edge login recursively

https://gerrit.wikimedia.org/r/1128560

Change #1128560 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@wmf/1.44.0-wmf.20] Do not schedule edge login recursively

https://gerrit.wikimedia.org/r/1128560

Mentioned in SAL (#wikimedia-operations) [2025-03-17T23:43:34Z] <tgr@deploy2002> Started scap sync-world: Backport for [[gerrit:1128560|Do not schedule edge login recursively (T389132)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-17T23:47:26Z] <tgr@deploy2002> tgr: Backport for [[gerrit:1128560|Do not schedule edge login recursively (T389132)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-17T23:59:09Z] <tgr@deploy2002> Finished scap sync-world: Backport for [[gerrit:1128560|Do not schedule edge login recursively (T389132)]] (duration: 15m 35s)

Tgr closed this task as Resolved. Mar 18 2025, 12:01 AM (edited)
Tgr claimed this task.
Tgr added a subscriber: matmarex.

@matmarex figured out what's going on:

  • rECAUb10a647882cd: Add passive central domain to edge login list adds the "passive central domain" to the edge login list. So when you do a SUL2 login on enwiki, there will be an edge login on https://auth.wikimedia.org/enwiki/wiki/...
  • Edge login bounces between the target wiki and the central domain (using a wikiid query parameter when on the central domain to remember where to return to). In this case, it is supposed to bounce between the passive central domain and the active central domain. But when the passive central domain is the SUL3 shared domain, that doesn't have its own wiki ID - the wiki ID will just be enwiki. So the edge login bounce will go enwiki -> auth.wikimedia.org -> loginwiki -> enwiki -> loginwiki -> enwiki. Basically, enwiki (the wiki the user is actually visiting in the browser) started doing an edge login to itself.
  • At the end of a successful edge login, we schedule another edge login for that domain. (I never understood why, but it was done that way since SUL1 times.) So the edge login was rescheduling itself forever.
  • This patch was deployed two weeks ago, but didn't actually take effect because a different bug caused the edge login on the shared domain to fail immediately. We fixed that bug today (not sure when exactly - rOMWCc556b827d134: Re-apply "Fix some SUL3 shared domain settings" would be the logical candidate, but I'm very sure I started noticing the extra edge logins before deploying that patch) so this bug became active.
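
The mechanism described in the list above can be modelled with a small simulation. This is a simplified sketch under stated assumptions, not CentralAuth code: the function names, the target list, and the `guard` flag (standing in for "Do not schedule edge login recursively") are all hypothetical, and the real patch may work differently.

```python
SHARED_DOMAIN = "auth.wikimedia.org"  # passive central domain in this scenario

def resolve_wiki_id(target, origin_wiki):
    """Wiki ID the bounce returns to (hypothetical model of the wikiid parameter)."""
    # The shared domain has no wiki ID of its own, so it resolves to the
    # wiki the user is visiting.
    return origin_wiki if target == SHARED_DOMAIN else target

def count_edge_login_rounds(origin_wiki, targets, page_views, guard):
    """Count how many page views trigger a full round of edge logins."""
    pending = True  # the initial login schedules one round
    rounds = 0
    for _ in range(page_views):
        if not pending:
            continue
        pending = False
        rounds += 1
        for target in targets:
            returns_to = resolve_wiki_id(target, origin_wiki)
            if returns_to == origin_wiki and not guard:
                # /setCookies ran on the origin wiki itself and, on success,
                # scheduled another edge login for it.
                pending = True
    return rounds

# Assumed edge login list: two ordinary wikis plus the passive central domain.
targets = ["dewiki", "commonswiki", SHARED_DOMAIN]
before_fix = count_edge_login_rounds("enwiki", targets, page_views=10, guard=False)
after_fix = count_edge_login_rounds("enwiki", targets, page_views=10, guard=True)
print(before_fix, after_fix)
```

In this model every page view runs another round of edge logins while the self-targeting entry is present and unguarded, and only the first page view does so once rescheduling is suppressed.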

> edge login doesn't make any /setCookies calls to the same domain it was started from (I verified that in the network console)

I guess I must have messed this up somehow; it very much did make a /setCookies call to the same domain.

> The authentication dashboard shows no changes and if this affected all users, there would be a massive increase in traffic..

It took a while (I guess because users had to trigger the first edge login in the normal way) but eventually edge login counts did go up. They are back to normal now.

Screenshot Capture - 2025-03-18 - 01-09-26.png (1,884×1,610 px, 446 KB)