invalid returnUrlToken on MediaWiki.org after logging out
Closed, Resolved · Public

Description

Steps to reproduce

  1. Log in to English Wikipedia
  2. Navigate to https://www.mediawiki.org/, click Log In, get logged in automatically.
  3. Log out on English Wikipedia
  4. Navigate to https://www.mediawiki.org/wiki/Special:Watchlist

Result: invalid returnUrlToken at URL https://login.wikimedia.org/wiki/Special:CentralAutoLogin/checkLoggedIn?type=redirect&returnUrlToken=4f60ede81c0a1248&wikiid=mediawikiwiki&proto=https

Screenshot 2023-10-02 at 13.26.11.png (370×882 px, 19 KB)

Browser/version information
Chrome 117.0.5938.132

Event Timeline

I can't reproduce this with those exact steps (I tried in an incognito window, though I doubt that matters): the browser passes through checkLoggedIn but ends up on the mediawiki.org login page, as expected. I'd imagine this is either a race condition or a failed memcache request, probably the latter; T342201 (MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki}) suggests that is happening at a significant rate.

We have T327046 (Improve (or identify) monitoring for CentralAuth autologins on Wikimedia wikis), which is about making such issues easier to detect.

I saw this error message (or something very similar) 2 times this month when visiting Special:UserLogin in a private window.

I encountered this just now on mobile: while viewing https://en.wikipedia.org/wiki/Main_Page, I found I was apparently logged out. The login link takes me to https://en.wikipedia.org/wiki/Special:CentralAutoLogin/setCookies?type=redirect&returnUrlToken=xxxxx, which leaves me stranded on a blank page with:

invalid returnUrlToken
Return to the previous page.

We have… metrics! :o https://grafana.wikimedia.org/d/000000004/authentication-metrics?viewPanel=38&orgId=1&from=now-2d&to=now

[Attached image: image.png]

If I'm reading that right, this error happens to someone 20-30 times per minute.

Had this several times on both desktop (macOS Safari) and mobile (iOS). Really hard to reproduce on purpose.

Errors almost completely disappeared yesterday:

[Attached image: image.png]

I'm thinking this probably had the same root cause as T342201#9308357, and was fixed by @Joe's change there.

I'm not sure if the remaining rate of errors (~1/minute) is worth looking into.

Krinkle added a project: MW-on-K8s.

I'm neutral on whether to investigate it, but I think we need to treat "invalid returnUrlToken" as a category or source of errors, not as a single concrete error.

To investigate it, I would recommend creating a task with a specific scope, covering only one concrete instance of it where a review of a few samples points to a (likely) common cause. Alternatively, create a task to gather more data in general, so that such a review becomes possible.

This task was about the spike in errors that turned out to be largely due to the mw-on-k8s rollout, which has been resolved.

> I'm not sure if the remaining rate of errors (~1/minute) is worth looking into.

There are about 4500 top-level autologins per hour; about 15 of those are successful, and 15 fail with an invalid token error. Assuming the errors are evenly distributed, that's about 3 token errors per 1000 autologin attempts, which I think is not a big deal. We could rewrite the redirection flow per Tim's suggestion on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/877987/ but given that soon we'll have to rewrite the whole thing anyway, I'm inclined to wait it out.

That raises the question, though: why is the success rate so tiny? A low rate is normal for subresource autologin (which happens on every anon visit, and most of those users are not centrally logged in), but top-level autologin only happens when someone clicks the login link. It seems implausible that 99.7% of those clicks come from users who are not centrally logged in.
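For anyone who wants to double-check the arithmetic, here is a minimal sketch that recomputes the two ratios from the approximate hourly figures quoted above (4500 attempts, 15 successes, 15 token errors). The variable and function names are made up for illustration, and the inputs are the rounded numbers from the comment, not values read from Grafana.

```python
# Approximate per-hour figures as quoted in the comment above (assumed, rounded).
autologins_per_hour = 4500
successes_per_hour = 15
token_errors_per_hour = 15


def per_thousand(count: int, total: int) -> float:
    """Return how many events occur per 1000 attempts."""
    return 1000 * count / total


# Token errors per 1000 top-level autologin attempts.
error_rate = per_thousand(token_errors_per_hour, autologins_per_hour)

# Share of attempts that do not end in a successful autologin.
not_successful_share = 1 - successes_per_hour / autologins_per_hour

print(f"token errors per 1000 attempts: {error_rate:.1f}")
print(f"attempts that do not succeed: {not_successful_share:.1%}")
```

With these inputs the script prints roughly 3.3 errors per 1000 attempts and 99.7% non-successful attempts, which matches the figures used in the discussion. Note that the non-successful share also includes the attempts that failed with a token error, so it is only an upper bound on the share of users who were not centrally logged in.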