Page MenuHomePhabricator

Users getting logged-out during minor network glitches
Closed, ResolvedPublic

Description

(I'm not entirely sure if this is CentralAuth, AuthManager or some other component and which project tag should be used for it; filing under General/Unknown for now and until we figure it out)

During today's (2016-08-19) network maintenance in eqiad and due to a router OS bug, we had two network partitioning events that specifically affected row D, i.e. 1/4th of our datacenter. The first lasted for a few minutes; the second lasted probably for a few milliseconds.

During both, we got reports from users that they were seeing exceptions, could not login, and more importantly, of users being silently logged-out(!).

Example 1:

15:19 < yannf> https://fr.wikisource.org/w/index.php?title=Livre:Reclus_-_La_Commune_de_Paris_au_jour_le_jour.djvu&action=purge
15:20 < yannf> it works now, but...
15:20 < yannf> it asks for confirmation
15:20 < yannf> which it didn't before

(we ask anonymous users for confirmation on action=purge, but not logged-in users)

Example 2:

15:45 < Revent> [V7WtjQpAADwAAZV-nOoAAAAN] 2016-08-18 12:44:01: Fatal exception of type "MWException" <- when attempting to login to enwiki

which seems to be this:

2016-08-18 12:44:01 [V7WtjQpAADwAAZV-nOoAAAAN] mw1265 enwiki 1.28.0-wmf.14 exception ERROR: [V7WtjQpAADwAAZV-nOoAAAAN] /w/index.php?title=Special:UserLogin&returnto=(censored)   MWException from line 4001 of /srv/mediawiki/php-1.28.0-wmf.14/includes/user/User.php: CAS update failed on user_touched for user ID '(censored)' (read from slave); the version of the user to be saved is older than the current version. {"exception_id":"V7WtjQpAADwAAZV-nOoAAAAN"} 
[Exception MWException] (/srv/mediawiki/php-1.28.0-wmf.14/includes/user/User.php:4001) CAS update failed on user_touched for user ID '(censored)' (read from slave); the version of the user to be saved is older than the current version.
  #0 /srv/mediawiki/php-1.28.0-wmf.14/extensions/VisualEditor/VisualEditor.hooks.php(1012): User->saveSettings()
  #1 /srv/mediawiki/php-1.28.0-wmf.14/includes/Hooks.php(195): VisualEditorHooks::onUserLoggedIn(User)
  #2 /srv/mediawiki/php-1.28.0-wmf.14/includes/auth/AuthManager.php(2351): Hooks::run(string, array)
  #3 /srv/mediawiki/php-1.28.0-wmf.14/includes/auth/AuthManager.php(655): MediaWiki\Auth\AuthManager->setSessionDataForUser(User, boolean)
  #4 /srv/mediawiki/php-1.28.0-wmf.14/includes/auth/AuthManager.php(349): MediaWiki\Auth\AuthManager->continueAuthentication(array)
  #5 /srv/mediawiki/php-1.28.0-wmf.14/includes/specialpage/AuthManagerSpecialPage.php(354): MediaWiki\Auth\AuthManager->beginAuthentication(array, string)
  #6 /srv/mediawiki/php-1.28.0-wmf.14/includes/specialpage/AuthManagerSpecialPage.php(483): AuthManagerSpecialPage->performAuthenticationStep(string, array)
  #7 /srv/mediawiki/php-1.28.0-wmf.14/includes/htmlform/HTMLForm.php(635): AuthManagerSpecialPage->handleFormSubmit(array, VFormHTMLForm)
  #8 /srv/mediawiki/php-1.28.0-wmf.14/includes/specialpage/AuthManagerSpecialPage.php(417): HTMLForm->trySubmit()
  #9 /srv/mediawiki/php-1.28.0-wmf.14/includes/specialpage/LoginSignupSpecialPage.php(292): AuthManagerSpecialPage->trySubmit()
  #10 /srv/mediawiki/php-1.28.0-wmf.14/includes/specialpage/SpecialPage.php(505): LoginSignupSpecialPage->execute(NULL)
  #11 /srv/mediawiki/php-1.28.0-wmf.14/includes/specialpage/SpecialPageFactory.php(598): SpecialPage->run(NULL)
  #12 /srv/mediawiki/php-1.28.0-wmf.14/includes/MediaWiki.php(283): SpecialPageFactory::executePath(Title, RequestContext)
  #13 /srv/mediawiki/php-1.28.0-wmf.14/includes/MediaWiki.php(749): MediaWiki->performRequest()
  #14 /srv/mediawiki/php-1.28.0-wmf.14/includes/MediaWiki.php(521): MediaWiki->main()
  #15 /srv/mediawiki/php-1.28.0-wmf.14/index.php(43): MediaWiki->run()
  #16 /srv/mediawiki/w/index.php(3): include(string)
  #17 {main}

Getting exceptions and/or other errors when these events happen is kind of expected, if it's not unavoidable. Silently logging-out users is not and as far as I know this odd behavior is new. From the backtrace above, it looks like it may be AuthManager-related, but someone would have to dig a little deeper into all that and test MW authentication code's resiliency to such availability events.

Event Timeline

SessionManager unpersists the session if it does not match the stored session data (ie. it deletes the session/token cookies if they do not match the redis data). As far as I can see that should only happen for users who did not click "remember me" (otherwise they can be authenticated by user_token even if the session storage is inaccessible).

A user noticing that purge works strangely but not noticing that they have been logged out means that either

  • they just did not notice that the user bar is gone. Unlikely but possible I guess.
  • MediaWiki sees them as logged-out but generates a logged-in skin for them. I can't see that happening.
  • a CSRF token error? that could be caused by session loss / session read failure, but I don't think we require a token from logged-in users.
  • OTOH, IIRC, we don't allow purge on GET anymore and do some JS trick to turn it into a POST. Ori probably knows more about that. Maybe that process involves a token or some other session dependency. Or maybe it just failed in some other way due to a network error and fell back to the non-JS behavior of showing the form?

No clue about the CAS error. Those do happen naturally (but rarely) when two processes try to update the user record the same time, or when a process that tries to update the user record is working from outdated user data. I guess if the user object cannot be cached due to network issues, that could make them more frequent?

Change 305595 had a related patch set uploaded (by Aaron Schulz):
Defer preferences update and avoid CAS errors in onUserLoggedIn()

https://gerrit.wikimedia.org/r/305595

As for "example 1", that is normal behavior now (the form must be POSTed whether logged in or not).

No clue about the CAS error. Those do happen naturally (but rarely) when two processes try to update the user record the same time, or when a process that tries to update the user record is working from outdated user data. I guess if the user object cannot be cached due to network issues, that could make them more frequent?

"No clue" and "happens naturally" don't sound very reassuring :) Any ideas on how to proceed here? We should get to the bottom of this.

For what it's worth:

faidon@fluorine:/a/mw-log$ grep -c 'CAS update failed' exception.log 
560
faidon@fluorine:/a/mw-log$ grep 'CAS update failed' exception.log | grep -c '2016-08-18 12:[01]'
146

12:05 - 12:13 is when the first network partition that I mentioned above happened, so it definitely looks like there is a correlation here.

Could be cache purges not making it to some mc servers and stale cached version of user setting loading and saved (which will error out). In any case, the patch above should avoid the CAS error problem.

Change 305595 merged by jenkins-bot:
Defer preferences update and avoid CAS errors in onUserLoggedIn()

https://gerrit.wikimedia.org/r/305595

(we ask anonymous users for confirmation on action=purge, but not logged-in users)

That's no longer true as of 1.28.0-wmf.15, see rMWc2ce6a1b6085: Require POST for action=purge in PurgeAction.

Krinkle subscribed.

Tagging us to follow-up and confirm this is gone.

@faidon Are there any known timestamps in the last 30 days during which we know a similar situation existed in the data centre, to help correlate with error logs?

Imarlier claimed this task.
Imarlier subscribed.

Closing as not worth further effort at this point. If we see reappearance of these types of issues in the future, we can address at that point.

It's been a couple of years since I filed this and I don't remember much since, so unfortunately I don't have any more insight at this point. These kind of widespread network events are very rare and there are no such outages recently I'm afraid. We could figure out ways to simulate them from e.g. mwdebug though, although I doubt that anyone has the time to investigate this in such depth, so I don't particularly disagree with resolving this task instead.

That said, I did a quick count (and hope I didn't mess it up), and it seems like the exception logs are filled with CAS update failed on user_touched error messages, with about ~1000/day being emitted, which currently accounts to ~30-40% of all exception log lines. I'm not sure if these are real errors or just noise, but regardless, it would be nice to get to a healthy baseline so that we can use those error messages as a meaningful signal of an operational issue. Perhaps this belongs to a separate task, though; thoughts, @Krinkle, @Imarlier et al?

The CAS error is basically edit conflict for user changes (probably the login token as the other fields don't change much); some of that happening is normal. 1000/day is more than I'd have expected but still not that much for a hundred thousand users.

There is no reason to log it as an exception though, that was probably only done as a pre-structured-logging way of getting a stack trace. -> https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/446593

I note that patch won't fix the "CAS update failed on user_touched" message faidon mentioned, only the rather few "CAS update failed on gu_cas_token" messages.