Could not find local user data for {Username}@{wiki}
Closed, ResolvedPublic

Description

This is a recurring bug with users not being able to log in, due to inconsistent state between CentralAuth and local wikis (specifically the local user table row missing in the latter).

Problem description: The user gets an exception whenever they try to login or get an exception when opening the CentralAuth page (of the user): https://en.wikipedia.org/wiki/Special:CentralAuth/USERNAME

For user which this affects

A possible fix: Enter your username at testwiki:Special:CentralAuth. Look for text of form Could not find local user data for username@X, where username is your name and X is the name of a wiki. Try to log in at that wiki. For example, if X is bewiki then enter be: in the search box to get to the be: wiki, or look for the wiki at Special:SiteMatrix. The language may be foreign but the login link should be at the top right or top left, or you can enter Special:UserLogin in the search box.

Notes

This description was edited to reflect the actual issue, and not only about the first affected users situation. ~@Josve05a

Related Objects

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 298554 had a related patch set uploaded (by Mattflaschen):
Troubleshoot why Echo is still triggering CA failures

https://gerrit.wikimedia.org/r/298554

Change 298553 merged by jenkins-bot:
Troubleshoot why Echo is still triggering CA failures

https://gerrit.wikimedia.org/r/298553

Change 298554 merged by jenkins-bot:
Troubleshoot why Echo is still triggering CA failures

https://gerrit.wikimedia.org/r/298554

Mentioned in SAL [2016-07-12T19:30:11Z] <mattflaschen@tin> Synchronized php-1.28.0-wmf.8/extensions/Echo/includes/ForeignWikiRequest.php: T119736: T140144: Troubleshoot why Echo is still triggering CA failures (duration: 00m 39s)

The Echo fix didn't work as intended, so I put up a troubleshooting patch. Please follow T140144: Echo triggering CentralAuth "Can only obtain a centralauthtoken when using CentralAuth sessions" error for the Echo-specific part going forward.

Michgrig removed a subscriber: Michgrig.Jul 12 2016, 9:04 PM
jayvdb added a subscriber: jayvdb.Jul 13 2016, 12:24 AM

Reading through this task, we've failed in a number of unacceptable ways here. Once the immediate issue is resolved and we're no longer throwing thousands of exceptions per day, this task should have a full postmortem to ensure that we identify and learn from what went wrong here. I'll file a separate task about doing this.

Agreed. I've set a meeting for 20-July-2016 at 11:30-12:30 SF time (subject to change, but wanted to get something on people's calendars before their schedules fill up). Anyone who wants to join, please email me directly well ahead of time so I can add you to the Hangout.

Special:CentralAuth for Acastrohowell gives Fatal exception of type "Exception" after getting To avoid creating high replication lag, this transaction was aborted because the write duration (6.1167709827423) exceeded the 5 seconds limit. If you are changing many items at once, try doing multiple smaller operations instead. when attempting to create at enwiki.

Error was "Could not find local user data for Acastrohowell@enwiki" which interestingly was the "new" wiki for the account. Manually deleted the dangling reference and https://en.wikipedia.org/w/index.php?title=Special%3ACentralAuth&target=Acastrohowell works now.

Not sure if this will be helpful to you or not, but all of the exceptions that I have reported in this task were new accounts created via Special:CreateAccount at enwiki. @ori provided a trace for the first one that I encountered and reported on IRC which did not point to Echo.

Anomie added a subscriber: aaron.Jul 13 2016, 5:37 PM

Not sure if this will be helpful to you or not, but all of the exceptions that I have reported in this task were new accounts created via Special:CreateAccount at enwiki. @ori provided a trace for the first one that I encountered and reported on IRC which did not point to Echo.

The real cause of the problem for Optimus24 was very likely

exception.log-20160629.gz
2016-06-29 04:26:37 [V3NN9ApAIDEAAAuRbAgAAAAL] mw1179 enwiki 1.28.0-wmf.7 exception ERROR: [V3NN9ApAIDEAAAuRbAgAAAAL] /w/index.php?title=Special:CreateAccount&returnto=[REDACTED]   DBTransactionError from line 235 of /srv/mediawiki/php-1.28.0-wmf.7/includes/db/loadbalancer/LBFactory.php: To avoid creating high replication lag, this transaction was aborted because the write duration (6.002995967865) exceeded the 5 seconds limit.
If you are changing many items at once, try doing multiple smaller operations instead. {"exception_id":"V3NN9ApAIDEAAAuRbAgAAAAL"} 
[Exception DBTransactionError] (/srv/mediawiki/php-1.28.0-wmf.7/includes/db/loadbalancer/LBFactory.php:235) To avoid creating high replication lag, this transaction was aborted because the write duration (6.002995967865) exceeded the 5 seconds limit.
If you are changing many items at once, try doing multiple smaller operations instead.
  #0 /srv/mediawiki/php-1.28.0-wmf.7/includes/db/loadbalancer/LoadBalancer.php(1310): Closure$LBFactory::commitMasterChanges(DatabaseMysqli)
  #1 /srv/mediawiki/php-1.28.0-wmf.7/includes/db/loadbalancer/LBFactory.php(237): LoadBalancer->forEachOpenConnection(Closure$LBFactory::commitMasterChanges;240668706)
  #2 /srv/mediawiki/php-1.28.0-wmf.7/includes/db/loadbalancer/LBFactoryMulti.php(411): Closure$LBFactory::commitMasterChanges#2(LoadBalancer)
  #3 /srv/mediawiki/php-1.28.0-wmf.7/includes/db/loadbalancer/LBFactory.php(238): LBFactoryMulti->forEachLB(Closure$LBFactory::commitMasterChanges#2;240668706)
  #4 /srv/mediawiki/php-1.28.0-wmf.7/includes/MediaWiki.php(562): LBFactory->commitMasterChanges(string, array)
  #5 /srv/mediawiki/php-1.28.0-wmf.7/includes/MediaWiki.php(540): MediaWiki::preOutputCommit(RequestContext)
  #6 /srv/mediawiki/php-1.28.0-wmf.7/includes/MediaWiki.php(753): MediaWiki->doPreOutputCommit()
  #7 /srv/mediawiki/php-1.28.0-wmf.7/includes/MediaWiki.php(520): MediaWiki->main()
  #8 /srv/mediawiki/php-1.28.0-wmf.7/index.php(43): MediaWiki->run()
  #9 /srv/mediawiki/w/index.php(3): include(string)
  #10 {main}

The immediate culprit for this one is @aaron's $wgMaxUserDBWriteDuration arbitrarily rolling back the creation of the user because the request took more than 5 seconds to process for some unknown reason. Mintudazel, Mtrmananes, and Agusteknik2016 that you reported all look like the same thing.

For anyone else with access to the logs who wants to try to track down causes, this is the procedure:

  1. Determine the wiki with the problem, i.e. the "{wiki}" in "Could not find local user data for {Username}@{wiki}".
  2. Find the request ID for the creation of the user on the wiki. Generally, this will be in CentralAuth.log as "Attaching local user {Username}@{wiki}"
  3. Look for other log messages, particularly exception logs, for that request ID.
    1. If the exception found would have caused a database rollback, you've probably found the culprit.
    2. If the exception was before yesterday and the creation in step 2 was "by 'login'" chances are pretty good that https://gerrit.wikimedia.org/r/298531 fixed it already. Look for more recent examples.
bd808 added a comment.EditedJul 13 2016, 5:46 PM

Not sure if this will be helpful to you or not, but all of the exceptions that I have reported in this task were new accounts created via Special:CreateAccount at enwiki. @ori provided a trace for the first one that I encountered and reported on IRC which did not point to Echo.

The stack traces for the end user visible "Could not find local user data for ..." exceptions actually aren't very interesting. They will all lead back to one of two or three methods in CentralAuthUser where the data from the localuser table is checked against the attached wiki user table.

The interesting traces are the ones that caused the transaction creating the user row to be rolled back after the CentralAuth association was made initially. These however are pretty tricky to find and actually correlate with the creation of the localuser record. I mostly stumbled upon the correlation mentioned in T119736#2451220 that lead to @Catrope and @Mattflaschen-WMF's work in T140144.

@Anomie's patch at rMW83c66caa0831: AuthManager: Commit transaction after auto-creating a user should help quite a bit with narrowing the types of problems that can cause the local wiki's User record to be rolled back. It forces a transaction commit just before the end of AuthManager::autoCreateUser. The window for a rollback error is now limited to unhandled exceptions and fatal errors that happen via:

  • autoCreatedAccount calls on primary and secondary authentication providers
  • AuthPluginAutoCreate and LocalUserCreated hook handlers
  • User::saveSettings
  • User::addWatch
  • ManualLogEntry::insert

The Flow related bug was causing an unhandled exception during User::saveSettings via a UserSaveSettings hook call.

The window for a rollback error is now limited to unhandled exceptions and fatal errors that happen via:

Note this is the window for the error when it happens during auto-creation due to CentralAuth sessions or the like. There's also auto-creation when logging in (which the patch might help with too) and normal creations.

The Flow related bug was causing an unhandled exception during User::saveSettings via a UserSaveSettings hook call.

I think you mean Echo.

The immediate culprit for this one is @aaron's $wgMaxUserDBWriteDuration arbitrarily rolling back the creation of the user because the request took more than 5 seconds to process for some unknown reason.

This appears to still be happening, BTW. Looking at timestamps on log messages, whatever is being slow is somewhere in this code. Unfortunately, that's not hugely helpful, since it includes a lot of stuff:

  • A 'UserSaveOptions' hook call
  • A 'LocalUserCreated' hook call
  • Another 'UserSaveOptions' hook call
  • A 'UserSaveSettings' hook call
  • Publishing a log entry (which unfortunately gets rolled back too)
  • Various SecondaryAuthenticationProviders' calls, including CampaignsSecondaryAuthenticationProvider, MediaWiki\Auth\ResetPasswordSecondaryAuthenticationProvider, and MediaWiki\Auth\EmailNotificationSecondaryAuthenticationProvider.

Change 298817 had a related patch set uploaded (by Anomie):
Add timing data logging for T119736

https://gerrit.wikimedia.org/r/298817

Overview of logs for the Echo part: P3422

Keep in mind the dates of the files are the date of log rotation, not necessarily date of occurrence.

Change 298820 had a related patch set uploaded (by Anomie):
Add timing data logging for T119736

https://gerrit.wikimedia.org/r/298820

Change 298821 had a related patch set uploaded (by Anomie):
Add timing data logging for T119736

https://gerrit.wikimedia.org/r/298821

Change 298820 merged by jenkins-bot:
Add timing data logging for T119736

https://gerrit.wikimedia.org/r/298820

Change 298821 merged by jenkins-bot:
Add timing data logging for T119736

https://gerrit.wikimedia.org/r/298821

Mentioned in SAL [2016-07-13T19:23:03Z] <anomie@tin> Synchronized php-1.28.0-wmf.8/includes/auth/AuthManager.php: Add timing data logging for T119736 (duration: 00m 27s)

Mentioned in SAL [2016-07-13T19:23:56Z] <anomie@tin> Synchronized php-1.28.0-wmf.10/includes/auth/AuthManager.php: Add timing data logging for T119736 (duration: 00m 28s)

Change 298823 had a related patch set uploaded (by Aaron Schulz):
[WIP] Revert "Wrap CentralAuthUser::attach in an explicit transaction"

https://gerrit.wikimedia.org/r/298823

The immediate culprit for this one is @aaron's $wgMaxUserDBWriteDuration arbitrarily rolling back the creation of the user because the request took more than 5 seconds to process for some unknown reason. Mintudazel, Mtrmananes, and Agusteknik2016 that you reported all look like the same thing.

Did this timeout change recently, or is it just that something recently started exceeding the 5 second limit?

Anomie added a comment.EditedJul 13 2016, 8:35 PM

Mentioned in SAL [2016-07-13T19:23:03Z] <anomie@tin> Synchronized php-1.28.0-wmf.8/includes/auth/AuthManager.php: Add timing data logging for T119736 (duration: 00m 27s)

Results: There were four hits so far, V4abdApAEKsAAHuBLOwAAACB, V4afkgpAMEwAAAXlT5kAAAAS, V4agfgpAADsAASl1QIMAAAAT, and V4abdApAEKsAAHuBLOwAAACB . For all four, it's $user->addToDatabase() that took a long time.

For anyone else with access to the logs who wants to try to track down causes, this is the procedure:

I took a stab at this for a few, but I wasn't able to find any of the exceptions. Maybe I'm doing something wrong:

12016-07-13 21:05:42 [V4atJgpAMCIAAFtsg8kAAACM] mw1202 enwiki 1.28.0-wmf.8 exception ERROR: [V4atJgpAMCIAAFtsg8kAAACM] /w/api.php Exception from line 2340 of /srv/mediawiki/php-1.28.0-wmf.8/extensions/CentralAuth/includes/CentralAuthUser.php: Could not find local user data for Clauricekoi@metawiki {"exception_id":"V4atJgpAMCIAAFtsg8kAAACM"}
2[Exception Exception] (/srv/mediawiki/php-1.28.0-wmf.8/extensions/CentralAuth/includes/CentralAuthUser.php:2340) Could not find local user data for Clauricekoi@metawiki
3 #0 /srv/mediawiki/php-1.28.0-wmf.8/extensions/CentralAuth/includes/CentralAuthUser.php(2241): CentralAuthUser->localUserData(string)
4 #1 /srv/mediawiki/php-1.28.0-wmf.8/extensions/CentralAuth/includes/CentralAuthUser.php(2215): CentralAuthUser->queryAttached()
5 #2 /srv/mediawiki/wmf-config/CommonSettings.php(403): CentralAuthUser->getLocalGroups()
6 #3 /srv/mediawiki/php-1.28.0-wmf.8/includes/Hooks.php(195): Closure$wmfLoadInitialiseSettings#2(User, array, string)
7 #4 /srv/mediawiki/php-1.28.0-wmf.8/includes/password/UserPasswordPolicy.php(150): Hooks::run(string, array)
8 #5 /srv/mediawiki/php-1.28.0-wmf.8/includes/password/UserPasswordPolicy.php(75): UserPasswordPolicy->getPoliciesForUser(User, string)
9 #6 /srv/mediawiki/php-1.28.0-wmf.8/includes/user/User.php(1021): UserPasswordPolicy->checkUserPassword(User, string, string)
10 #7 /srv/mediawiki/php-1.28.0-wmf.8/includes/auth/AbstractPasswordPrimaryAuthenticationProvider.php(104): User->checkPasswordValidity(string)
11 #8 /srv/mediawiki/php-1.28.0-wmf.8/includes/auth/TemporaryPasswordPrimaryAuthenticationProvider.php(142): MediaWiki\Auth\AbstractPasswordPrimaryAuthenticationProvider->checkPasswordValidity(string, string)
12 #9 /srv/mediawiki/php-1.28.0-wmf.8/includes/auth/AuthManager.php(419): MediaWiki\Auth\TemporaryPasswordPrimaryAuthenticationProvider->beginPrimaryAuthentication(array)
13 #10 /srv/mediawiki/php-1.28.0-wmf.8/includes/auth/AuthManager.php(349): MediaWiki\Auth\AuthManager->continueAuthentication(array)
14 #11 /srv/mediawiki/php-1.28.0-wmf.8/includes/api/ApiClientLogin.php(91): MediaWiki\Auth\AuthManager->beginAuthentication(array, string)
15 #12 /srv/mediawiki/php-1.28.0-wmf.8/includes/api/ApiMain.php(1373): ApiClientLogin->execute()
16 #13 /srv/mediawiki/php-1.28.0-wmf.8/includes/api/ApiMain.php(469): ApiMain->executeAction()
17 #14 /srv/mediawiki/php-1.28.0-wmf.8/includes/api/ApiMain.php(440): ApiMain->executeActionWithErrorHandling()
18 #15 /srv/mediawiki/php-1.28.0-wmf.8/api.php(83): ApiMain->execute()
19 #16 /srv/mediawiki/w/api.php(3): include(string)
20 #17 {main}
21---
22/a/mw-log/CentralAuth.log:2016-07-13 21:04:56 [V4as9QpAIEUAAHtUYygAAAAC] mw1165 metawiki 1.28.0-wmf.10 CentralAuth INFO: Attaching local user Clauricekoi@metawiki by 'login'
23---
24
25mattflaschen@fluorine:~$ zgrep 'V4as9QpAIEUAAHtUYygAAAAC' /a/mw-log/exception.log /a/mw-log/archive/exception.log-201607* /a/mw-log/fatal.log /a/mw-log/archive/fatal.log-201607* /a/mw-log/hhvm.log /a/mw-log/archive/hhvm.log-201607*
26mattflaschen@fluorine:~$
272016-07-13 20:36:41 [V4amWQpAAEkAAA-hi08AAADH] mw1278 enwiki 1.28.0-wmf.8 exception ERROR: [V4amWQpAAEkAAA-hi08AAADH] /w/api.php Exception from line 2340 of /srv/mediawiki/php-1.28.0-wmf.8/extensions/CentralAuth/includes/CentralAuthUser.php: Could not find local user data for Imasledgehammer@mediawikiwiki {"exception_id":"V4amWQpAAEkAAA-hi08AAADH"}
28[Exception Exception] (/srv/mediawiki/php-1.28.0-wmf.8/extensions/CentralAuth/includes/CentralAuthUser.php:2340) Could not find local user data for Imasledgehammer@mediawikiwiki
29 #0 /srv/mediawiki/php-1.28.0-wmf.8/extensions/CentralAuth/includes/CentralAuthUser.php(2241): CentralAuthUser->localUserData(string)
30 #1 /srv/mediawiki/php-1.28.0-wmf.8/extensions/CentralAuth/includes/CentralAuthUser.php(2215): CentralAuthUser->queryAttached()
31 #2 /srv/mediawiki/wmf-config/CommonSettings.php(403): CentralAuthUser->getLocalGroups()
32 #3 /srv/mediawiki/php-1.28.0-wmf.8/includes/Hooks.php(195): Closure$wmfLoadInitialiseSettings#2(User, array, string)
33 #4 /srv/mediawiki/php-1.28.0-wmf.8/includes/password/UserPasswordPolicy.php(150): Hooks::run(string, array)
34 #5 /srv/mediawiki/php-1.28.0-wmf.8/includes/password/UserPasswordPolicy.php(75): UserPasswordPolicy->getPoliciesForUser(User, string)
35 #6 /srv/mediawiki/php-1.28.0-wmf.8/includes/user/User.php(1021): UserPasswordPolicy->checkUserPassword(User, string, string)
36 #7 /srv/mediawiki/php-1.28.0-wmf.8/includes/auth/AbstractPasswordPrimaryAuthenticationProvider.php(104): User->checkPasswordValidity(string)
37 #8 /srv/mediawiki/php-1.28.0-wmf.8/includes/auth/TemporaryPasswordPrimaryAuthenticationProvider.php(142): MediaWiki\Auth\AbstractPasswordPrimaryAuthenticationProvider->checkPasswordValidity(string, string)
38 #9 /srv/mediawiki/php-1.28.0-wmf.8/includes/auth/AuthManager.php(419): MediaWiki\Auth\TemporaryPasswordPrimaryAuthenticationProvider->beginPrimaryAuthentication(array)
39 #10 /srv/mediawiki/php-1.28.0-wmf.8/includes/auth/AuthManager.php(349): MediaWiki\Auth\AuthManager->continueAuthentication(array)
40 #11 /srv/mediawiki/php-1.28.0-wmf.8/includes/api/ApiClientLogin.php(91): MediaWiki\Auth\AuthManager->beginAuthentication(array, string)
41 #12 /srv/mediawiki/php-1.28.0-wmf.8/includes/api/ApiMain.php(1373): ApiClientLogin->execute()
42 #13 /srv/mediawiki/php-1.28.0-wmf.8/includes/api/ApiMain.php(469): ApiMain->executeAction()
43 #14 /srv/mediawiki/php-1.28.0-wmf.8/includes/api/ApiMain.php(440): ApiMain->executeActionWithErrorHandling()
44 #15 /srv/mediawiki/php-1.28.0-wmf.8/api.php(83): ApiMain->execute()
45 #16 /srv/mediawiki/w/api.php(3): include(string)
46 #17 {main}
47
48---
49/a/mw-log/CentralAuth.log:2016-07-13 20:36:21 [V4amQQpAIEUAAFaHQzYAAABS] mw1165 mediawikiwiki 1.28.0-wmf.10 CentralAuth INFO: Attaching local user Imasledgehammer@mediawikiwiki by 'login'
50---
51
52mattflaschen@fluorine:~$ zgrep 'V4amQQpAIEUAAFaHQzYAAABS' /a/mw-log/exception.log /a/mw-log/archive/exception.log-20160*
53mattflaschen@fluorine:~$
54
55mattflaschen@fluorine:~$ zgrep 'V4amQQpAIEUAAFaHQzYAAABS' /a/mw-log/fatal.log /a/mw-log/archive/fatal.log-20160*
56mattflaschen@fluorine:~$
572016-07-13 20:59:42 [V4arvQpAMCIAAFhvDqkAAAES] mw1202 enwiki 1.28.0-wmf.8 exception ERROR: [V4arvQpAMCIAAFhvDqkAAAES] /w/api.php Exception from line 2340 of /srv/mediawiki/php-1.28.0-wmf.8/extensions/CentralAuth/includes/CentralAuthUser.php: Could not find local user data for Sudip541@metawiki {"exception_id":"V4arvQpAMCIAAFhvDqkAAAES"}
58[Exception Exception] (/srv/mediawiki/php-1.28.0-wmf.8/extensions/CentralAuth/includes/CentralAuthUser.php:2340) Could not find local user data for Sudip541@metawiki
59 #0 /srv/mediawiki/php-1.28.0-wmf.8/extensions/CentralAuth/includes/CentralAuthUser.php(2241): CentralAuthUser->localUserData(string)
60 #1 /srv/mediawiki/php-1.28.0-wmf.8/extensions/CentralAuth/includes/CentralAuthUser.php(2215): CentralAuthUser->queryAttached()
61 #2 /srv/mediawiki/wmf-config/CommonSettings.php(403): CentralAuthUser->getLocalGroups()
62 #3 /srv/mediawiki/php-1.28.0-wmf.8/includes/Hooks.php(195): Closure$wmfLoadInitialiseSettings#2(User, array, string)
63 #4 /srv/mediawiki/php-1.28.0-wmf.8/includes/password/UserPasswordPolicy.php(150): Hooks::run(string, array)
64 #5 /srv/mediawiki/php-1.28.0-wmf.8/includes/password/UserPasswordPolicy.php(75): UserPasswordPolicy->getPoliciesForUser(User, string)
65 #6 /srv/mediawiki/php-1.28.0-wmf.8/includes/user/User.php(1021): UserPasswordPolicy->checkUserPassword(User, string, string)
66 #7 /srv/mediawiki/php-1.28.0-wmf.8/includes/auth/AbstractPasswordPrimaryAuthenticationProvider.php(104): User->checkPasswordValidity(string)
67 #8 /srv/mediawiki/php-1.28.0-wmf.8/includes/auth/TemporaryPasswordPrimaryAuthenticationProvider.php(142): MediaWiki\Auth\AbstractPasswordPrimaryAuthenticationProvider->checkPasswordValidity(string, string)
68 #9 /srv/mediawiki/php-1.28.0-wmf.8/includes/auth/AuthManager.php(419): MediaWiki\Auth\TemporaryPasswordPrimaryAuthenticationProvider->beginPrimaryAuthentication(array)
69 #10 /srv/mediawiki/php-1.28.0-wmf.8/includes/auth/AuthManager.php(349): MediaWiki\Auth\AuthManager->continueAuthentication(array)
70 #11 /srv/mediawiki/php-1.28.0-wmf.8/includes/api/ApiClientLogin.php(91): MediaWiki\Auth\AuthManager->beginAuthentication(array, string)
71 #12 /srv/mediawiki/php-1.28.0-wmf.8/includes/api/ApiMain.php(1373): ApiClientLogin->execute()
72 #13 /srv/mediawiki/php-1.28.0-wmf.8/includes/api/ApiMain.php(469): ApiMain->executeAction()
73 #14 /srv/mediawiki/php-1.28.0-wmf.8/includes/api/ApiMain.php(440): ApiMain->executeActionWithErrorHandling()
74 #15 /srv/mediawiki/php-1.28.0-wmf.8/api.php(83): ApiMain->execute()
75 #16 /srv/mediawiki/w/api.php(3): include(string)
76 #17 {main}
77
78---
79/a/mw-log/CentralAuth.log:2016-07-13 20:59:41 [V4artgpAAE4AAbQHqTMAAAAP] mw1299 metawiki 1.28.0-wmf.10 CentralAuth INFO: Attaching local user Sudip541@metawiki by 'login'
80---
81mattflaschen@fluorine:~$ zgrep 'V4artgpAAE4AAbQHqTMAAAAP' /a/mw-log/exception.log /a/mw-log/archive/exception.log-201607* /a/mw-log/fatal.log /a/mw-log/archive/fatal.log-201607* /a/mw-log/hhvm.log /a/mw-log/archive/hhvm.log-201607*
82mattflaschen@fluorine:~$

At least Clauricekoi doesn't seem to have been a failure. Instead the CentralAuthCreateLocalAccountJob on metawiki started at 21:04:56, the errors showed up from 21:04:57 to 21:05:42, and the CentralAuthCreateLocalAccountJob completed on 21:06:00. It seems like something in this block of code (based on the account attachment in CentralAuthPrimaryAuthenticationProvider::autoCreatedAccount() being logged at 21:04:56 and the on-wiki log message at 21:06) took a strangely long amount of time; this might be related to T121161.

@bd808: What would https://gerrit.wikimedia.org/r/#/c/297936/ do in this situation?

At least Clauricekoi doesn't seem to have been a failure. Instead the CentralAuthCreateLocalAccountJob on metawiki started at 21:04:56, the errors showed up from 21:04:57 to 21:05:42, and the CentralAuthCreateLocalAccountJob completed on 21:06:00. It seems like something in this block of code (based on the account attachment in CentralAuthPrimaryAuthenticationProvider::autoCreatedAccount() being logged at 21:04:56 and the on-wiki log message at 21:06) took a strangely long amount of time; this might be related to T121161.

In this case, I don't think it's related to T121161: CentralAuthCreateLocalAccountJob failing on meta due to Echo deadlocks.

I looked at a couple possibilities:

  • Direct usage by Echo of one of those hooks. It doesn't listen to AuthPluginAutoCreate, and listens to LocalUserCreated, but does nothing if it's auto-created. There is also an Echo UserSaveSettings hook, but it just adjusts the user's count, it doesn't send a notification.
  • I grepped through on GitHub for others that could directly or indirectly trigger a notification on auto-create. For other wikis, it could be indirectly from NewUserMessage, which listens to AuthPluginAutoCreate and LocalUserCreated. However, NewUserMessage is not enabled at all on meta. (It does trigger notifications on auto-create if enabled and configured to do so (wgNewUserMessageOnAutoCreate)). I didn't find any others that trigger on signup.

So, my understanding from your post is, this is a brief window where we might expect the error (would the user see an exception here?), but that the window is too long in this case.

bd808 added a comment.Jul 14 2016, 2:14 AM

At least Clauricekoi doesn't seem to have been a failure. Instead the CentralAuthCreateLocalAccountJob on metawiki started at 21:04:56, the errors showed up from 21:04:57 to 21:05:42, and the CentralAuthCreateLocalAccountJob completed on 21:06:00. It seems like something in this block of code (based on the account attachment in CentralAuthPrimaryAuthenticationProvider::autoCreatedAccount() being logged at 21:04:56 and the on-wiki log message at 21:06) took a strangely long amount of time; this might be related to T121161.

@bd808: What would https://gerrit.wikimedia.org/r/#/c/297936/ do in this situation?

That error was raised on enwiki, so my proposed patch would have ignored the missing metawiki account and submitted a CentralAuthUnattachUserJob job. What happens next would depend on whether or not the CentralAuthCreateLocalAccountJob had finished by the time that the CentralAuthUnattachUserJob executed. If the local account on metawiki exists when the CentralAuthUnattachUserJob job runs then it will leave everything alone. If however the user still did not exist it would remove the localuser entry in the CA database.

JJMC89 added a comment.EditedJul 14 2016, 2:27 AM

[V4b4RwpAADsAAED4S-gAAAAR] 2016-07-14 02:26:16: Fatal exception of type "Exception" for Meatballs123

[V4b9FApAADsAAFbXlS8AAABM] 2016-07-14 02:46:44: Fatal exception of type "Exception" for Er_Sahaj_Arora

greg added a comment.Jul 14 2016, 6:39 AM

https://gerrit.wikimedia.org/r/#/c/289778/ with https://gerrit.wikimedia.org/r/#/c/289780/ fixes this particular case. I'll see if I can get those merged today.

These are still unmerged. Are they still relevant? They will need a shepherd now.

greg added a comment.Jul 14 2016, 6:40 AM

This task needs someone in the "assigned to" field. I'm looking at @Anomie given his responses thus far on this task (especially lately) and given that @Tgr is still out. This issue needs complete follow through now. It's lasted long enough.

Anomie added a comment.EditedJul 14 2016, 5:42 PM

There is also an Echo UserSaveSettings hook, but it just adjusts the user's count, it doesn't send a notification.

It adjusts the users count, after firing a bunch of HTTP requests to other wikis and waiting for them to complete. Even if it's not T121161, I still suspect that as a source of slowness.

So, my understanding from your post is, this is a brief window where we might expect the error (would the user see an exception here?), but that the window is too long in this case.

Yes. https://gerrit.wikimedia.org/r/#/c/298823/ should help with that by not making there be as much of a window where the CA transaction is committed while the local transaction isn't.

If however the user still did not exist it would remove the localuser entry in the CA database.

That's the case I was worried about, and I think that's right. We should probably wait for https://gerrit.wikimedia.org/r/#/c/298823/ before merging your patch then to avoid this race :( On that task, I'm waiting on a reply from @aaron about the potential for it to cause the opposite bug in fixing this one.

https://gerrit.wikimedia.org/r/#/c/289778/ with https://gerrit.wikimedia.org/r/#/c/289780/ fixes this particular case. I'll see if I can get those merged today.

These are still unmerged. Are they still relevant? They will need a shepherd now.

Those seem mostly not relevant to this bug (they might be relevant for other bugs though). The only relevant bit is covered by @bd808's patch now, which I think is good to go right after https://gerrit.wikimedia.org/r/#/c/298823/.

This task needs someone in the "assigned to" field. I'm looking at @Anomie given his responses thus far on this task (especially lately) and given that @Tgr is still out. This issue needs complete follow through now. It's lasted long enough.

The reward for work is more work? ;)

I think we're close on fixing the UBN nature of this task, @bd808's patch should reduce it from "OMG people can't log in!" to "logspam".

greg assigned this task to Anomie.Jul 14 2016, 6:05 PM

This task needs someone in the "assigned to" field. I'm looking at @Anomie given his responses thus far on this task (especially lately) and given that @Tgr is still out. This issue needs complete follow through now. It's lasted long enough.

The reward for work is more work? ;)

I think we're close on fixing the UBN nature of this task, @bd808's patch should reduce it from "OMG people can't log in!" to "logspam".

The reward for good work is more work, yep ;)

Assigning to you while this is still causing people to not log in. (You and not Bryan because it seems like you have the most direct working knowledge and will be able to judge when we've switched from "can't log in" to "logspam".)

It adjusts the users count, after firing a bunch of HTTP requests to other wikis and waiting for them to complete. Even if it's not T121161, I still suspect that as a source of slowness.

I agree it's a cause of slowness. It's an issue, but a planned and temporary one. We should be able to turn it off soon, as soon as our backfill scripts finish on all wikis.

Change 298817 abandoned by Anomie:
Add timing data logging for T119736

Reason:
Served its purpose.

https://gerrit.wikimedia.org/r/298817

greg added a comment.Jul 14 2016, 8:32 PM

I think we're close on fixing the UBN nature of this task, @bd808's patch should reduce it from "OMG people can't log in!" to "logspam".

I couldn't determine with a quick look which specific patch this was referring to: https://gerrit.wikimedia.org/r/#/c/297946/ (2 +1s already) or https://gerrit.wikimedia.org/r/#/c/297936/ ?

Can we just get that done? If there's something that can get us from UBN! to High or Normal, let's do it ASAP.

The latest run of the script is done. This should cover any cases caused by Echo:

[07/14/16 16:02] <legoktm> matt_flaschen: the script finished
[07/14/16 16:02] <legoktm> legoktm@terbium:~$ grep "not found" clurm-20160713.log -c
[07/14/16 16:02] <legoktm> 134

Change 299094 had a related patch set uploaded (by Aaron Schulz):
Make pre-commit DB callbacks more atomic with multi-DB updates

https://gerrit.wikimedia.org/r/299094

Change 299087 had a related patch set uploaded (by Aaron Schulz):
Make LocalFile::lock() initialize DBO_TRX transactions

https://gerrit.wikimedia.org/r/299087

Josve05a updated the task description. (Show Details)Jul 15 2016, 2:34 AM
Josve05a updated the task description. (Show Details)Jul 15 2016, 2:37 AM
greg added a comment.Jul 15 2016, 4:14 AM

Status update? How many users are experiencing this still? Are people newly getting bit by it?

Status update? How many users are experiencing this still? Are people newly getting bit by it?

Meatballs123 is unresolved.

[V4b4RwpAADsAAED4S-gAAAAR] 2016-07-14 02:26:16: Fatal exception of type "Exception" for Meatballs123

greg lowered the priority of this task from Unbreak Now! to High.Jul 15 2016, 6:26 AM

Status update? How many users are experiencing this still? Are people newly getting bit by it?

23:49 < AaronSchu> greg-g: is https://phabricator.wikimedia.org/T119736 still urgent?
04:13 <    greg-g> AaronSchulz: I guess I'm not clear, it's not clear on that task what the status is of new people who are experiencing the issue.
04:53 < AaronSchu> greg-g: I'd like to get my patches in the next train at least. Not sure if there is a need to rush beyond that. Also, I'm supposed to be on vacation till Mon ;)
05:26 <    greg-g> AaronSchulz: Are they worth having anom.ie /lego / whoever deploy them tomorrow, or can they wait? Basically, I need to be able to say "This is no longer UBN!" for a reason.
06:02 < AaronSchu> greg-g: judging from kibana I'd go with the regular train
aaron added a comment.Jul 15 2016, 6:53 AM

'+channel:exception +message:"* local user data *"' shows about 96 users in the last 6 hours.

@greg: Status update: I +2ed the patches that should turn this from "OMG people can't log in!" to "logspam". Current status is that they'll go with the next train.

Change 299094 merged by jenkins-bot:
Make pre-commit DB callbacks more atomic with multi-DB updates

https://gerrit.wikimedia.org/r/299094

Change 299087 merged by jenkins-bot:
Make LocalFile::lock() initialize DBO_TRX transactions

https://gerrit.wikimedia.org/r/299087

Change 298823 merged by jenkins-bot:
Revert "Wrap CentralAuthUser::attach in an explicit transaction"

https://gerrit.wikimedia.org/r/298823

Change 297946 merged by jenkins-bot:
Detach accounts during global rename

https://gerrit.wikimedia.org/r/297946

Change 297936 merged by jenkins-bot:
Cleanup localuser records when wiki is missing data

https://gerrit.wikimedia.org/r/297936

greg added a comment.Jul 15 2016, 5:53 PM

@greg: Status update: I +2ed the patches that should turn this from "OMG people can't log in!" to "logspam". Current status is that they'll go with the next train.

Thanks @Anomie (and everyone else) :) Have a good weekend.

It adjusts the users count, after firing a bunch of HTTP requests to other wikis and waiting for them to complete. Even if it's not T121161, I still suspect that as a source of slowness.

I agree it's a cause of slowness. It's an issue, but a planned and temporary one. We should be able to turn it off soon, as soon as our backfill scripts finish on all wikis.

The backfill scripts are done, and the temporary transition config has been removed, so these Echo HTTP requests to get counts from other wikis are no longer happening.

Change 299896 had a related patch set uploaded (by Gergő Tisza):
Cleanup localuser records when wiki is missing data

https://gerrit.wikimedia.org/r/299896

Change 299896 merged by jenkins-bot:
Cleanup localuser records when wiki is missing data

https://gerrit.wikimedia.org/r/299896

Tgr added a comment.Jul 20 2016, 8:19 PM

Exception frequency is now back to pre-Echo levels, which is to say, still high, only 100/day instead of 1000/day (there is probably no user impact anymore due to Bryan's patch, though). Do we have any idea what causes that? Previously I have assumed T137973 but that does not match the timeline well (login errors were very rare till end of June, hundreds a day from July 1 to 5, thousands a day from 6 to 19, back to hundreds a day now; renames were broken mid-June and fixed on July 11).

Tgr added a comment.Jul 20 2016, 8:20 PM

And why is the logstash data so different from P3370?

which is to say, still high, only 100/day instead of 1000/day (there is probably no user impact anymore due to Bryan's patch, though). Do we have any idea what causes that?

Bryan's fix doesn't land everywhere until the train tomorrow, as far as I know.

I still see a fair number of the rollbacks for "To avoid creating high replication lag" in CreateAccount. I tracked that as far as the call to User::addToDatabase() being slow, I suppose the next step is to push timing logging down into User unless @aaron has a better idea.

Agreed. I've set a meeting for 20-July-2016 at 11:30-12:30 SF time (subject to change, but wanted to get something on people's calendars before their schedules fill up). Anyone who wants to join, please email me directly well ahead of time so I can add you to the Hangout.

Notes posted at https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCentralAuth/Retrospective

bd808 added a comment.EditedJul 20 2016, 10:22 PM

And why is the logstash data so different from P3370?

The logstash data doesn't look much different to me for the days that are actually currently in the backing indices. We dropped everything before 2016-07-01 as part of the Elasticsearch and Kibana upgrade this week and 2016-07-04's data is still being imported by my recovery script.

The relevant counts from @Legoktm's grep are:

exception.log-20160701.gz:420
exception.log-20160702.gz:352
exception.log-20160703.gz:196
exception.log-20160704.gz:246
exception.log-20160705.gz:298
exception.log-20160706.gz:872
exception.log-20160707.gz:3284
exception.log-20160708.gz:4056
exception.log-20160709.gz:4686
exception.log-20160710.gz:4084
exception.log-20160711.gz:4730

Note that the timestamp of an exception.log archive file is the date that the rotation happened, not necessarily the date that the events in the file occurred.

The last 24 hours look much much better:

Change 298527 abandoned by Mattflaschen:
Add single-user option to checkLocalUser.php

Reason:
This branch is not deployed anymore.

https://gerrit.wikimedia.org/r/298527

Change 298528 abandoned by Mattflaschen:
Add single-user option to checkLocalUser.php

Reason:
This branch is not deployed anymore.

https://gerrit.wikimedia.org/r/298528

Tgr added a comment.Jul 21 2016, 12:41 AM

I still see a fair number of the rollbacks for "To avoid creating high replication lag" in CreateAccount. I tracked that as far as the call to User::addToDatabase() being slow, I suppose the next step is to push timing logging down into User unless @aaron has a better idea.

Oh yeah, sorry, you already wrote that earilier. The volume of those errors seems to match the volume of the login errors. If I interpret these messages right, most of the time is spent waiting for the INSERT IGNORE INTO user query to finish. Something gap locks user and the insert times out waiting for it?

(If that's the case the whole "to avoid creating high replication lag" thing seems counterproductive - if the query is only slow due to locking and not by its own right, replicating it would not be slow and there is no need to roll it back and potentially break DB consistency in the process.)

Bovlb removed a subscriber: Bovlb.Jul 21 2016, 4:09 AM

I still see a fair number of the rollbacks for "To avoid creating high replication lag" in CreateAccount. I tracked that as far as the call to User::addToDatabase() being slow, I suppose the next step is to push timing logging down into User unless @aaron has a better idea.

Oh yeah, sorry, you already wrote that earilier. The volume of those errors seems to match the volume of the login errors. If I interpret these messages right, most of the time is spent waiting for the INSERT IGNORE INTO user query to finish. Something gap locks user and the insert times out waiting for it?

(If that's the case the whole "to avoid creating high replication lag" thing seems counterproductive - if the query is only slow due to locking and not by its own right, replicating it would not be slow and there is no need to roll it back and potentially break DB consistency in the process.)

It looks to me that you're right.

Ten days later, are there any updates to share?
Is this pretty much solved? Or not?

RandomDSdevel added a subscriber: RandomDSdevel.

Twenty days later, are there any updates to share?
Is this pretty much solved? Or not?

bd808 closed this task as Resolved.Aug 11 2016, 3:28 PM

Twenty days later, are there any updates to share?
Is this pretty much solved? Or not?

I did a new grep for the error message

1fluorine:/a/mw-log/archive
2bd808$ zgrep -c "Could not find local user data for" exception.log-20160*
3exception.log-20160513.gz:108
4exception.log-20160514.gz:12
5exception.log-20160515.gz:14
6exception.log-20160516.gz:20
7exception.log-20160517.gz:34
8exception.log-20160518.gz:26
9exception.log-20160519.gz:20
10exception.log-20160520.gz:42
11exception.log-20160521.gz:30
12exception.log-20160522.gz:30
13exception.log-20160523.gz:22
14exception.log-20160524.gz:36
15exception.log-20160525.gz:22
16exception.log-20160526.gz:40
17exception.log-20160527.gz:58
18exception.log-20160528.gz:94
19exception.log-20160529.gz:22
20exception.log-20160530.gz:38
21exception.log-20160531.gz:114
22exception.log-20160601.gz:374
23exception.log-20160602.gz:62
24exception.log-20160603.gz:40
25exception.log-20160604.gz:48
26exception.log-20160605.gz:34
27exception.log-20160606.gz:30
28exception.log-20160607.gz:52
29exception.log-20160608.gz:116
30exception.log-20160609.gz:150
31exception.log-20160610.gz:128
32exception.log-20160611.gz:412
33exception.log-20160612.gz:256
34exception.log-20160613.gz:1052
35exception.log-20160614.gz:512
36exception.log-20160615.gz:370
37exception.log-20160616.gz:302
38exception.log-20160617.gz:392
39exception.log-20160618.gz:280
40exception.log-20160619.gz:350
41exception.log-20160620.gz:288
42exception.log-20160621.gz:346
43exception.log-20160622.gz:316
44exception.log-20160623.gz:490
45exception.log-20160624.gz:240
46exception.log-20160625.gz:270
47exception.log-20160626.gz:166
48exception.log-20160627.gz:214
49exception.log-20160628.gz:214
50exception.log-20160629.gz:316
51exception.log-20160630.gz:1354
52exception.log-20160701.gz:420
53exception.log-20160702.gz:352
54exception.log-20160703.gz:196
55exception.log-20160704.gz:246
56exception.log-20160705.gz:298
57exception.log-20160706.gz:872
58exception.log-20160707.gz:3284
59exception.log-20160708.gz:4056
60exception.log-20160709.gz:4686
61exception.log-20160710.gz:4084
62exception.log-20160711.gz:4730
63exception.log-20160712.gz:5298
64exception.log-20160713.gz:4492
65exception.log-20160714.gz:2632
66exception.log-20160715.gz:2260
67exception.log-20160716.gz:2278
68exception.log-20160717.gz:2520
69exception.log-20160718.gz:2728
70exception.log-20160719.gz:2586
71exception.log-20160720.gz:1006
72exception.log-20160721.gz:122
73exception.log-20160722.gz:50
74exception.log-20160723.gz:0
75exception.log-20160724.gz:0
76exception.log-20160725.gz:0
77exception.log-20160726.gz:0
78exception.log-20160727.gz:0
79exception.log-20160728.gz:0
80exception.log-20160729.gz:0
81exception.log-20160730.gz:0
82exception.log-20160731.gz:0
83exception.log-20160801.gz:0
84exception.log-20160802.gz:0
85exception.log-20160803.gz:0
86exception.log-20160804.gz:0
87exception.log-20160805.gz:0
88exception.log-20160806.gz:0
89exception.log-20160807.gz:0
90exception.log-20160808.gz:0
91exception.log-20160809.gz:0
92exception.log-20160810.gz:0
93exception.log-20160811.gz:0

The error rate dropped to zero in the 2016-07-22 archive file (date created, not necessarily date of log events) and has stayed there since. I actually expected to still see a few happening each day, but it looks like for now at least we have cleaned up the error paths that were causing rollback of the CA data without rollback of the associated local user data. I think we can call this resolved and open a new bug if we have a regression in the future.

Change 289778 abandoned by Chad:
Safely handle policies for local groups

https://gerrit.wikimedia.org/r/289778

Change 289780 abandoned by Chad:
Redo local password enforcement

https://gerrit.wikimedia.org/r/289780