Page MenuHomePhabricator

Some account creations causing exceptions
Closed, ResolvedPublic

Description

Only 29 in exception.log like this so far, but should get this fixed asap

2015-07-02 19:26:22 mw1074 enwiki exception INFO: [8500a51c] /w/index.php?title=Special:UserLogin&action=submitlogin&type=signup&returnto=Under_Armour_All-America_Game&returntoquery=action%3Dedit%26section%3D13 Exception from line 2130 of /srv/mediawiki/php-1.26wmf12/extensions/CentralAuth/includes/CentralAuthUser.php: Could not find local user data for <XXXXXXX>@enwiki
#0 /srv/mediawiki/php-1.26wmf12/extensions/CentralAuth/includes/CentralAuthUser.php(2070): CentralAuthUser->localUserData()
#1 /srv/mediawiki/php-1.26wmf12/extensions/CentralAuth/includes/CentralAuthUser.php(2030): CentralAuthUser->queryAttached()
#2 /srv/mediawiki/php-1.26wmf12/extensions/CentralAuth/includes/CentralAuthHooks.php(2078): CentralAuthUser->getLocalGroups()
#3 /srv/mediawiki/php-1.26wmf12/includes/Hooks.php(204): CentralAuthHooks::onPasswordPoliciesForUser()
#4 /srv/mediawiki/php-1.26wmf12/includes/password/UserPasswordPolicy.php(107): Hooks::run()
#5 /srv/mediawiki/php-1.26wmf12/includes/password/UserPasswordPolicy.php(74): UserPasswordPolicy->getPoliciesForUser()
#6 /srv/mediawiki/php-1.26wmf12/includes/User.php(859): UserPasswordPolicy->checkUserPassword()
#7 /srv/mediawiki/php-1.26wmf12/includes/specials/SpecialUserlogin.php(548): User->checkPasswordValidity()
#8 /srv/mediawiki/php-1.26wmf12/includes/specials/SpecialUserlogin.php(377): LoginForm->addNewAccountInternal()
#9 /srv/mediawiki/php-1.26wmf12/includes/specials/SpecialUserlogin.php(314): LoginForm->addNewAccount()
#10 /srv/mediawiki/php-1.26wmf12/includes/specialpage/SpecialPage.php(384): LoginForm->execute()
#11 /srv/mediawiki/php-1.26wmf12/includes/specialpage/SpecialPageFactory.php(550): SpecialPage->run()
#12 /srv/mediawiki/php-1.26wmf12/includes/MediaWiki.php(249): SpecialPageFactory::executePath()
#13 /srv/mediawiki/php-1.26wmf12/includes/MediaWiki.php(668): MediaWiki->performRequest()
#14 /srv/mediawiki/php-1.26wmf12/includes/MediaWiki.php(465): MediaWiki->main()
#15 /srv/mediawiki/php-1.26wmf12/index.php(41): MediaWiki->run()
#16 /srv/mediawiki/w/index.php(3): include()
#17 {main} {"private":false}

Event Timeline

csteipp claimed this task.
csteipp raised the priority of this task from to Unbreak Now!.
csteipp updated the task description. (Show Details)
csteipp added a project: Security-Team.
csteipp added subscribers: csteipp, hoo.

The first part that confused me was how the global user could exist when in onPasswordPoliciesForUser we do

$central = CentralAuthUser::getInstance( $user );
if ( $central->exists() ) {
...

So checking the globaluser table for the user who was in the exception above (timestamp: 2015-07-02 19:26:22), the globaluser table shows gu_registration = 20150702192620. localuser.lu_attached_timestamp = 20150702192620 too. Both 2 seconds before the exception was thrown. But more weird in this case is that on enwiki, the user is in the user table, and user_registration = 20150702192618. 2 seconds before the CentralAuth account.

I'm assuming our db clusters aren't 2 seconds off in their time.

It seems something like,

  1. Account creation is attempted
  2. CentralAuth user is created, and the centralauth transaction finishes
  3. Something happens, and the enwiki transaction isn't finished yet.
  4. Request is retried.
  5. In SpecialUserLogin $u->idForName() is 0
  6. CentralAuthUser for the account does exist, so CentralAuthUser::localUserData is called for the local wiki, exception is thrown.
  7. enwiki transaction finishes, so user is in the user table

@aaron, do you know if that sequence of events is possible?

Change 222520 had a related patch set uploaded (by Legoktm):
Revert "Add global password policies"

https://gerrit.wikimedia.org/r/222520

Change 222521 had a related patch set uploaded (by CSteipp):
Revert "Add global password policies"

https://gerrit.wikimedia.org/r/222521

Change 222520 merged by jenkins-bot:
Revert "Add global password policies"

https://gerrit.wikimedia.org/r/222520

Change 222521 merged by jenkins-bot:
Revert "Add global password policies"

https://gerrit.wikimedia.org/r/222521

Global checking reverted for now, until we can figure out the issue:

https://gerrit.wikimedia.org/r/#/c/222520/

@aaron, do you know if that sequence of events is possible?

That would be my guess. That captcha will slow human retry attempts a bit but I suppose it's possible. Either the local/central transactions were half finished or the local DB had more slave lag.

Do all the exception involve the same wiki? Do any involve some other wiki having a local user row missing while logging into the current wiki?

It's interesting that register() does not use a trx but attach() does (due to https://gerrit.wikimedia.org/r/#/c/154443/). Out of curiosity, I wonder what that ways for. If the local/central DBs both had end-of-request commits there would less time for them to be out of sync.

Btw, queryAttached() seems kind of inefficient, loading all CA/localuser rows for a user and doing a local user query for each one. I don't see too many callers now (not at key use cases), and some of them are just fallbacks for filling a denormalized value (home/editcount). For users on many wikis, it seems like this would slow down login, though perhaps not by too much in regular cases.

Change 223702 had a related patch set uploaded (by CSteipp):
Add global password policies

https://gerrit.wikimedia.org/r/223702

Do all the exception involve the same wiki? Do any involve some other wiki having a local user row missing while logging into the current wiki?

Lots of different languages and projects. In ever case I've seen, this always happens on the create account form (LoginForm->addNewAccount() is in the backtrace), and the user ends up with,

  • CentralAuth account is created
  • account is attached on exactly 1 wiki, the one where the exception happend
  • the local wiki account has been created in the DB

It's interesting that register() does not use a trx but attach() does (due to https://gerrit.wikimedia.org/r/#/c/154443/). Out of curiosity, I wonder what that ways for. If the local/central DBs both had end-of-request commits there would less time for them to be out of sync.

@Legoktm mentioned attach. I think he added those?

Btw, queryAttached() seems kind of inefficient, loading all CA/localuser rows for a user and doing a local user query for each one. I don't see too many callers now (not at key use cases), and some of them are just fallbacks for filling a denormalized value (home/editcount). For users on many wikis, it seems like this would slow down login, though perhaps not by too much in regular cases.

Yeah, since logins are (relatively) rare, I'm hopeful the impact isn't noticeable. I also want to fix T99007 which needs a call to queryAttached(), so doing the full lookup once on login seems unavoidable, unless we cache local groups in a central location.

It's interesting that register() does not use a trx but attach() does (due to https://gerrit.wikimedia.org/r/#/c/154443/). Out of curiosity, I wonder what that ways for. If the local/central DBs both had end-of-request commits there would less time for them to be out of sync.

It was for T41996. Hoo and I discussed that bug IRL during Wikimania and that one solution ended up working...

I wonder if checkPasswordValidity() had a $purpose argument string (newaccount/newpassword/login), passed down to the hook. If the CA hooks could then avoid calling the attachment state methods for new accounts since you don't care about their privileges. Would that also avoid these problems?

Change 224201 had a related patch set uploaded (by Aaron Schulz):
Made the prior user existence check in LoginForm use DB_MASTER

https://gerrit.wikimedia.org/r/224201

Change 224201 merged by jenkins-bot:
Made the prior user existence check in LoginForm use DB_MASTER

https://gerrit.wikimedia.org/r/224201

Thanks Aaron.

@Legoktm, https://gerrit.wikimedia.org/r/#/c/223702/ should be ready to merge now if you have a few minutes.

I wonder if checkPasswordValidity() had a $purpose argument string (newaccount/newpassword/login), passed down to the hook. If the CA hooks could then avoid calling the attachment state methods for new accounts since you don't care about their privileges. Would that also avoid these problems?

I'd still recommend considering this too.

Change 224658 had a related patch set uploaded (by CSteipp):
Add "purpose" to password validity check

https://gerrit.wikimedia.org/r/224658

I wonder if checkPasswordValidity() had a $purpose argument string (newaccount/newpassword/login), passed down to the hook. If the CA hooks could then avoid calling the attachment state methods for new accounts since you don't care about their privileges. Would that also avoid these problems?

I'd still recommend considering this too.

Done in https://gerrit.wikimedia.org/r/224658. That simplified the installer a little, although it still feel a little hacky imo.

Change 224658 merged by jenkins-bot:
Add "purpose" to password validity check

https://gerrit.wikimedia.org/r/224658

All patches mentioned in this task have been merged more than two weeks ago.

@csteipp: What exactly is left to do here, and should this task still be "Unbreak now" priority or can that be lowered?

Yeah, let's close this, and we can see what happens when I get the centralauth patch that caused this re-merged.

Change 223702 merged by jenkins-bot:
Re-add global password policies

https://gerrit.wikimedia.org/r/223702

Change 251009 had a related patch set uploaded (by CSteipp):
Re-add global password policies

https://gerrit.wikimedia.org/r/251009

Change 251009 abandoned by CSteipp:
Re-add global password policies

https://gerrit.wikimedia.org/r/251009