Page MenuHomePhabricator

GlobalRename is broken, presumably due to authmanager changes
Closed, ResolvedPublic

Description

2016-05-18 08:24:57 [VzwKsQpAMFQAALQ69ZgAAAAH] mw1007 mediawikiwiki 1.28.0-wmf.2 exception ERROR: [VzwKsQpAMFQAALQ69ZgAAAAH] [no req]   MWException from line 3931 of /srv/mediawiki/php-1.28.0-wmf.2/includes/user/User.php: CAS update failed on user_touched for user ID '6343509' (read from slave); the version of the user to be saved is older than the current version. {"exception_id":"VzwKsQpAMFQAALQ69ZgAAAAH"} 
[Exception MWException] (/srv/mediawiki/php-1.28.0-wmf.2/includes/user/User.php:3931) CAS update failed on user_touched for user ID '6343509' (read from slave); the version of the user to be saved is older than the current version.
  #0 /srv/mediawiki/php-1.28.0-wmf.2/includes/session/SessionManager.php(306): User->saveSettings()
  #1 /srv/mediawiki/php-1.28.0-wmf.2/extensions/Renameuser/RenameuserSQL.php(199): MediaWiki\Session\SessionManager->invalidateSessionsForUser(User)
  #2 /srv/mediawiki/php-1.28.0-wmf.2/extensions/CentralAuth/includes/LocalRenameJob/LocalRenameUserJob.php(59): RenameuserSQL->rename()
  #3 /srv/mediawiki/php-1.28.0-wmf.2/extensions/CentralAuth/includes/LocalRenameJob/LocalRenameJob.php(27): LocalRenameUserJob->doRun()
  #4 /srv/mediawiki/php-1.28.0-wmf.2/includes/jobqueue/JobRunner.php(265): LocalRenameJob->run()
  #5 /srv/mediawiki/php-1.28.0-wmf.2/includes/jobqueue/JobRunner.php(179): JobRunner->executeJob(LocalRenameUserJob, BufferingStatsdDataFactory, integer)
  #6 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run(array)
  #7 {main}

Renames are currently failing on mediawiki.org right now

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

There are still global rename requests that are still not finished. Can someone fix this (again)?

In T135656#2312257, @Pokefan95 wrote:

There are still global rename requests that are still not finished. Can someone fix this (again)?

+1

Mentioned in SAL [2016-05-23T19:11:43Z] <legoktm> manually kicking stuck global renames (T135656)

is this still an issue? I'm trying to close out wmf.2 as wmf.3 is taking it's place shortly.

This was left open because some rename attempts were interrupted by the bug and needed cleanup. Not sure whether that happened already. The bug itself has been fixed.

The cleanup happened. However T135954: MWException on global rename: CAS update failed on gu_cas_token was a failed rename with the same stacktrace except *after* the patch was deployed.

It's not the same stack trace, look closer.

There are three CA user writes (and CAS increases) in succession: User::saveSettings() at the beginning of invalidateSessionsForUser() triggers one (with master data), then AuthPlugin::getUserInstance()->resetAuthToken() another (with slave data), then the provider's invalidateSessionsForUser handler a third one (with master data). Either the slave data is wrong (although in that case I would expect the second write to fail), or the in-process cache is not invalidated properly.

Thanks Gergő. The slave data one is wrong. What can happen there is that the slave write succeeds but then reloads the just-saved data from a slave that hasn't gotten the update yet.

Patch for that is in https://gerrit.wikimedia.org/r/291229.

This comment was removed by Poyekhali.

Seems to only be affecting that one rename, as all other rename requests are completing successfully.

Actually, forget what I said earlier. It's the same bug, apparently, with renames failing on MediaWiki.org again. https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress

Rename is broken in MediaWiki (presumably either due to wmf.5, or AuthManager getting enabled).

Query: UPDATE  `localnames` SET ln_name = 'TestRenameandpaws' WHERE ln_wiki = 'mediawikiwiki' AND ln_name = 'TestRenamepaws'
Function: CentralAuthUser::updateLocalName
Error: 1062 Duplicate entry 'mediawikiwiki-TestRenameandpaws' for key 'PRIMARY' (10.64.16.30)

  #0 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/Database.php(901): DatabaseBase->reportQueryError(string, integer, string, string, boolean)
  #1 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/Database.php(1512): DatabaseBase->query(string, string)
  #2 /srv/mediawiki/php-1.28.0-wmf.5/extensions/CentralAuth/includes/CentralAuthUser.php(2056): DatabaseBase->update(string, array, array, string)
  #3 /srv/mediawiki/php-1.28.0-wmf.5/extensions/CentralAuth/includes/CentralAuthHooks.php(717): CentralAuthUser->updateLocalName(string, string)
  #4 /srv/mediawiki/php-1.28.0-wmf.5/includes/Hooks.php(195): CentralAuthHooks::onRenameUserComplete(integer, string, string)
  #5 /srv/mediawiki/php-1.28.0-wmf.5/extensions/Renameuser/RenameuserSQL.php(354): Hooks::run(string, array)
  #6 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/Database.php(2482): Closure$RenameuserSQL::rename()
  #7 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/Database.php(2689): DatabaseBase->runOnTransactionIdleCallbacks()
  #8 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/loadbalancer/LoadBalancer.php(1079): DatabaseBase->commit(string, string)
  #9 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/loadbalancer/LBFactory.php(196): LoadBalancer->commitMasterChanges(string)
  #10 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/loadbalancer/LBFactoryMulti.php(411): Closure$LBFactory::forEachLBCallMethod(LoadBalancer, string, array)
  #11 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/loadbalancer/LBFactory.php(199): LBFactoryMulti->forEachLB(Closure$LBFactory::forEachLBCallMethod;240668706, array)
  #12 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/loadbalancer/LBFactory.php(240): LBFactory->forEachLBCallMethod(string, array)
  #13 /srv/mediawiki/php-1.28.0-wmf.5/includes/jobqueue/JobRunner.php(543): LBFactory->commitMasterChanges(string)
  #14 /srv/mediawiki/php-1.28.0-wmf.5/includes/jobqueue/JobRunner.php(267): JobRunner->commitMasterChanges(LocalRenameUserJob)
  #15 /srv/mediawiki/php-1.28.0-wmf.5/includes/jobqueue/JobRunner.php(179): JobRunner->executeJob(LocalRenameUserJob, BufferingStatsdDataFactory, integer)
  #16 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run(array)

The rename log has duplicate records. So it seems like the jobs are triggered twice.

48c319b7a54c709ec5c172d704b2750267c945c5 seems like the most relevant patch, but the only thing that happens on UserSaveSettings is the central user being saved and that seems harmless.

There are also a bunch of CAS errors:

/srv/mediawiki/php-1.28.0-wmf.5/includes/user/User.php:3914 CAS update failed on user_touched for user ID '6540927' (read from slave); the version of the user to be saved is older than the current version.
  #0 /srv/mediawiki/php-1.28.0-wmf.5/extensions/Renameuser/RenameuserSQL.php(353): User->saveSettings()
  #1 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/Database.php(2482): Closure$RenameuserSQL::rename()
  #2 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/Database.php(2689): DatabaseBase->runOnTransactionIdleCallbacks()
  #3 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/loadbalancer/LoadBalancer.php(1079): DatabaseBase->commit(string, string)
  #4 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/loadbalancer/LBFactory.php(196): LoadBalancer->commitMasterChanges(string)
  #5 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/loadbalancer/LBFactoryMulti.php(411): Closure$LBFactory::forEachLBCallMethod(LoadBalancer, string, array)
  #6 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/loadbalancer/LBFactory.php(199): LBFactoryMulti->forEachLB(Closure$LBFactory::forEachLBCallMethod;240668706, array)
  #7 /srv/mediawiki/php-1.28.0-wmf.5/includes/db/loadbalancer/LBFactory.php(240): LBFactory->forEachLBCallMethod(string, array)
  #8 /srv/mediawiki/php-1.28.0-wmf.5/includes/jobqueue/JobRunner.php(543): LBFactory->commitMasterChanges(string)
  #9 /srv/mediawiki/php-1.28.0-wmf.5/includes/jobqueue/JobRunner.php(267): JobRunner->commitMasterChanges(LocalRenameUserJob)
  #10 /srv/mediawiki/php-1.28.0-wmf.5/includes/jobqueue/JobRunner.php(179): JobRunner->executeJob(LocalRenameUserJob, BufferingStatsdDataFactory, integer)
  #11 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run(array)

also CA CAS errors:

/srv/mediawiki/php-1.28.0-wmf.5/extensions/CentralAuth/includes/CentralAuthUser.php:2584 CAS update failed on gu_cas_token for user ID '46867163' (read from master); the version of the user to be saved is older than the current version.
  #0 /srv/mediawiki/php-1.28.0-wmf.5/extensions/CentralAuth/includes/CentralAuthUser.php(2533): CentralAuthUser->saveSettings()
  #1 /srv/mediawiki/php-1.28.0-wmf.5/extensions/CentralAuth/includes/session/CentralAuthTokenSessionProvider.php(155): CentralAuthUser->resetAuthToken()
  #2 /srv/mediawiki/php-1.28.0-wmf.5/includes/session/SessionManager.php(314): CentralAuthTokenSessionProvider->invalidateSessionsForUser(User)
  #3 /srv/mediawiki/php-1.28.0-wmf.5/extensions/Renameuser/RenameuserSQL.php(200): MediaWiki\Session\SessionManager->invalidateSessionsForUser(User)
  #4 /srv/mediawiki/php-1.28.0-wmf.5/extensions/CentralAuth/includes/LocalRenameJob/LocalRenameUserJob.php(59): RenameuserSQL->rename()
  #5 /srv/mediawiki/php-1.28.0-wmf.5/extensions/CentralAuth/includes/LocalRenameJob/LocalRenameJob.php(27): LocalRenameUserJob->doRun()
  #6 /srv/mediawiki/php-1.28.0-wmf.5/includes/jobqueue/JobRunner.php(265): LocalRenameJob->run()
  #7 /srv/mediawiki/php-1.28.0-wmf.5/includes/jobqueue/JobRunner.php(179): JobRunner->executeJob(LocalRenameUserJob, BufferingStatsdDataFactory, integer)
  #8 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run(array)

Maybe those cause the job to be retried? Although they seem to happen earlier than the RenameUserComplete call...

Maybe those cause the job to be retried?

Probably not, they seem to be less frequent than the duplicate key errors.

From the log table records, this seems somewhat random: some renames are attempted twice, some three times, some only once. There is always exactly one hour between attempts, so it does look like failed job retries, after all.

Change 293285 had a related patch set uploaded (by Gergő Tisza):
Use master DB when touching the user to signal rename end

https://gerrit.wikimedia.org/r/293285

That fixes the User CAS issue, I think. No idea about the CA one. There are a number of CA user instance accesses by the time the job gets to the invalidateSessionsForUser call, but all of them are created via CentralAuthUser::getMasterInstance so neither slave lag nor the in-process cache getting outdated seems to be the problem. Also, very little of this code is affected by AuthManager being enabled; the only difference I can see it that SessionManager::invalidateSessionsForUser skips the resetAuthToken call via the authplugin user.

mysql:wikiadmin@db1077 [mediawikiwiki]> select l1.log_timestamp from logging l1 join logging l2 where l1.log_type = 'renameuser' and l2.log_type = 'renameuser' and l1.log_timestamp > '20160605000000' and l2.log_timestamp > '20160605000000' and l1.log_id > l2.log_id and l1.log_title = l2.log_title order by l1.log_timestamp desc;
+----------------+
| log_timestamp  |
+----------------+
| 20160608111934 |
| 20160608091704 |
| 20160608084301 |
| 20160608084233 |
| 20160608081317 |
| 20160608081241 |
| 20160608081104 |
| 20160608035637 |
| 20160608033417 |
| 20160607224212 |
| 20160607223609 |
| 20160607220450 |
| 20160607220431 |
| 20160607220258 |
+----------------+

So duplicate inserts started at 22h UTC which means the first job error was at 21h UTC. Train was deployed at 20h UTC and AuthManager at 23:20 UTC so this was caused by the train.

There are also other accounts affected (apart from Acee8):

MariaDB [centralauth_p]> select * from renameuser_status WHERE ru_oldname != "Acee8";
+--------------------------+-----------------------------+---------------+------------+
| ru_oldname               | ru_newname                  | ru_wiki       | ru_status  |
+--------------------------+-----------------------------+---------------+------------+
| Alessandrodifrancesco    | AlessandroDiFrancesco       | mediawikiwiki | inprogress |
| Amor Verlag              | Amor Verlag GmbH            | mediawikiwiki | inprogress |
| Defindia                 | ShaifaliC                   | mediawikiwiki | inprogress |
| Dr Dinabandhu sahoo      | Dinabandhu sahoo            | mediawikiwiki | inprogress |
| GarrattPublishing        | Garratt.P                   | mediawikiwiki | inprogress |
| IngQuim                  | Nterikxouagg                | mediawikiwiki | inprogress |
| Intotherei               | Renamed user g82jgk209f31da | mediawikiwiki | inprogress |
| Miladghaznavi            | Babak Kavian                | mediawikiwiki | inprogress |
| Miskuzmidaku             | ETaraku                     | mediawikiwiki | inprogress |
| Offshore Heavy Transport | Jwilskow                    | mediawikiwiki | inprogress |
| Piello                   | Pino995                     | mediawikiwiki | inprogress |
| Poriya78                 | Poriyamh                    | mediawikiwiki | inprogress |
| Radenmonicasilvia        | Raden Monica Silvia         | mediawikiwiki | inprogress |
| Ralzune                  | Androiduser96EU             | mediawikiwiki | inprogress |
| TestRenamepaws           | TestRenameandpaws           | mediawikiwiki | inprogress |
| ささかわ                 | ササカワドクソ              | mediawikiwiki | inprogress |
+--------------------------+-----------------------------+---------------+------------+
16 rows in set (0.00 sec)

Change 293285 merged by jenkins-bot:
Use master DB when touching the user to signal rename end

https://gerrit.wikimedia.org/r/293285

Change 293383 had a related patch set uploaded (by Thcipriani):
Use master DB when touching the user to signal rename end

https://gerrit.wikimedia.org/r/293383

Change 293383 merged by jenkins-bot:
Use master DB when touching the user to signal rename end

https://gerrit.wikimedia.org/r/293383

Vituzzu made two test renames after the backport, and they worked, so I think this is fixed (if not we will know soon since wmf.5 is now on all non-wikipedias).

I cleaned up the bad localnames entries that were behind the exception in T135656#2363474, just now.

Seems to be working, recent renames seem to be going through. Of course, we need to manually complete the ones stuck in limbo at Special:GlobalRenameProgress.

The stuck ones are probably completed, the job just failed to exit successfully. Not sure how to unstuck them, maybe delete the renameuser_status rows?

As a side note, it would be nice to move session/auth token resets to just using the session storage class (e.g. redis). Do these really need to do heavy weight User object updates? It seems like a lot of problems could be avoided that way.

The user_token field in the database (and gu_auth_token for CA) need to be updated too, otherwise anyone who had logged in with the "remember me" check box checked will just start a new logged-in session thanks to the token cookie. And if you're updating those fields you have to update the CAS fields, and then you haven't solved anything.

Moving the token fields out of the DB and into redis seems reasonable, although I think some things rely on the user CAS updates (like the active-in-last-30-days statistics).

Can we count on the backend store (redis or whatever else a third party might use) to keep those tokens for the duration of $wgExtendedLoginCookies, especially considering T68699: Increase "remember me" login cookie expiry from 30 days to 1 year on Wikimedia wikis? Or are the backend stores we're talking about reasonably likely to have eviction or non-persistence across restarts?

Redis can persist its data to disk and act like a noSQL database, although I have no idea if the Wikimedia instances are configured to do that.

Then again, just using a separate table for tokens might give the same benefits with less complexity.

Per memcached.pp, we use rdb snapshots for redis persistence. If we replace storage with cassandra, it will be ssd backed. Either way, we have persistence now (and will continue to).

''We'' do, sure. But let's make sure we don't break every non-WMF wiki while we're at it.

Change 293782 had a related patch set uploaded (by Gergő Tisza):
Use master DB when touching the user to signal rename end

https://gerrit.wikimedia.org/r/293782

Change 293782 merged by jenkins-bot:
Use master DB when touching the user to signal rename end

https://gerrit.wikimedia.org/r/293782

Ok, I unstuck the following 16 renames that seem to have already been completed some time ago:

+--------------------------+-----------------------------+---------------+------------+
| ru_oldname               | ru_newname                  | ru_wiki       | ru_status  |
+--------------------------+-----------------------------+---------------+------------+
| Amor Verlag              | Amor Verlag GmbH            | mediawikiwiki | inprogress |
| Defindia                 | ShaifaliC                   | mediawikiwiki | inprogress |
| Dr Dinabandhu sahoo      | Dinabandhu sahoo            | mediawikiwiki | inprogress |
| GarrattPublishing        | Garratt.P                   | mediawikiwiki | inprogress |
| IngQuim                  | Nterikxouagg                | mediawikiwiki | inprogress |
| Intotherei               | Renamed user g82jgk209f31da | mediawikiwiki | inprogress |
| Miladghaznavi            | Babak Kavian                | mediawikiwiki | inprogress |
| Miskuzmidaku             | ETaraku                     | mediawikiwiki | inprogress |
| Offshore Heavy Transport | Jwilskow                    | mediawikiwiki | inprogress |
| Onjalola                 | Paris1895                   | mediawikiwiki | inprogress |
| Piello                   | Pino995                     | mediawikiwiki | inprogress |
| Poriya78                 | Poriyamh                    | mediawikiwiki | inprogress |
| Radenmonicasilvia        | Raden Monica Silvia         | mediawikiwiki | inprogress |
| Ralzune                  | Androiduser96EU             | mediawikiwiki | inprogress |
| TestRenamepaws           | TestRenameandpaws           | mediawikiwiki | inprogress |
| ささかわ                 | ササカワドクソ              | mediawikiwiki | inprogress |
+--------------------------+-----------------------------+---------------+------------+
16 rows in set (0.00 sec)

Since there haven't been any new ones showing up from last week, this should be done now.

I didn't touch that one because it wasn't the same as all the others.

There are no relevant messages in the logs for that one at all, so it's not related to the original issue (which was caused by CAS errors). I filed T137973: GlobalRename gets stuck sometimes.

exception.log-20160621.gz
2016-06-20 12:09:22 [V2fc6wpAMEoAACeWn8oAAAAO] mw1006 wikidatawiki 1.28.0-wmf.6 exception ERROR: [V2fc6wpAMEoAACeWn8oAAAAO] [no req]   MWException from line 2584 of /srv/mediawiki/php-1.28.0-wmf.6/extensions/CentralAuth/includes/CentralAuthUser.php: CAS update failed on gu_cas_token for user ID '17313928' (read from master); the version of the user to be saved is older than the current version. {"exception_id":"V2fc6wpAMEoAACeWn8oAAAAO"} 
[Exception MWException] (/srv/mediawiki/php-1.28.0-wmf.6/extensions/CentralAuth/includes/CentralAuthUser.php:2584) CAS update failed on gu_cas_token for user ID '17313928' (read from master); the version of the user to be saved is older than the current version.
  #0 /srv/mediawiki/php-1.28.0-wmf.6/extensions/CentralAuth/includes/CentralAuthUser.php(2533): CentralAuthUser->saveSettings()
  #1 /srv/mediawiki/php-1.28.0-wmf.6/extensions/CentralAuth/includes/session/CentralAuthSessionProvider.php(377): CentralAuthUser->resetAuthToken()
  #2 /srv/mediawiki/php-1.28.0-wmf.6/includes/session/SessionManager.php(314): CentralAuthSessionProvider->invalidateSessionsForUser(User)
  #3 /srv/mediawiki/php-1.28.0-wmf.6/extensions/Renameuser/RenameuserSQL.php(200): MediaWiki\Session\SessionManager->invalidateSessionsForUser(User)
  #4 /srv/mediawiki/php-1.28.0-wmf.6/extensions/CentralAuth/includes/LocalRenameJob/LocalRenameUserJob.php(59): RenameuserSQL->rename()
  #5 /srv/mediawiki/php-1.28.0-wmf.6/extensions/CentralAuth/includes/LocalRenameJob/LocalRenameJob.php(27): LocalRenameUserJob->doRun()
  #6 /srv/mediawiki/php-1.28.0-wmf.6/includes/jobqueue/JobRunner.php(265): LocalRenameJob->run()
  #7 /srv/mediawiki/php-1.28.0-wmf.6/includes/jobqueue/JobRunner.php(179): JobRunner->executeJob(LocalRenameUserJob, BufferingStatsdDataFactory, integer)
  #8 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run(array)
  #9 {main}
2016-06-20 12:09:22 [V2fc6wpAMEoAACeWn8oAAAAO] mw1006 wikidatawiki 1.28.0-wmf.6 exception ERROR: [V2fc6wpAMEoAACeWn8oAAAAO] [no req]   MWException from line 2584 of /srv/mediawiki/php-1.28.0-wmf.6/extensions/CentralAuth/includes/CentralAuthUser.php: CAS update failed on gu_cas_token for user ID '17313928' (read from master); the version of the user to be saved is older than the current version. {"exception_id":"V2fc6wpAMEoAACeWn8oAAAAO"} 
[Exception MWException] (/srv/mediawiki/php-1.28.0-wmf.6/extensions/CentralAuth/includes/CentralAuthUser.php:2584) CAS update failed on gu_cas_token for user ID '17313928' (read from master); the version of the user to be saved is older than the current version.
  #0 /srv/mediawiki/php-1.28.0-wmf.6/extensions/CentralAuth/includes/CentralAuthUser.php(2533): CentralAuthUser->saveSettings()
  #1 /srv/mediawiki/php-1.28.0-wmf.6/extensions/CentralAuth/includes/session/CentralAuthTokenSessionProvider.php(155): CentralAuthUser->resetAuthToken()
  #2 /srv/mediawiki/php-1.28.0-wmf.6/includes/session/SessionManager.php(314): CentralAuthTokenSessionProvider->invalidateSessionsForUser(User)
  #3 /srv/mediawiki/php-1.28.0-wmf.6/extensions/Renameuser/RenameuserSQL.php(200): MediaWiki\Session\SessionManager->invalidateSessionsForUser(User)
  #4 /srv/mediawiki/php-1.28.0-wmf.6/extensions/CentralAuth/includes/LocalRenameJob/LocalRenameUserJob.php(59): RenameuserSQL->rename()
  #5 /srv/mediawiki/php-1.28.0-wmf.6/extensions/CentralAuth/includes/LocalRenameJob/LocalRenameJob.php(27): LocalRenameUserJob->doRun()
  #6 /srv/mediawiki/php-1.28.0-wmf.6/includes/jobqueue/JobRunner.php(265): LocalRenameJob->run()
  #7 /srv/mediawiki/php-1.28.0-wmf.6/includes/jobqueue/JobRunner.php(179): JobRunner->executeJob(LocalRenameUserJob, BufferingStatsdDataFactory, integer)
  #8 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run(array)
  #9 {main}

Bunch more in that log file (I grepped for "Rename"), but these two stuck out to me because it looks like the same job was running twice? Or the exception got logged twice?

@Legoktm worthy of blocking rollout of wmf.7? (we're on wmf.6 right now, obviously from those log messages)

This comment was removed by Anomie.

That's two different exceptions, both ending up at the same place from different code paths (look closely at line #1).

runJobs.log-20160621.gz
...
2016-06-20 12:09:21 [V2fc6wpAMEoAACeWn8oAAAAO] mw1002 loginwiki 1.28.0-wmf.6 runJobs DEBUG: LocalRenameUserJob Global_rename_job from=Frosty06306 to=ArrowheadCannibal renamer=K6ka movepages=1 suppressredirects= promotetoglobal= reason=Per [[:w:en:Special:Permalink/726132941|:en:WP:CHUS]] session={...} force= requestId=V2fc6wpAMEoAACeWn8oAAAAO (uuid=49bdf82bf7ca458a97a46b9c32cbd13b,timestamp=1466424556,QueuePartition=rdb1-6381) STARTING 
2016-06-20 12:09:21 [V2fc6wpAMEoAACeWn8oAAAAO] mw1002 loginwiki 1.28.0-wmf.6 runJobs INFO: LocalRenameUserJob Global_rename_job from=Frosty06306 to=ArrowheadCannibal renamer=K6ka movepages=1 suppressredirects= promotetoglobal= reason=Per [[:w:en:Special:Permalink/726132941|:en:WP:CHUS]] session={...} force= requestId=V2fc6wpAMEoAACeWn8oAAAAO (uuid=49bdf82bf7ca458a97a46b9c32cbd13b,timestamp=1466424556,QueuePartition=rdb1-6381) t=120 good 
2016-06-20 12:09:21 [V2fc6wpAMEoAACeWn8oAAAAO] mw1010 enwikiquote 1.28.0-wmf.6 runJobs DEBUG: LocalRenameUserJob Global_rename_job from=Frosty06306 to=ArrowheadCannibal renamer=K6ka movepages=1 suppressredirects= promotetoglobal= reason=Per [[:w:en:Special:Permalink/726132941|:en:WP:CHUS]] session={...} force= requestId=V2fc6wpAMEoAACeWn8oAAAAO (uuid=248df1c4cbec4afd873ce3ec98139781,timestamp=1466424556,QueuePartition=rdb4-6381) STARTING
2016-06-20 12:09:21 [V2fc6wpAMEoAACeWn8oAAAAO] mw1005 specieswiki 1.28.0-wmf.6 runJobs DEBUG: LocalRenameUserJob Global_rename_job from=Frosty06306 to=ArrowheadCannibal renamer=K6ka movepages=1 suppressredirects= promotetoglobal= reason=Per [[:w:en:Special:Permalink/726132941|:en:WP:CHUS]] session={...} force= requestId=V2fc6wpAMEoAACeWn8oAAAAO (uuid=3f6feafc37ed4eb59935fb6dc36ebd4f,timestamp=1466424556,QueuePartition=rdb4-6381) STARTING  
2016-06-20 12:09:21 [V2fc6wpAMEoAACeWn8oAAAAO] mw1015 enwiki 1.28.0-wmf.6 runJobs DEBUG: LocalRenameUserJob Global_rename_job from=Frosty06306 to=ArrowheadCannibal renamer=K6ka movepages=1 suppressredirects= promotetoglobal= reason=Per [[:w:en:Special:Permalink/726132941|:en:WP:CHUS]] session={...} force= requestId=V2fc6wpAMEoAACeWn8oAAAAO (uuid=61fe55ff5ddc49628f238af19fef954f,timestamp=1466424556,QueuePartition=rdb4-6381) STARTING  
2016-06-20 12:09:21 [V2fc6wpAMEoAACeWn8oAAAAO] mw1006 wikidatawiki 1.28.0-wmf.6 runJobs DEBUG: LocalRenameUserJob Global_rename_job from=Frosty06306 to=ArrowheadCannibal renamer=K6ka movepages=1 suppressredirects= promotetoglobal= reason=Per [[:w:en:Special:Permalink/726132941|:en:WP:CHUS]] session={...} force= requestId=V2fc6wpAMEoAACeWn8oAAAAO (uuid=895218a71af84138a4b08a6655db420b,timestamp=1466424556,QueuePartition=rdb4-6381) STARTING  
2016-06-20 12:09:21 [V2fc6wpAMEoAACeWn8oAAAAO] mw1010 enwikiquote 1.28.0-wmf.6 runJobs INFO: LocalRenameUserJob Global_rename_job from=Frosty06306 to=ArrowheadCannibal renamer=K6ka movepages=1 suppressredirects= promotetoglobal= reason=Per [[:w:en:Special:Permalink/726132941|:en:WP:CHUS]] session={...} force= requestId=V2fc6wpAMEoAACeWn8oAAAAO (uuid=248df1c4cbec4afd873ce3ec98139781,timestamp=1466424556,QueuePartition=rdb4-6381) t=351 good  
2016-06-20 12:09:21 [V2fc6wpAMEoAACeWn8oAAAAO] mw1013 incubatorwiki 1.28.0-wmf.6 runJobs DEBUG: LocalRenameUserJob Global_rename_job from=Frosty06306 to=ArrowheadCannibal renamer=K6ka movepages=1 suppressredirects= promotetoglobal= reason=Per [[:w:en:Special:Permalink/726132941|:en:WP:CHUS]] session={...} force= requestId=V2fc6wpAMEoAACeWn8oAAAAO (uuid=ccdea3d3e4604d579fc9eddb95225092,timestamp=1466424556,QueuePartition=rdb1-6381) STARTING  
2016-06-20 12:09:21 [V2fc6wpAMEoAACeWn8oAAAAO] mw1015 enwiki 1.28.0-wmf.6 runJobs INFO: LocalRenameUserJob Global_rename_job from=Frosty06306 to=ArrowheadCannibal renamer=K6ka movepages=1 suppressredirects= promotetoglobal= reason=Per [[:w:en:Special:Permalink/726132941|:en:WP:CHUS]] session={...} force= requestId=V2fc6wpAMEoAACeWn8oAAAAO (uuid=61fe55ff5ddc49628f238af19fef954f,timestamp=1466424556,QueuePartition=rdb4-6381) COMMIT ENQUEUED [186ms of writes]   
...

We've got LocalRenameUserJobs running for the same user on several different jobrunners at once. No wonder there are CAS errors with the token reset.

Yes, and that's been the design of globalrename since the beginning.

@Legoktm worthy of blocking rollout of wmf.7? (we're on wmf.6 right now, obviously from those log messages)

Not really? It's already broken and I assume wmf.7 isn't going to make it any worse.

OTOH, we haven't had CAS checking in CentralAuthUser since the beginning,
or delaying of DB writes all over the place, or probably as heavy caching
of CentralAuthUser data. When globalrename was first written it probably
loaded the data and saved the new version all within a few milliseconds,
had transactions more likely to block other processes, and just overwrote
other processes' changes on a race.

I note that the CAS failure logged here doesn't even prevent the rename (at
this time anyway), it just logs the failure and doesn't save the changed
token which doesn't actually make a difference as long as it works once on
any wiki.

I note that the CAS failure logged here doesn't even prevent the rename (at
this time anyway), it just logs the failure and doesn't save the changed
token which doesn't actually make a difference as long as it works once on
any wiki.

So the exception is just being logged and not thrown? Is there a different exception causing the jobs to actually fail then...?

How do you suggest we move forward?

The rename associated with the log entries in T135656#2396692 does not seem to have failed.

I see we currently have three renames that seem to be stuck: Acee8 → Nuevo Paso (which predates this task, although not the addition of the call to SessionManager::invalidateSessionsForUser()), Hausratte → Ratte, and बिप्लब आनन्द → Biplab Anand. At a glance, it seems that these all got stuck on a different aspect of the race condition: "Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.16.30)" inside CentralAuthUser::saveSettings(). It looks like the jobs only get retried a limited number of times, so if it happens to run into the lock wait each time thanks to the thundering herd of rename jobs all retrying at once, boom.

To move forward, IMO we'll have to figure out some way to deal with the race here.

The two exceptions have the same exception_id. Doesn't that mean they are running in the same batch on the same job runner (and thus sequentially)?

In any case, there is probably no point in every single job invalidating the session; that can be done a single time after the jobs have finished. That would get rid of the CAS warnings but, as Anomie pointed out, probably not fix the real problem.

The two exceptions have the same exception_id. Doesn't that mean they are running in the same batch on the same job runner (and thus sequentially)?

Correct. These two are actually from the same job, although it seems every job on the job runner (even widely separated in time, as long as the script doesn't get restarted) gets the same exception_id.

In any case, there is probably no point in every single job invalidating the session; that can be done a single time after the jobs have finished. That would get rid of the CAS warnings but, as Anomie pointed out, probably not fix the real problem.

We'd need to make sure that the user_token field on each local wiki still got reset. We'd also have to make sure not to break session resetting in the non-CentralAuth case.

What's the point in resetting user_token? Only attached accounts get globally renamed, and we don't care about local tokens in other cases of globally logging out users.

Actually, what's the point in resetting the session in the first place? For authentication systems which are bound to username, the token effectively changes anyway since the token of a different user record (if any) will be resolved for that name. For auth systems which identify the user by id, there is no danger of authenticating the wrong user since the user id does not change.

What's the point in resetting user_token?

So the user gets logged out from any existing non-CentralAuth sessions. If we're intentionally logging people out at all (rather than just as a side effect of session id/name mismatch), we should do a complete job of it.

Actually, what's the point in resetting the session in the first place? For authentication systems which are bound to username, the token effectively changes anyway since the token of a different user record (if any) will be resolved for that name. For auth systems which identify the user by id, there is no danger of authenticating the wrong user since the user id does not change.

T18721 is why the AuthPlugin::getUser()->resetAuthToken() call was added, apparently there was some sort of race where the user could wind up with CA auto-creating a local account under the old name while the rename was in progress. I have no idea if that's still relevant or if it has been fixed in another way since 2008; I note that CA's Special:GlobalRenameUser and all the infrastructure around that wasn't added until 2013 (2af71e7153a).

The AuthManager-specific bugs were fixed and the current issue is tracked in T137973. No point in keeping two UBN tasks open about the same issue.

Can't the mediawiki accounts just be reattached to the newly-renamed global accounts by a steward?

Can't the mediawiki accounts just be reattached to the newly-renamed global accounts by a steward?

Tracked in T140074: Attach Biplab Anand's local accounts. If you know of other users in need of fixing, please comment there.