Page MenuHomePhabricator

SULF renameuser jobs slow on mediawikiwiki and queries timing out
Closed, ResolvedPublic

Description

I'm not really sure what's going on, but some renameuser queries are timing out on mediawikiwiki, causing the rename jobs to fail (they typically succeed the next time the job queue retries them). I haven't seen this on any other wiki yet. Overall they're also running slower compared to other wikis.

Examples:

2015-04-18 19:46:58 mw1006 mediawikiwiki: [e55535e1] /rpc/RunJobs.php?wiki=media
wikiwiki&type=LocalRenameUserJob&maxtime=60&maxmem=300M   DBQueryError from line
 1246 of /srv/mediawiki/php-1.26wmf2/includes/db/Database.php: A database error 
has occurred. Did you forget to run maintenance/update.php after upgrading?  See
: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script
Query: UPDATE  `filearchive` SET fa_user_text = 'John Morawski~mediawikiwiki' WH
ERE fa_user_text = 'John Morawski' AND fa_user = '12243'
Function: RenameuserSQL::rename
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.16.27)
2015-04-18 19:46:42 mw1015 mediawikiwiki: [32b6040b] /rpc/RunJobs.php?wiki=media
wikiwiki&type=LocalRenameUserJob&maxtime=60&maxmem=300M   DBQueryError from line
 1246 of /srv/mediawiki/php-1.26wmf2/includes/db/Database.php: A database error 
has occurred. Did you forget to run maintenance/update.php after upgrading?  See
: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script
Query: UPDATE  `filearchive` SET fa_user_text = 'Jodie~mediawikiwiki' WHERE fa_u
ser_text = 'Jodie' AND fa_user = '14621'
Function: RenameuserSQL::rename
Error: 2013 Lost connection to MySQL server during query (10.64.16.27)
2015-04-18 19:47:41 mw1011 mediawikiwiki: [ae171a63] /rpc/RunJobs.php?wiki=media
wikiwiki&type=LocalRenameUserJob&maxtime=60&maxmem=300M   DBQueryError from line
 1246 of /srv/mediawiki/php-1.26wmf2/includes/db/Database.php: A database error 
has occurred. Did you forget to run maintenance/update.php after upgrading?  See
: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script
Query: UPDATE  `logging` SET log_title = 'Johny~mediawikiwiki' WHERE log_type IN ('block','newusers','rights','renameuser','thanks')  AND log_namespace = '2' AND log_title = 'Johny'
Function: RenameuserSQL::rename
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.16.27)

I can probably find more examples if that is helpful.

Event Timeline

Legoktm raised the priority of this task from to High.
Legoktm updated the task description. (Show Details)
Legoktm added subscribers: Legoktm, aaron, Springle and 3 others.

Change 205070 had a related patch set uploaded (by Legoktm):
Use DB_SLAVE in CentralAuthUser::importLocalNames()

https://gerrit.wikimedia.org/r/205070

From #wikimedia-operations:

[18:26:14] <springle> legoktm: 22K queries/sec on S3 master atm. Not surprised that you saw some lock wait timeouts earlier :)
[18:26:51] <legoktm> springle: uhoh :/ should I slow down?
[18:27:10] <legoktm> right now I'm on arwiki which is s7
[18:27:47] <springle> these are from your jobs? : SELECT /* CentralAuthUser::importLocalNames
[18:28:46] <legoktm> yes
[18:29:40] <springle> if we can throttle that select load back a bit it would help S3 db1038
[18:31:18] <legoktm> are those queries still showing up on s3? I shouldn't be running anything on that db anymore...
[18:32:51] <legoktm> the centralauth db is on db1033 right? not 1038?
[18:33:02] <springle> they are, yes. spiked for a few hours earlier. paused a while, then spiked again 30m ago
[18:33:07] <springle> correct
[18:34:25] <springle> all from jobrunners
[18:36:27] <legoktm> lowering the limit from 85 concurrent renames to 50
[18:38:02] <legoktm> ok, this code is doing a query on every single wiki to see if the user exists :/
[18:38:17] <springle> S3 master load dropping back
[18:38:24] <springle> happier now
[18:38:58] <springle> probably wouldn't have noticed if that SELECT traffic had used a slave
[18:39:41] <legoktm> so the old rate would have been fine if it were using a slave?
[18:39:58] <springle> yep. probably also only an issue for S3

I've seen these lock waits on prior runs. They are probably for users with no actual deleted edits, so the gap endpoints are some adjacent user rows that do. So if similarly named users are being processed, the jobs for them will serialize with INSERT queries.

Even if they did have rows, if it's not a unique index query it would like the adjacent gaps anyway of course.

Anyway, the real problem is that lots of users may have no deleted edits, making the gaps fairly wide I suspect.

So should https://gerrit.wikimedia.org/r/#/c/205070/ get reviewed/merged?
Or does this task slowly get obsolete, as SUL-Finalization is nearly finished?

The majority of renaming on mw.org is done now, so the impact won't be huge. If it's a useful fix in general, we should merge it, but we can wait for @Legoktm to get back from vacation.

Change 205070 merged by jenkins-bot:
Use DB_SLAVE in CentralAuthUser::importLocalNames()

https://gerrit.wikimedia.org/r/205070

Legoktm claimed this task.