Page MenuHomePhabricator

"Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate)
Closed, ResolvedPublicPRODUCTION ERROR

Description

Lately we had some amount of contention due to fast edits by a user (around 1 per second) while executing the following sql:

UPDATE  `user` SET user_editcount=user_editcount+1 WHERE user_id = '<user_id>' AND (user_editcount IS NOT NULL)

For example:

{
  "_index": "logstash-2018.08.24",
  "_type": "mediawiki",
  "_id": "AWVrBGaToOODFPKvBW2D",
  "_version": 1,
  "_score": null,
  "_source": {
    "server": "commons.wikimedia.org",
    "db_server": "10.64.48.23",
    "wiki": "commonswiki",
    "channel": "DBQuery",
    "type": "mediawiki",
    "error": "Lock wait timeout exceeded; try restarting transaction (10.64.48.23)",
    "http_method": "POST",
    "@version": 1,
    "host": "mw1221",
    "shard": "s4",
    "sql1line": "UPDATE  `user` SET user_editcount=user_editcount+1 WHERE user_id = '1779721' AND (user_editcount IS NOT NULL)",
    "fname": "User::incEditCountImmediate",
    "errno": 1205,
    "unique_id": "W3@-gwpAMDgAAH1gJiIAAACQ",
    "method": "Wikimedia\\Rdbms\\Database::makeQueryException",
    "level": "ERROR",
    "ip": "10.64.32.69",
    "mwversion": "1.32.0-wmf.18",
    "message": "User::incEditCountImmediate\t10.64.48.23\t1205\tLock wait timeout exceeded; try restarting transaction (10.64.48.23)\tUPDATE  `user` SET user_editcount=user_editcount+1 WHERE user_id = '1779721' AND (user_editcount IS NOT NULL)",
    "normalized_message": "{fname}\t{db_server}\t{errno}\t{error}\t{sql1line}",
    "url": "/w/api.php",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "reqId": "W3@-gwpAMDgAAH1gJiIAAACQ",
    "referrer": "https://commons.wikimedia.org/w/index.php?title=Special:Search&limit=500&offset=0&profile=default&search=Krishna+Tirath&searchToken=5its02a4fpzskfv4z7pga8smd",
    "@timestamp": "2018-08-24T08:19:35.000Z",
    "db_name": "commonswiki",
    "db_user": "wikiuser"
  },
  "fields": {
    "@timestamp": [
      1535098775000
    ]
  },
  "sort": [
    1535098775000
  ]
}

While I don't see a lot of problems with this- UPDATE inside the transaction needs to lock the user row to update consistently the row count on write, the rate of edits is around 1 per second, and we have seen faster bots in the past doing a larger amount of edits, so I am worried about the limits this causes on edit rate in general. Couldn't the count be done asynchronously?

Event Timeline

jcrespo created this task.
Krinkle raised the priority of this task from Low to Needs Triage.Aug 29 2018, 8:28 PM
Krinkle moved this task from Untriaged to Usage problem on the MediaWiki-libs-Rdbms board.
aaron renamed this task from Lock wait timeout exceeded when doing fast edits due to articule edit count locking to Lock wait timeout exceeded when doing fast edits due to article edit count locking.Aug 29 2018, 8:56 PM

All calls to incEditCountImmediate currently move it to the end of the transaction. According to logstash << +channel:DBPerformance +"user_editcount=user_editcount+N" +"sub-optimal" >> it seems to usually be very fast. Though I see occasional entries a little over 1 second. I suppose in that case, a fast enough edit rate by a single use could make a pile-up. I wonder if the delay comes from COMMIT itself?

Krinkle renamed this task from Lock wait timeout exceeded when doing fast edits due to article edit count locking to "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).Sep 12 2018, 4:24 PM

There was a burst of 330 of these shortly after the switch-over today. Example:

/w/api.php
Query: UPDATE  `user` SET user_editcount=user_editcount+1 WHERE user_id = ##### AND (user_editcount IS NOT NULL)
Function: User::incEditCountImmediate
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.192.0.39)

#0 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1428): Wikimedia\Rdbms\Database->makeQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1198): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(2074): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.32.0-wmf.20/includes/user/User.php(5330): Wikimedia\Rdbms\Database->update(string, array, array, string)
#4 /srv/mediawiki/php-1.32.0-wmf.20/includes/user/User.php(5307): User->incEditCountImmediate()
#5 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(3509): Closure$User::incEditCount(Wikimedia\Rdbms\DatabaseMysqli)
#6 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1316): Wikimedia\Rdbms\Database->runOnTransactionPreCommitCallbacks()
#7 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1774): Closure$Wikimedia\Rdbms\LoadBalancer::finalizeMasterChanges(Wikimedia\Rdbms\DatabaseMysqli)
#8 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1317): Wikimedia\Rdbms\LoadBalancer->forEachOpenMasterConnection(Closure$Wikimedia\Rdbms\LoadBalancer::finalizeMasterChanges;429)
#9 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/lbfactory/LBFactory.php(265): Wikimedia\Rdbms\LoadBalancer->finalizeMasterChanges()
#10 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(425): Closure$Wikimedia\Rdbms\LBFactory::commitMasterChanges(Wikimedia\Rdbms\LoadBalancer)
#11 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/lbfactory/LBFactory.php(266): Wikimedia\Rdbms\LBFactoryMulti->forEachLB(Closure$Wikimedia\Rdbms\LBFactory::commitMasterChanges;167)
#12 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(603): Wikimedia\Rdbms\LBFactory->commitMasterChanges(string, array)
#13 /srv/mediawiki/php-1.32.0-wmf.20/includes/api/ApiMain.php(548): MediaWiki::preOutputCommit(DerivativeContext)
#14 /srv/mediawiki/php-1.32.0-wmf.20/includes/api/ApiMain.php(502): ApiMain->executeActionWithErrorHandling()
#15 /srv/mediawiki/php-1.32.0-wmf.20/api.php(87): ApiMain->execute()
/w/index.php
Query: UPDATE  `user` SET user_editcount=user_editcount+1 WHERE user_id = ### AND (user_editcount IS NOT NULL)
Function: User::incEditCountImmediate
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.192.0.39)

#0 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1428): Wikimedia\Rdbms\Database->makeQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1198): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(2074): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.32.0-wmf.20/includes/user/User.php(5330): Wikimedia\Rdbms\Database->update(string, array, array, string)
#4 /srv/mediawiki/php-1.32.0-wmf.20/includes/user/User.php(5307): User->incEditCountImmediate()
#5 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(3509): Closure$User::incEditCount(Wikimedia\Rdbms\DatabaseMysqli)
#6 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1316): Wikimedia\Rdbms\Database->runOnTransactionPreCommitCallbacks()
#7 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1774): Closure$Wikimedia\Rdbms\LoadBalancer::finalizeMasterChanges(Wikimedia\Rdbms\DatabaseMysqli)
#8 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1317): Wikimedia\Rdbms\LoadBalancer->forEachOpenMasterConnection(Closure$Wikimedia\Rdbms\LoadBalancer::finalizeMasterChanges;426)
#9 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/lbfactory/LBFactory.php(265): Wikimedia\Rdbms\LoadBalancer->finalizeMasterChanges()
#10 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(425): Closure$Wikimedia\Rdbms\LBFactory::commitMasterChanges(Wikimedia\Rdbms\LoadBalancer)
#11 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/lbfactory/LBFactory.php(266): Wikimedia\Rdbms\LBFactoryMulti->forEachLB(Closure$Wikimedia\Rdbms\LBFactory::commitMasterChanges;167)
#12 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(603): Wikimedia\Rdbms\LBFactory->commitMasterChanges(string, array)
#13 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(576): MediaWiki::preOutputCommit(RequestContext, Closure$MediaWiki::main;344)
#14 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(884): MediaWiki->doPreOutputCommit(Closure$MediaWiki::main;344)
#15 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(525): MediaWiki->main()
#16 /srv/mediawiki/php-1.32.0-wmf.20/index.php(42): MediaWiki->run()

Note these had increased before the switch, but the switch due to performance impact would make those even more frequent.

I don't see many of these in the logs for the last 7 days. This is likely caused by editing in parallel (multiple rollback tabs at once).

I don't see many of these in the logs for the last 7 days. This is likely caused by editing in parallel (multiple rollback tabs at once).

Yeah, this particular one is rare, but not absent. Together with a dozen or so other lock-related issues, it's a fair amount of noise together. Perhaps they're unrelated, but I'd like to better understand this one.

What would it take to make this impossible in the code, and what is currently causing it?

  • If I understand correctly, the lock is implicit as part of the UPDATE query for the given user's row. There is no explicit lock, is that right?
  • I assume this isn't a deadlock where two edit requests are blocked on each other due to strange interleaving, but rather one is waiting for the next, and so on. If that's the case, then I assume it times out because the long is held for too long, longer than just the update query. Is that right?

Yes and yes. I think if COMMIT takes a few seconds, then even with this UPDATE near the transaction end, multiple writes can still pile up if enough tabs are opened or other things locking user rows are going on.

CAS errors on user might also help pinpoint some causes.

@aaron and @Krinkle are going to take a swing at figuring out why this is happening before deciding whether to tackle right away.

aaron triaged this task as Medium priority.
aaron moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.

Change 469159 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Move user_editcount updates to a mergeable post-send update

https://gerrit.wikimedia.org/r/469159

I believe this affects mostly commonswiki, it regularly shows:

UPDATE /* User::incEditCountImmediate */ `user` SET user_editcount=user_editcount+1 WHERE user_id = '2*****' AND (user_editcount IS NOT NULL) /* 365d5c1226605c788c13ded522249766 db1068 commonswiki 2s */

as the slow query, and the query is well indexed, so clearly is contention. I would be ok with doing nothing and declaring this as a non supported pattern- although it would be nice to check why this happens and that we are not doing very large transactions or the contention is somewhere else (e.g. new normalized schema due to MCR) that will bite us in the future.

COMMIT takes a few seconds

It is unlikely for a commit to take a lot, as in, the actual physical changes that happen then- at commit time there is only a metadata change + flush to disk cache, which should be very fast. Look for contention somewhere as a first option when you see commit taking a long time (e.g. large transactions blocking each other).

COMMIT takes a few seconds

It is unlikely for a commit to take a lot, as in, the actual physical changes that happen then- at commit time there is only a metadata change + flush to disk cache, which should be very fast. Look for contention somewhere as a first option when you see commit taking a long time (e.g. large transactions blocking each other).

I generally assume COMMIT takes almost no time, but I do sometimes see the query to COMMIT taking a long time. I'm not entirely sure what causes that sometimes. I don't know of any high level contention that causes that.

Change 469159 merged by jenkins-bot:
[mediawiki/core@master] Move user_editcount updates to a mergeable deferred update

https://gerrit.wikimedia.org/r/469159

Change 470063 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.33.0-wmf.1] Move user_editcount updates to a mergeable deferred update

https://gerrit.wikimedia.org/r/470063

Change 470063 merged by jenkins-bot:
[mediawiki/core@wmf/1.33.0-wmf.1] Move user_editcount updates to a mergeable deferred update

https://gerrit.wikimedia.org/r/470063

It looks like the errors come from some tool (JS?) that fires a bunch of API requests from a Special:Search tab to edit numerous pages in parallel. Each burst always for a certain user ID with a single referrer URL.

Change 470169 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Make UserEditCountUpdate faster by using auto-commit mode

https://gerrit.wikimedia.org/r/470169

Change 470169 merged by jenkins-bot:
[mediawiki/core@master] Make UserEditCountUpdate faster by using auto-commit mode

https://gerrit.wikimedia.org/r/470169

Change 470297 had a related patch set uploaded (by Paladox; owner: Aaron Schulz):
[mediawiki/core@REL1_32] Move user_editcount updates to a mergeable deferred update

https://gerrit.wikimedia.org/r/470297

Change 470297 abandoned by Paladox:
Move user_editcount updates to a mergeable deferred update

https://gerrit.wikimedia.org/r/470297

It might be that the cause lays deeper than we thought. The above code changes look good, make a ton of sense, and I believe do eliminate much of the potential for this this kind of problem.

Unfortunately, the scenario currently hitting us in production does not appear to have improved.

capture.png (512×2 px, 40 KB)

Below are the error reports from contributions by user Túrelio (user_id 62513) between 20:00 and 20:15 in Logstash:

  1. 2018-11-14 20:16:22 UPDATE user SET user_editcount=user_editcount+1 ... UserEditCountUpdate ... Error: 1205 Lock wait timeout exceeded
  2. 2018-11-14 20:16:23 UPDATE user SET user_editcount=user_editcount+1 ... UserEditCountUpdate ... Error: 1205 Lock wait timeout exceeded
  3. 2018-11-14 20:16:37 UPDATE user SET user_editcount=user_editcount+1 ... UserEditCountUpdate ... Error: 1205 Lock wait timeout exceeded
  4. 2018-11-14 20:16:38 UPDATE user SET user_editcount=user_editcount+1 ... UserEditCountUpdate ... Error: 1205 Lock wait timeout exceeded
  5. 2018-11-14 20:16:56 UPDATE user SET user_editcount=user_editcount+1 ... UserEditCountUpdate ... Error: 1205 Lock wait timeout exceeded
  6. 2018-11-14 20:16:59 UPDATE user SET user_editcount=user_editcount+1 ... UserEditCountUpdate ... Error: 1205 Lock wait timeout exceeded

This seems to align with the edits at Special:Contributions. However, these aren't dozens or hundreds edits in the same minute with a few failing.

The user made only 6 edits during the 20:16 minute, and 3 edits during the 20:17 minute. I assume the 6 errors correlate to the 6 edits in the same minute given the update happens post-send (it is already biased towards later), As such, it would seem that 6 out of 6 timed out.

I believe our database is able to handle six simple write transactions (changing one field in one row) within the wait timeout (15 seconds per innodb_lock_wait_timeout). The six arrive, the first one goes through, the others wait until they go through, one after the other.

Given that such writes take far less than 1 second (usually <10 ms), and given that all 6 are failing (even the first ones), I suspect it is some sort of deadlock.

I assume the 6 errors correlate to the 6 edits in the same minute given the update happens post-send (it is already biased towards later), As such, it would seem that 6 out of 6 timed out.

From that I would say the locking of the row happens previously, on the same piece of code for another user or more likely at a different piece of code that locks the same rows, and just happens that the editing of the edit count is frequent enough to be only the "sufferer".

We could setup some monitoring so that we identify what is locking the rows when that happens for debugging.

Looking at the logs, the issue (lock wait timeout) I see now is with User::loadFromDatabase (SELECT user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount,user_actor.actor_id FROM user LEFT JOIN actor user_actor ON ((user_actor.actor_user = user_id)) WHERE user_id = 'XXXXXX' LIMIT 1 FOR UPDATE) so maybe this can be closed as resolved?

After all, locks (contention) don't get "solved" it just moves to a lower level :-)

I assume the 6 errors correlate to the 6 edits in the same minute given the update happens post-send (it is already biased towards later), As such, it would seem that 6 out of 6 timed out.

From that I would say the locking of the row happens previously, on the same piece of code for another user or more likely at a different piece of code that locks the same rows, and just happens that the editing of the edit count is frequent enough to be only the "sufferer".

We could setup some monitoring so that we identify what is locking the rows when that happens for debugging.

I agree with this...and suspect something like BetaFeatures or some other user prefs update pattern.

I assume the 6 errors correlate to the 6 edits in the same minute given the update happens post-send (it is already biased towards later), As such, it would seem that 6 out of 6 timed out.

From that I would say the locking of the row happens previously, on the same piece of code for another user or more likely at a different piece of code that locks the same rows, and just happens that the editing of the edit count is frequent enough to be only the "sufferer".

We could setup some monitoring so that we identify what is locking the rows when that happens for debugging.

Sort of: T198755: Log the query that caused a lock timeout

aaron removed aaron as the assignee of this task.Feb 25 2019, 9:41 PM
aaron lowered the priority of this task from Medium to Low.Jun 6 2019, 10:19 AM
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM

The code has moved from User::incEditCountImmediate to User::initEditCountInternal and now runs via a DeferredUpdate (class UserEditCountUpdate).

However, lock timeouts do still happen from time to time (about once or twice a week over past 30 days). There is definitely still a deeper problem here because someone making a small handful of edits around the same time (3 or 4) absolutely should not be reaching a bottleneck for incrementing a simple counter with such a seemingly simple query.

The query is performed without a wrapping transaction. Plain auto-commit without start/commit buffering. It is a single query, that increments an int field in a row selected by its primary key:

UPDATE  `user` SET user_editcount=user_editcount+1 WHERE user_id = '2552146'

Yet, it seems that when this happens from 3 or 4 different servers around the same time for the same user, the query is aborted after several seconds by Lock wait timeout exceeded.

Recent sample from Logstash:

2019-11-21T15:06:11
reqId: Xdan0gpAICoAADKFdu4AAACS`

Deferred update UserEditCountUpdate failed: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: UPDATE  `user` SET user_editcount=user_editcount+1 WHERE user_id = '2552146' AND (user_editcount IS NOT NULL)
Function: UserEditCountUpdate::doUpdate
Error: 1205 Lock wait timeout exceeded; try restarting transaction
#0 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/database/Database.php(1610): Wikimedia\Rdbms\Database->getQueryException
#1 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/database/Database.php(1587): Wikimedia\Rdbms\Database->getQueryExceptionAndLog
#2 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/database/Database.php(1166): Wikimedia\Rdbms\Database->reportQueryError
#3 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/database/Database.php(2217): Wikimedia\Rdbms\Database->query
#4 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->update
#5 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/database/DBConnRef.php(380): Wikimedia\Rdbms\DBConnRef->__call
#6 /srv/mediawiki/php-1.35.0-wmf.5/includes/deferred/UserEditCountUpdate.php(80): Wikimedia\Rdbms\DBConnRef->update
#7 /srv/mediawiki/php-1.35.0-wmf.5/includes/deferred/AutoCommitUpdate.php(46): UserEditCountUpdate->{closure}
#8 /srv/mediawiki/php-1.35.0-wmf.5/includes/deferred/UserEditCountUpdate.php(112): AutoCommitUpdate->doUpdate()
#9 /srv/mediawiki/php-1.35.0-wmf.5/includes/deferred/DeferredUpdates.php(385): UserEditCountUpdate->doUpdate()
#10 /srv/mediawiki/php-1.35.0-wmf.5/includes/deferred/DeferredUpdates.php(283): DeferredUpdates::attemptUpdate
#11 /srv/mediawiki/php-1.35.0-wmf.5/includes/deferred/DeferredUpdates.php(228): DeferredUpdates::run
#12 /srv/mediawiki/php-1.35.0-wmf.5/includes/deferred/DeferredUpdates.php(150): DeferredUpdates::handleUpdateQueue
#13 /srv/mediawiki/php-1.35.0-wmf.5/includes/MediaWiki.php(1053): DeferredUpdates::doUpdates('run')
#14 /srv/mediawiki/php-1.35.0-wmf.5/includes/MediaWiki.php(833): MediaWiki->restInPeace()
#15 /srv/mediawiki/php-1.35.0-wmf.5/includes/MediaWiki.php(842): MediaWiki->{closure}()
#16 /srv/mediawiki/php-1.35.0-wmf.5/includes/MediaWiki.php(569): MediaWiki->doPostOutputShutdown()
#17 /srv/mediawiki/php-1.35.0-wmf.5/index.php

The user in question was making edits via the web interface (not API), and contribs shows 3-4 edits per minute during that time frame.

Krinkle claimed this task.

No longer seen in Logstash when searching for UserEditCountUpdate in the past 7 days.