Page MenuHomePhabricator

Error "Lock wait timeout exceeded" from User::loadFromDatabase (via API action=options)
Closed, ResolvedPublic

Description

We've been having a constant stream of the following errors happening mostly on Commons:

message
Error 1205: Lock wait timeout exceeded; try restarting transaction (10.64.48.124)
Function: User::loadFromDatabase
Query: SELECT  user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount,user_actor.actor_id  FROM `user` JOIN `actor` `user_actor` ON ((user_actor.actor_user = user_id))   WHERE user_id = xxx  LIMIT x   FOR UPDATE
referrer: https://commons.wikimedia.org/wiki/Special:ImportFile

exception.file: /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/Database.php:1719

exception.trace
from /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/Database.php(1719)
#0 /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/Database.php(1697): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, integer, string, string)
#1 /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/Database.php(1260): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/Database.php(1948): Wikimedia\Rdbms\Database->query(string, string, integer)
#3 /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/Database.php(2048): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#4 /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->selectRow(array, array, array, string, array, array)
#5 /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/DBConnRef.php(331): Wikimedia\Rdbms\DBConnRef->__call(string, array)

Event Timeline

Krinkle moved this task from Untriaged to Apr 2021 on the Wikimedia-production-error board.
Krinkle subscribed.

Of the 5,400 Lock wait timeout exceeded errors in the last week from function User::loadFromDatabase, only 3 have referer Special:ImportFile. The referer can be useful, but in most cases it's the url end trace endpoint that matters more. This code isn't part of Special:ImportFile and isn't triggered during that request, but rather during an API request made by JS code on Special:ImportFile.

The most important part of the stack trace is after the rdbms internals. In this case, it's ApiOptions.php (API action=options).

exception.trace
from /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/Database.php(1738)
#0 /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/Database.php(1722): Wikimedia\Rdbms\Database->getQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/Database.php(1697): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, integer, string, string)
#2 /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/Database.php(1260): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#3 /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/Database.php(1948): Wikimedia\Rdbms\Database->query(string, string, integer)
#4 /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/Database.php(2048): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#5 /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->selectRow(array, array, array, string, array, array)
#6 /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/database/DBConnRef.php(331): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#7 /srv/mediawiki/php-1.37.0-wmf.1/includes/user/User.php(1335): Wikimedia\Rdbms\DBConnRef->selectRow(array, array, array, string, array, array)
#8 /srv/mediawiki/php-1.37.0-wmf.1/includes/user/User.php(458): User->loadFromDatabase(integer)
#9 /srv/mediawiki/php-1.37.0-wmf.1/includes/user/User.php(4484): User->loadFromId(integer)
#10 /srv/mediawiki/php-1.37.0-wmf.1/includes/api/ApiOptions.php(165): User->getInstanceForUpdate()
#11 /srv/mediawiki/php-1.37.0-wmf.1/includes/api/ApiOptions.php(40): ApiOptions->getUserForUpdates()
#12 /srv/mediawiki/php-1.37.0-wmf.1/includes/api/ApiMain.php(1647): ApiOptions->execute()
#13 /srv/mediawiki/php-1.37.0-wmf.1/includes/api/ApiMain.php(617): ApiMain->executeAction()
#14 /srv/mediawiki/php-1.37.0-wmf.1/includes/api/ApiMain.php(588): ApiMain->executeActionWithErrorHandling()
#15 /srv/mediawiki/php-1.37.0-wmf.1/api.php(90): ApiMain->execute()
#16 /srv/mediawiki/php-1.37.0-wmf.1/api.php(45): wfApiMain()
#17 /srv/mediawiki/w/api.php(3): require(string)
#18 {main}

reqId: 7bc7adce-2e19-4fe8-9c92-5f6594f14ef2

Krinkle renamed this task from Error 1205 from User::loadFromDatabase, Lock wait timeout exceeded; try restarting transaction on Special:ImportFile to Error "Lock wait timeout exceeded" from User::loadFromDatabase (via API action=options).Apr 15 2021, 7:46 PM
Krinkle updated the task description. (Show Details)
Krinkle triaged this task as High priority.Apr 15 2021, 7:49 PM

Especially significant is that this error used to be rare but is at 1000x per day as of last week.

cap.png (875×2 px, 243 KB)

It's interesting to see that there are so many JS callers of mw.Api().saveOption(). It reminds me of T128602. I think preferences could be redesigned around this use case -- it was originally imagined as mostly being set in bulk via Special:Preferences, rather than being a generic user-linked data store.

Nothing in User::saveSettings() really makes sense for this use case. UserOptionsManager::saveOptions() doesn't make much sense either. If nothing is changed (presumably the typical case since there is no special handling for unchanged options in JS callers), the user row is still locked and updated. UserOptionsManager eventually figures out there is nothing to do, after loading all user options twice: once in loadUserOptions() and again in saveOptions(), but all this is done with the user row locked.

The error rate for this plummeted back to its former low and rare ocurrance rate. The drop was sudden and happened around Sunday 2021-04-18 00:00.

Prior to that the lock timeouts were almost entirely from Commons:

Screenshot 2021-04-22 at 02.34.36.png (622×1 px, 35 KB)

If the reason for these timeouts is not a slow query, but rather slow PHP code while the transaction is held open, then this might be a side-effect of: T278650: Loading or saving preferences is taking too long on Commons.

It could still be considered a bug in its own right in that we ideally wouldn't be holding the transaction over as much code as long as we currently do (if that's what's happening).

[…] The drop was sudden and happened around Sunday 2021-04-18 00:00. […] the lock timeouts were almost entirely from Commons:

Screenshot 2021-04-22 at 02.34.36.png (622×1 px, 35 KB)

If […], then this might be a side-effect of: T278650: Loading or saving preferences is taking too long on Commons.

The drop aligns well with this edit to Template:Gadget-state, at 2021-04-18 00:15 UTC, which is an edit of mine, in which I reverted Lua code back out of this template. Details about that template at T278650#7012163.

The drop can also be seen in the ApiOptions p99 request service time, which dropped from ~25 seconds to ~5 seconds. The median service time is about 700ms, and for pretty much all of that time, the user row is locked, since User::getInstanceForUpdate() is the first thing the module does, and User::saveSettings() is the last thing it does before output.

ApiOptions time distribution.png (347×811 px, 49 KB)

I was scratching my head about T263520 trying to figure out how the user_editcount update could be causing lock contention, since it's done in autocommit mode. But maybe User::saveSettings() was the problem all along. Calling it from eval.php it takes 200-600ms.

My ideas are:

  • Introduce "dirty" flags to the User object. For example, if the email address is updated, you would do $this->mDirty['email'] = true. Then in User::saveSettings(), only update dirty data. UserOptionsManager::saveOptions() would only be called if the user options were updated.
  • Do the same in UserOptionsManager. Record dirty options in setOption() and then do not write back unmodified options.
  • Consider a setOption() with a value which does not differ from the cached value to not make the value dirty.
  • Get rid of the lock for update in ApiOptions since it apparently exists to avoid writing back stale data read by UserOptionsManager::loadUserOptions().

T278650#7130951 (just deployed) cut the median to half and p99 to one fifth. Can we check if it continues to happen?

Change 699466 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[mediawiki/core@master] Improvements to user prefereces fetching/saving

https://gerrit.wikimedia.org/r/699466

Change 700722 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] In UserOptionsManager::saveOptions(), don't update unchanged rows

https://gerrit.wikimedia.org/r/700722

Change 699466 merged by jenkins-bot:

[mediawiki/core@master] Improvements to user preferences fetching/saving

https://gerrit.wikimedia.org/r/699466

Change 700722 merged by jenkins-bot:

[mediawiki/core@master] In UserOptionsManager::saveOptions(), don't update unchanged rows

https://gerrit.wikimedia.org/r/700722

Pchelolo claimed this task.
Pchelolo subscribed.

Last one of these happened on Jun 25, I think this could be marked as resolved. Please reopen if it starts happening agin.