Page MenuHomePhabricator

Database contention due to concurrent requests to ApiOptions
Closed, ResolvedPublic

Description

As I mentioned at T202715#4865835 , I believe our main cause of contention at the moment is the following query:

User::loadFromDatabase
Lock wait timeout exceeded; try restarting transaction
SELECT  user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount,user_actor.actor_id  FROM `user` LEFT JOIN `actor` `user_actor` ON ((user_actor.actor_user = user_id))   WHERE user_id = '388680'  LIMIT 1   FOR UPDATE

As detailed in T214963#4918834, this seems to be due to VisualEditor's find-and-replace dialog submitting concurrent requests to the API's action=options endpoint. We also note that the same dialog doesn't avoid submitting useless requests when the user is not logged in.

This seems to have been going on for some time; it has been confirmed in 1.33.0-wmf.3, and the ApiOptions part of T206098 may have been the same cause too.

Event Timeline

jcrespo created this task.Jan 30 2019, 9:12 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 30 2019, 9:12 AM
Nirmos added a subscriber: Nirmos.Jan 30 2019, 3:49 PM

and may get worse as more rows are inserted on actor (?).

FYI: On all wikis except enwiki, the back-population of actor has been complete for a few days. For enwiki, it should be mostly complete.

New rows will continue to be added as new IPs edit and new accounts are registered, of course.

Anomie added a project: VisualEditor.EditedJan 30 2019, 7:30 PM

I did some looking at backtraces for "Lock wait timeout" exceptions from User::loadFromDatabase in wmf.14. There were 2248 such exceptions. 2245 of them had basically the same backtrace:

#0 /srv/mediawiki/php-1.33.0-wmf.14/includes/libs/rdbms/database/Database.php(1476): Wikimedia\Rdbms\Database->makeQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.33.0-wmf.14/includes/libs/rdbms/database/Database.php(1236): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.33.0-wmf.14/includes/libs/rdbms/database/Database.php(1703): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.33.0-wmf.14/includes/libs/rdbms/database/Database.php(1794): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#4 /srv/mediawiki/php-1.33.0-wmf.14/includes/user/User.php(1459): Wikimedia\Rdbms\Database->selectRow(array, array, array, string, array, array)
#5 /srv/mediawiki/php-1.33.0-wmf.14/includes/user/User.php(470): User->loadFromDatabase(integer)
#6 /srv/mediawiki/php-1.33.0-wmf.14/includes/user/User.php(5807): User->loadFromId(integer)
#7 /srv/mediawiki/php-1.33.0-wmf.14/includes/api/ApiOptions.php(150): User->getInstanceForUpdate()
#8 /srv/mediawiki/php-1.33.0-wmf.14/includes/api/ApiOptions.php(39): ApiOptions->getUserForUpdates()
#9 /srv/mediawiki/php-1.33.0-wmf.14/includes/api/ApiMain.php(1596): ApiOptions->execute()
#10 /srv/mediawiki/php-1.33.0-wmf.14/includes/api/ApiMain.php(531): ApiMain->executeAction()
#11 /srv/mediawiki/php-1.33.0-wmf.14/includes/api/ApiMain.php(502): ApiMain->executeActionWithErrorHandling()
#12 /srv/mediawiki/php-1.33.0-wmf.14/api.php(87): ApiMain->execute()
#13 /srv/mediawiki/w/api.php(3): include(string)
#14 {main}

So something hitting the API action=options endpoint. And it turns out that, while there's a bit of a long tail, many of these seem to be due to the same users.

Looking into the 1498 instances of "internal_api_error_DBQueryError" from action=options in https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/ApiAction over the past week-ish, I see a lot of hits that are trying to set options with names prefixed by visualeditor-findAndReplace- (1221 of the 1498 logged errors). Glancing at the code in ve.ui.FindAndReplaceDialog.js, it looks like if someone manages to change the fields rapidly enough it'll be sending a lot of requests to continually update the preferences.

Also it looks like there's no code in VE there to avoid uselessly trying to set options for IP users, or if there is it's broken as in testing while logged-out I saw my browser trying to submit action=options requests.

Anomie added a comment.EditedJan 30 2019, 7:32 PM

BTW, I see basically the same results when looking at backtraces for "Lock wait timeout" exceptions from User::loadFromDatabase in wmf.3 (fortunately still within the retention window on mwlog1001), which was before the actor change was rolled out anywhere. 2393 exceptions logged, 2388 of which look like this:

Query: SELECT  user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount  FROM `user`    WHERE user_id = '...'  LIMIT 1   FOR UPDATE
       #0 /srv/mediawiki/php-1.33.0-wmf.3/includes/libs/rdbms/database/Database.php(1476): Wikimedia\Rdbms\Database->makeQueryException(string, integer, string, string)
       #1 /srv/mediawiki/php-1.33.0-wmf.3/includes/libs/rdbms/database/Database.php(1236): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
       #2 /srv/mediawiki/php-1.33.0-wmf.3/includes/libs/rdbms/database/Database.php(1703): Wikimedia\Rdbms\Database->query(string, string)
       #3 /srv/mediawiki/php-1.33.0-wmf.3/includes/libs/rdbms/database/Database.php(1794): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
       #4 /srv/mediawiki/php-1.33.0-wmf.3/includes/user/User.php(1461): Wikimedia\Rdbms\Database->selectRow(array, array, array, string, array, array)
       #5 /srv/mediawiki/php-1.33.0-wmf.3/includes/user/User.php(477): User->loadFromDatabase(integer)
       #6 /srv/mediawiki/php-1.33.0-wmf.3/includes/user/User.php(5787): User->loadFromId(integer)
       #7 /srv/mediawiki/php-1.33.0-wmf.3/includes/api/ApiOptions.php(150): User->getInstanceForUpdate()
       #8 /srv/mediawiki/php-1.33.0-wmf.3/includes/api/ApiOptions.php(39): ApiOptions->getUserForUpdates()
       #9 /srv/mediawiki/php-1.33.0-wmf.3/includes/api/ApiMain.php(1570): ApiOptions->execute()
       #10 /srv/mediawiki/php-1.33.0-wmf.3/includes/api/ApiMain.php(531): ApiMain->executeAction()
       #11 /srv/mediawiki/php-1.33.0-wmf.3/includes/api/ApiMain.php(502): ApiMain->executeActionWithErrorHandling()
       #12 /srv/mediawiki/php-1.33.0-wmf.3/api.php(87): ApiMain->execute()
       #13 /srv/mediawiki/w/api.php(3): include(string)
       #14 {main}

The Analytics data for November 8–15 shows similar results too.

Please update my summary with a more accurate description as some of my suppositions may be wrong.

Anomie renamed this task from User::loadFromDatabase creates too much contention due to actor locking to Database contention due to concurrent requests to ApiOptions.Jan 31 2019, 6:23 PM
Anomie updated the task description. (Show Details)

Please update my summary with a more accurate description as some of my suppositions may be wrong.

Done

matmarex added a subscriber: matmarex.

Would you say that this is enough to resolve the problem from the VE side? (And also, good general guidance for using action=options?)

  • Do not send action=options API requests for IP users
  • Batch the events for (let's say) 0.5s before sending API requests
  • Wait for the previous action=options API request to finish before sending another
Anomie added a comment.EditedFeb 1 2019, 10:07 PM

Would you say that this is enough to resolve the problem from the VE side? (And also, good general guidance for using action=options?)

  • Do not send action=options API requests for IP users
  • Batch the events for (let's say) 0.5s before sending API requests
  • Wait for the previous action=options API request to finish before sending another

Bullets 1 and 3 are good general guidance. action=options is going to return a "notloggedin" error for IPs, so there's no point in sending the request in the first place. Serializing writes to the same target is also good general practice.

Bullet 2 is probably not very general to action=options, but is general to event handling where you can expect multiple events to come in short succession (e.g. change events that fire each time a user types more text into a text field).

Making these three changes should resolve the issue, although the real test would be deploying the change and seeing whether the DB error stops.

Change 487547 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/core@master] mw.api: Prevent misusing #saveOptions

https://gerrit.wikimedia.org/r/487547

Change 487548 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[VisualEditor/VisualEditor@master] ve.ui.FindAndReplaceDialog: Debounce calls to save user options

https://gerrit.wikimedia.org/r/487548

Change 487548 merged by jenkins-bot:
[VisualEditor/VisualEditor@master] ve.ui.FindAndReplaceDialog: Debounce calls to save user options

https://gerrit.wikimedia.org/r/487548

Then I propose we do 1 and 3 in MW, and 2 in VE. But we can do it all in VE if I am missing any reason it can't be done in MW.

Change 487549 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (3c1c8429d)

https://gerrit.wikimedia.org/r/487549

Change 487553 had a related patch set uploaded (by Jforrester; owner: Bartosz Dziewoński):
[VisualEditor/VisualEditor@wmf/1.33.0-wmf.14] ve.ui.FindAndReplaceDialog: Debounce calls to save user options

https://gerrit.wikimedia.org/r/487553

Change 487553 merged by Jforrester:
[VisualEditor/VisualEditor@wmf/1.33.0-wmf.14] ve.ui.FindAndReplaceDialog: Debounce calls to save user options

https://gerrit.wikimedia.org/r/487553

Change 487554 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[mediawiki/extensions/VisualEditor@wmf/1.33.0-wmf.14] Update VE core submodule to wmf/1.33.0-wmf.14 HEAD (a4a87905)

https://gerrit.wikimedia.org/r/487554

Change 487549 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (3c1c8429d)

https://gerrit.wikimedia.org/r/487549

Change 487560 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[mediawiki/extensions/VisualEditor@master] mw.Platform.setUserConfig: Don't try to set for logged-out users, it doesn't

https://gerrit.wikimedia.org/r/487560

Change 487554 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@wmf/1.33.0-wmf.14] Update VE core submodule to wmf/1.33.0-wmf.14 HEAD (a4a87905)

https://gerrit.wikimedia.org/r/487554

Mentioned in SAL (#wikimedia-operations) [2019-02-02T05:44:20Z] <jforrester@deploy1001> Synchronized php-1.33.0-wmf.14/extensions/VisualEditor/lib/ve/src/ui/dialogs/ve.ui.FindAndReplaceDialog.js: b/src/ui/dialogs/ve.ui.FindAndReplaceDialog.js T214963 Hot-deploy VE fix to stop hitting user pref writes without debounce (duration: 01m 02s)

Jdforrester-WMF triaged this task as High priority.Feb 2 2019, 5:47 AM
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

OK, I've hot-deployed the VE fix to de-bounce this and confirmed it's working in production. This should alleviate the issue for now, and so unburden the DBs, but we'll do the rest before closing this task.

Change 487560 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] mw.Platform.setUserConfig: Don't try to set for logged-out users, it doesn't

https://gerrit.wikimedia.org/r/487560

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/487547 is the last patch pending review. I would welcome opinions on whether this should be done in MediaWiki for all users of the API module, as I'm proposing here, or whether I should move this to VisualEditor's code.

Change 487547 merged by jenkins-bot:
[mediawiki/core@master] mediawiki.api: Prevent misusing #saveOptions

https://gerrit.wikimedia.org/r/487547

matmarex closed this task as Resolved.Feb 18 2019, 8:43 PM
matmarex removed a project: Patch-For-Review.
matmarex claimed this task.
matmarex reopened this task as Open.

(Eh, I guess we should go through the motions if I've put it on our workboard; for anyone else watching it, please consider it resolved)