Page MenuHomePhabricator

Multiple page or file undelete/restore requests around the same time can deadlock (Fatal DBQueryError)
Open, NormalPublic

Description

(I'm not smart enough to know for a fact that this hasn't already been reported; there are a number of similar tickets, but those that are not closed appear to be somewhat different.)

Basically, attempting a bunch of action=undelete queries can result in multiple internal_api_error_DBQueryError errors, reported as Wikimedia\Rdbms\DBQueryError and a varying string like XMrEMQpAMDwAAGzj1h4AAAAW or XMtbogpAAEwAAA7kBZ0AAACO.

Sometimes it works fine, and sometimes 20%-80% of the queries fail, even after attempting them all twice. Has been going on for at least a few months, possibly much longer. This is using the batchundelete module from Twinkle on enwiki (or testwiki). The same code is used for batchdelete and doesn't result in the same errors.

2019-05-02 10:19:29 [XMrEMQpAMDwAAGzj1h4AAAAW] mw1225 enwiki 1.34.0-wmf.1 exception ERROR: [XMrEMQpAMDwAAGzj1h4AAAAW] /w/api.php   Wikimedia\Rdbms\DBQueryError from line 1587 of /srv/mediawiki/php-1.34.0-wmf.1/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: INSERT IGNORE INTO `page` (page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len,page_id) VALUES ('2','Amorymeltzer/sandbox/5','','0','1','0.206380669357','20190502101929','0','0','60635647')
Function: WikiPage::insertOn
Error: 1213 Deadlock found when trying to get lock; try restarting transaction (10.64.32.64)
 {"exception_id":"XMrEMQpAMDwAAGzj1h4AAAAW","exception_url":"/w/api.php","caught_by":"mwe_handler"} 
[Exception Wikimedia\Rdbms\DBQueryError] (/srv/mediawiki/php-1.34.0-wmf.1/includes/libs/rdbms/database/Database.php:1587) A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: INSERT IGNORE INTO `page` (page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len,page_id) VALUES ('2','Amorymeltzer/sandbox/5','','0','1','0.206380669357','20190502101929','0','0','60635647')
Function: WikiPage::insertOn
Error: 1213 Deadlock found when trying to get lock; try restarting transaction (10.64.32.64)

  #0 /srv/mediawiki/php-1.34.0-wmf.1/includes/libs/rdbms/database/Database.php(1556): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, integer, string, string)
  #1 /srv/mediawiki/php-1.34.0-wmf.1/includes/libs/rdbms/database/Database.php(1274): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
  #2 /srv/mediawiki/php-1.34.0-wmf.1/includes/libs/rdbms/database/Database.php(2149): Wikimedia\Rdbms\Database->query(string, string)
  #3 /srv/mediawiki/php-1.34.0-wmf.1/includes/page/WikiPage.php(1345): Wikimedia\Rdbms\Database->insert(string, array, string, string)
  #4 /srv/mediawiki/php-1.34.0-wmf.1/includes/page/PageArchive.php(735): WikiPage->insertOn(Wikimedia\Rdbms\DatabaseMysqli, string)
  #5 /srv/mediawiki/php-1.34.0-wmf.1/includes/page/PageArchive.php(506): PageArchive->undeleteRevisions(array, boolean, string)
  #6 /srv/mediawiki/php-1.34.0-wmf.1/includes/api/ApiUndelete.php(74): PageArchive->undelete(array, string, NULL, boolean, User, NULL)
  #7 /srv/mediawiki/php-1.34.0-wmf.1/includes/api/ApiMain.php(1593): ApiUndelete->execute()
  #8 /srv/mediawiki/php-1.34.0-wmf.1/includes/api/ApiMain.php(531): ApiMain->executeAction()
  #9 /srv/mediawiki/php-1.34.0-wmf.1/includes/api/ApiMain.php(502): ApiMain->executeActionWithErrorHandling()
  #10 /srv/mediawiki/php-1.34.0-wmf.1/api.php(87): ApiMain->execute()
  #11 /srv/mediawiki/w/api.php(3): require(string)
  #12 {main}

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 2 2019, 9:08 PM
Reedy updated the task description. (Show Details)
Anomie moved this task from Unsorted to Non-core-API stuff on the MediaWiki-API board.
Anomie added a subscriber: Anomie.

I don't think this is likely to be an issue in the API itself, but instead in the deletion code used by the API (and the normal web deletion action as well). Although my best guess is that it's going to get into tricky details of MySQL/MariaDB gap locking, which I can't say I'm very familiar with.

If Twinkle is sending its action=delete requests in parallel, you might want to see if sending them serially (i.e. waiting for each one to complete before sending the next) avoids the issue. If it's already working serially, adding a short delay between retries might help.

@Anomie It's done in parallel, but the issue here arises specifically with undelete, not with delete, even though both are handled in the same way. That's what raised my eyebrows, since I wouldn't expect delete and undelete to behave/react differently.

Anomie added a comment.May 8 2019, 3:31 PM

Err, yeah, I meant "action=undelete requests".

As I said, I suspect it's MariaDB gap locking when reinserting the page rows. There isn't an actual row with page_id=60635647, so it locks the "gap" between the previous and next rows that do exist when doing a SELECT FOR UPDATE or an INSERT. Multiple SELECT gap-locks can exist on the same gap without conflicting, so if two processes do that on the same gap neither will block the other. But once they get to the INSERT, process A's gap lock will block process B and process B's gap lock will block process A, which is the deadlock.

@Anomie Thanks for that explanation. If I understand correctly, this issue means that in general we cannot reliably let multiple admins on the same wiki use the "page restoration" and "file undelete" features (e.g. the issue is not specific to a single user performing actions in multiple tabs).

E.g. on Commons when admins go through a backlog of requests, it's not uncommon to handle a few at the same time, in quick succession, or in collaboration with other admins. (<= Keywords for Phab search)

Krinkle renamed this task from Bulk API undeletion results in database query error to Multiple page or file undelete/restore requests around the same time can deadlock (Fatal DBQueryError).May 30 2019, 12:20 AM
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:07 PM
Restricted Application added a project: Core Platform Team. · View Herald TranscriptAug 28 2019, 11:07 PM

@Anomie Thanks for that explanation. If I understand correctly, this issue means that in general we cannot reliably let multiple admins on the same wiki use the "page restoration" and "file undelete" features (e.g. the issue is not specific to a single user performing actions in multiple tabs).

I'd say yes and no. Parallel in human terms (be that one admin in multiple tabs or multiple admins at the same time) is quite different from parallel in machine terms. I think admins can reliably do such actions in parallel, but they might experience a random failure here and there (I think we talking here in terms of once every 10000+ actions). When parallel requests are issued for undeletes, it's a different story, since the chances of these requests landing on the appservers in such successions so that multiple requests are started before one is completed are pretty high.

I concur with @Anomie that Twikle should issue requests consecutively, at least for undelete and insert requests. Operationally, we can't take a different approach but to fill the gaps as we run the risk of running out of IDs more quickly that way. Conversely, having stable page IDs would also mitigate this problem.

Anomie added a comment.Sep 3 2019, 1:59 PM

Conversely, having stable page IDs would also mitigate this problem.

Having stable page IDs wouldn't mitigate the problem, unless it also meant we never delete rows from the page table in the first place.

Now that I take a closer look at the code, it looks like it may be a gap lock on the name_title index rather than on the PK, since that's what's being used with FOR UPDATE (PageArchive.php lines 507–513). Either way, the issue only arises when undeletions are targeting the same gap in the index.

eprodromou triaged this task as Normal priority.Sep 11 2019, 5:52 PM