Page MenuHomePhabricator

Page or file undelete/restore requests can deadlock (Fatal DBQueryError)
Open, MediumPublicPRODUCTION ERROR

Description

Error
2019-05-02 10:19:29 [XMrEMQpAMDwAAGzj1h4AAAAW] mw1225 enwiki 1.34.0-wmf.1 exception ERROR: [XMrEMQpAMDwAAGzj1h4AAAAW] /w/api.php   Wikimedia\Rdbms\DBQueryError

Query: INSERT IGNORE INTO `page` (page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len,page_id) VALUES ('2','Amorymeltzer/sandbox/5','','0','1','0.206380669357','20190502101929','0','0','60635647')
Function: WikiPage::insertOn
Error: 1213 Deadlock found when trying to get lock; try restarting transaction (10.64.32.64)
trace
#0 /srv/mediawiki/php-1.34.0-wmf.1/includes/libs/rdbms/database/Database.php(1556): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, integer, string, string)
#1 /srv/mediawiki/php-1.34.0-wmf.1/includes/libs/rdbms/database/Database.php(1274): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.34.0-wmf.1/includes/libs/rdbms/database/Database.php(2149): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.34.0-wmf.1/includes/page/WikiPage.php(1345): Wikimedia\Rdbms\Database->insert(string, array, string, string)
#4 /srv/mediawiki/php-1.34.0-wmf.1/includes/page/PageArchive.php(735): WikiPage->insertOn(Wikimedia\Rdbms\DatabaseMysqli, string)
#5 /srv/mediawiki/php-1.34.0-wmf.1/includes/page/PageArchive.php(506): PageArchive->undeleteRevisions(array, boolean, string)
#6 /srv/mediawiki/php-1.34.0-wmf.1/includes/api/ApiUndelete.php(74): PageArchive->undelete(array, string, NULL, boolean, User, NULL)
#7 /srv/mediawiki/php-1.34.0-wmf.1/includes/api/ApiMain.php(1593): ApiUndelete->execute()
#8 /srv/mediawiki/php-1.34.0-wmf.1/includes/api/ApiMain.php(531): ApiMain->executeAction()
#9 /srv/mediawiki/php-1.34.0-wmf.1/includes/api/ApiMain.php(502): ApiMain->executeActionWithErrorHandling()
#10 /srv/mediawiki/php-1.34.0-wmf.1/api.php(87): ApiMain->execute()
#11 /srv/mediawiki/w/api.php(3): require(string)
#12 {main}
Impact

User actions to restore pages or files are sometimes aborted.

There appears to be problematic lock or database query that can end up depending on another instance of itself in both directions, ergo, a deadlock.

When this happens, the user receives a generic system error page, and the spike of application crashes may end up aborting an unrelated on-going deployment or fire an Icgina monitoring alert.

Notes

(I'm not smart enough to know for a fact that this hasn't already been reported; there are a number of similar tickets, but those that are not closed appear to be somewhat different.)

Basically, attempting a bunch of action=undelete queries can result in multiple internal_api_error_DBQueryError errors, reported as Wikimedia\Rdbms\DBQueryError and a varying string like XMrEMQpAMDwAAGzj1h4AAAAW or XMtbogpAAEwAAA7kBZ0AAACO.

Sometimes it works fine, and sometimes 20%-80% of the queries fail, even after attempting them all twice. Has been going on for at least a few months, possibly much longer. This is using the batchundelete module from Twinkle on enwiki (or testwiki). The same code is used for batchdelete and doesn't result in the same errors.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 2 2019, 9:08 PM
Reedy updated the task description. (Show Details)
Anomie moved this task from Unsorted to Non-core-API stuff on the MediaWiki-API board.
Anomie added a subscriber: Anomie.

I don't think this is likely to be an issue in the API itself, but instead in the deletion code used by the API (and the normal web deletion action as well). Although my best guess is that it's going to get into tricky details of MySQL/MariaDB gap locking, which I can't say I'm very familiar with.

If Twinkle is sending its action=delete requests in parallel, you might want to see if sending them serially (i.e. waiting for each one to complete before sending the next) avoids the issue. If it's already working serially, adding a short delay between retries might help.

@Anomie It's done in parallel, but the issue here arises specifically with undelete, not with delete, even though both are handled in the same way. That's what raised my eyebrows, since I wouldn't expect delete and undelete to behave/react differently.

Anomie added a comment.May 8 2019, 3:31 PM

Err, yeah, I meant "action=undelete requests".

As I said, I suspect it's MariaDB gap locking when reinserting the page rows. There isn't an actual row with page_id=60635647, so it locks the "gap" between the previous and next rows that do exist when doing a SELECT FOR UPDATE or an INSERT. Multiple SELECT gap-locks can exist on the same gap without conflicting, so if two processes do that on the same gap neither will block the other. But once they get to the INSERT, process A's gap lock will block process B and process B's gap lock will block process A, which is the deadlock.

@Anomie Thanks for that explanation. If I understand correctly, this issue means that in general we cannot reliably let multiple admins on the same wiki use the "page restoration" and "file undelete" features (e.g. the issue is not specific to a single user performing actions in multiple tabs).

E.g. on Commons when admins go through a backlog of requests, it's not uncommon to handle a few at the same time, in quick succession, or in collaboration with other admins. (<= Keywords for Phab search)

Krinkle renamed this task from Bulk API undeletion results in database query error to Multiple page or file undelete/restore requests around the same time can deadlock (Fatal DBQueryError).May 30 2019, 12:20 AM
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:07 PM
Restricted Application added a project: Platform Engineering. · View Herald TranscriptAug 28 2019, 11:07 PM

@Anomie Thanks for that explanation. If I understand correctly, this issue means that in general we cannot reliably let multiple admins on the same wiki use the "page restoration" and "file undelete" features (e.g. the issue is not specific to a single user performing actions in multiple tabs).

I'd say yes and no. Parallel in human terms (be that one admin in multiple tabs or multiple admins at the same time) is quite different from parallel in machine terms. I think admins can reliably do such actions in parallel, but they might experience a random failure here and there (I think we talking here in terms of once every 10000+ actions). When parallel requests are issued for undeletes, it's a different story, since the chances of these requests landing on the appservers in such successions so that multiple requests are started before one is completed are pretty high.

I concur with @Anomie that Twikle should issue requests consecutively, at least for undelete and insert requests. Operationally, we can't take a different approach but to fill the gaps as we run the risk of running out of IDs more quickly that way. Conversely, having stable page IDs would also mitigate this problem.

Anomie added a comment.Sep 3 2019, 1:59 PM

Conversely, having stable page IDs would also mitigate this problem.

Having stable page IDs wouldn't mitigate the problem, unless it also meant we never delete rows from the page table in the first place.

Now that I take a closer look at the code, it looks like it may be a gap lock on the name_title index rather than on the PK, since that's what's being used with FOR UPDATE (PageArchive.php lines 507–513). Either way, the issue only arises when undeletions are targeting the same gap in the index.

eprodromou triaged this task as Medium priority.Sep 11 2019, 5:52 PM
AMooney changed the task status from Open to Stalled.Mar 12 2020, 1:21 PM
WDoranWMF moved this task from Inbox to Icebox on the Platform Engineering board.Mar 12 2020, 1:31 PM

Still seen in production logs. Recent sample:

Error
Query: INSERT IGNORE INTO `page` (page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len,page_id) VALUES (14,'…','',0,1,'0.362548733238','20200519190447',0,0,'51026936')
Function: WikiPage::insertOn
Error: 1213 Deadlock found when trying to get lock; try restarting transaction (10.64.48.124)
trace
0 /srv/mediawiki/php-1.35.0-wmf.32/includes/libs/rdbms/database/Database.php(1663): Wikimedia\Rdbms\Database->getQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.35.0-wmf.32/includes/libs/rdbms/database/Database.php(1640): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, integer, string, string)
#2 /srv/mediawiki/php-1.35.0-wmf.32/includes/libs/rdbms/database/Database.php(1215): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#3 /srv/mediawiki/php-1.35.0-wmf.32/includes/libs/rdbms/database/Database.php(2319): Wikimedia\Rdbms\Database->query(string, string, integer)
#4 /srv/mediawiki/php-1.35.0-wmf.32/includes/libs/rdbms/database/Database.php(2281): Wikimedia\Rdbms\Database->doInsertNonConflicting(string, array, string)
#5 /srv/mediawiki/php-1.35.0-wmf.32/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->insert(string, array, string, array)
#6 /srv/mediawiki/php-1.35.0-wmf.32/includes/libs/rdbms/database/DBConnRef.php(369): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#7 /srv/mediawiki/php-1.35.0-wmf.32/includes/page/WikiPage.php(1361): Wikimedia\Rdbms\DBConnRef->insert(string, array, string, array)
#8 /srv/mediawiki/php-1.35.0-wmf.32/includes/page/PageArchive.php(745): WikiPage->insertOn(Wikimedia\Rdbms\MaintainableDBConnRef, string)
#9 /srv/mediawiki/php-1.35.0-wmf.32/includes/page/PageArchive.php(521): PageArchive->undeleteRevisions(array, boolean, string)
#10 /srv/mediawiki/php-1.35.0-wmf.32/includes/api/ApiUndelete.php(74): PageArchive->undeleteAsUser(array, User, string, NULL, boolean, NULL)
#11 /srv/mediawiki/php-1.35.0-wmf.32/includes/api/ApiMain.php(1580): ApiUndelete->execute()
#12 /srv/mediawiki/php-1.35.0-wmf.32/includes/api/ApiMain.php(523): ApiMain->executeAction()
#13 /srv/mediawiki/php-1.35.0-wmf.32/includes/api/ApiMain.php(494): ApiMain->executeActionWithErrorHandling()
#14 /srv/mediawiki/php-1.35.0-wmf.32/api.php(84): ApiMain->execute()
#15 /srv/mediawiki/w/api.php(3): require(string)
#16 {main}
Krinkle renamed this task from Multiple page or file undelete/restore requests around the same time can deadlock (Fatal DBQueryError) to Page or file undelete/restore requests can deadlock (Fatal DBQueryError).May 27 2020, 7:10 PM
Krinkle updated the task description. (Show Details)
Krinkle moved this task from Icebox to Inbox on the Platform Engineering board.May 27 2020, 9:39 PM
Krinkle changed the task status from Stalled to Open.May 28 2020, 5:37 PM

Change 611420 had a related patch set uploaded (by Umherirrender; owner: Umherirrender):
[mediawiki/core@master] Avoid gap locking in PageArchive::undeleteRevisions

https://gerrit.wikimedia.org/r/611420

Change 611420 merged by jenkins-bot:
[mediawiki/core@master] Avoid gap locking in PageArchive::undeleteRevisions

https://gerrit.wikimedia.org/r/611420

Aklapper removed a subscriber: Anomie.Oct 16 2020, 5:02 PM

Please checks log, if this is still an issue or if the merged patch set helps. Thanks.

Thanks for the bump @Umherirrender, I've been meaning to comment that yes, I haven't run into it at all and even after removing my safeguards, don't trigger it anymore so seems great on my end AFAICT.

Thanks for the bump @Umherirrender, I've been meaning to comment that yes, I haven't run into it at all and even after removing my safeguards, don't trigger it anymore so seems great on my end AFAICT.

That sounds very good, hopefully the logs also shows this good result.


In this task was sometimes written that things are happen in parallel, so just linking:
https://www.mediawiki.org/wiki/API:Etiquette#Request_limit

Making your requests in series rather than in parallel, by waiting for one request to finish before sending a new request [...]