Page MenuHomePhabricator

DBError "Error: 1213 Deadlock found when trying to get lock" on WikiPage::doUpdateRestrictions
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error

Request ID: AWhbdKDxzpjgITg6aL2N

message
A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: INSERT  INTO `page_restrictions` (pr_page,pr_type,pr_level,pr_cascade,pr_expiry) VALUES ('xxx','xxx','xxx','xxx','xxx')
Function: WikiPage::doUpdateRestrictions
Error: 1213 Deadlock found when trying to get lock; try restarting transaction
trace
#0 /srv/mediawiki/php-1.33.0-wmf.12/includes/libs/rdbms/database/Database.php(1476): Wikimedia\Rdbms\Database->makeQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.33.0-wmf.12/includes/libs/rdbms/database/Database.php(1236): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.33.0-wmf.12/includes/libs/rdbms/database/Database.php(2068): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.33.0-wmf.12/includes/page/WikiPage.php(2295): Wikimedia\Rdbms\Database->insert(string, array, string)
#4 /srv/mediawiki/php-1.33.0-wmf.12/includes/api/ApiProtect.php(114): WikiPage->doUpdateRestrictions(array, array, boolean, string, User, NULL)
#5 /srv/mediawiki/php-1.33.0-wmf.12/includes/api/ApiMain.php(1596): ApiProtect->execute()
#6 /srv/mediawiki/php-1.33.0-wmf.12/includes/api/ApiMain.php(531): ApiMain->executeAction()
#7 /srv/mediawiki/php-1.33.0-wmf.12/includes/api/ApiMain.php(502): ApiMain->executeActionWithErrorHandling()
#8 /srv/mediawiki/php-1.33.0-wmf.12/api.php(87): ApiMain->execute()
#9 /srv/mediawiki/w/api.php(3): include(string)
#10 {main}

Impact

Unknown, happened 19 times in the last 7 days, 15 of which have been triggered by me.

Notes

I got this error while using a massblock script which also edits and protects user page and talk of the blocked users. Running the script with 15 users returned this error for some of them.

Event Timeline

Daimona renamed this task from SQL error 1213 on WikiPage::doUpdateRestrictions to DBError "Error: 1213 Deadlock found when trying to get lock" on WikiPage::doUpdateRestrictions.Jan 17 2019, 12:50 PM

I could trigger this again via the same script. This leads me to think that the bug is very likely to happen when several protection requests are sent subsequently.

I got this error while using a massblock script which also edits and protects user page and talk of the blocked users. Running the script with 15 users returned this error for some of them.

Can you link to the script?...

@daniel Sure! FWIW, here is the one I mentioned in the task description. However, I can see that it's not the only script causing this problem, and I suspect that all the entries you see on logstash for the last 7 days are instead from this one.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM

Still occasionally seen in prod. One of those hundreds of errors that is sufficiently irregular that I often forget to filter it out or accidentally remove, but than always pops up again unexpectedly and makes one wonder if it's a regression or not, until finding this task to realise it's not.

It sounds like it might be a client problem if so, should/could the Api should catch it in some way and send an appropriate error (so that it isn't reported as an application/internal problem), or to prevent the action from being executed in the first place if it isn't legal / known to fail. Perhaps some kind of early lock is needed, or maybe some existing transaction guard isn't working correctly.

See also T218446 which might be related and/or obsolete the issue.

I thought about this a bit. My impression is that this may be caused by the locks set by DELETE statements issued from WikiPage::doUpdateRestrictions.

WikiPage::doUpdateRestrictions, in a nutshell, does the following when called for some page ID Q with protection type Y:

DELETE FROM page_restrictions WHERE pr_page = Q AND pr_type = X;
DELETE FROM page_restrictions WHERE pr_page = Q AND pr_type = Y;
INSERT INTO page_restrictions (...) VALUES (...)
  • Let Q1 be the highest pr_page value in the page_restrictions table that is smaller than Q.
  • Let Q2 be the lowest pr_page value in the page_restrictions table that is larger than Q.

If the page did not have any protection settings before (no records exist where pr_page = Q), then the DELETEs will set a gap lock for the gap Q1 < pr_page < Q2, preventing any INSERTs into this gap (based on local testing with MySQL). This lock does not, however, prevent other DELETE statement from attempting to delete rows where the pr_page number falls into this gap. Thus, if another thread calls WikiPage::doUpdateRestrictions for a different pr_page value that is also between Q1 and Q2, it leads to a deadlock as neither thread can INSERT into the gap and release the lock.

For local testing, I setup the page_restrictions table with the following test data:

+-------+---------+---------+----------+------------+---------+-----------+
| pr_id | pr_page | pr_type | pr_level | pr_cascade | pr_user | pr_expiry |
+-------+---------+---------+----------+------------+---------+-----------+
|     1 |     100 | move    | sysop    |          0 |    NULL | NULL      |
|     2 |     200 | move    | sysop    |          0 |    NULL | NULL      |
+-------+---------+---------+----------+------------+---------+-----------+

At this point, I was able to reproduce the deadlock with the following statements:

TRX1TRX2
SET autocommit=0SET autocommit=0
DELETE FROM page_restrictions WHERE pr_page = 125 and pr_type = 'move';
DELETE FROM page_restrictions WHERE pr_page = 150 and pr_type = 'move';
INSERT INTO page_restrictions (pr_page,pr_type,pr_level,pr_cascade) VALUES (125,'move','sysop',0);
INSERT INTO page_restrictions (pr_page,pr_type,pr_level,pr_cascade) VALUES (150,'move','sysop',0);
💥Deadlock!

Obviously this is a very artificial environment but perhaps these findings might help uncover the cause of this issue.

Change 539699 had a related patch set uploaded (by TK-999; owner: TK-999):
[mediawiki/core@master] WikiPage: Reduce locking in doUpdateRestrictions()

https://gerrit.wikimedia.org/r/539699

Uploaded a tentative fix as per my hypothesis regarding the root cause.

Change 539699 merged by jenkins-bot:
[mediawiki/core@master] WikiPage: Reduce locking in doUpdateRestrictions()

https://gerrit.wikimedia.org/r/539699

Is this still an issue, or did the November patch resolve the errors?

Searching for message:"Deadlock found when trying to get" on mediawiki-errors in Logstash in the past 7 days finds various functions and queries yielding this type of issue, but no longer from doUpdateRestrictions or elsewhere in WikiPage. LGTM!

Krinkle reassigned this task from Krinkle to TK-999.