Page MenuHomePhabricator

CampaignEvents: Avoid deadlocks due to FOR UPDATE queries that may find no records
Closed, ResolvedPublic

Description

I tried to enable concurrency for selenium tests in CampaignEvents, but the patch failed with a deadlock:

mw-error.log
2024-12-05 14:49:58 84f8e934c16e wikidb: [d93ab29109dc05f03a23207a] /rest.php/campaignevents/v0/event_registration   Wikimedia\Rdbms\DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction
Function: MediaWiki\Extension\CampaignEvents\Questions\EventQuestionsStore::replaceEventQuestions
Query: INSERT INTO `ce_event_questions` (ceeq_event_id,ceeq_question_id) VALUES (3,1),(3,2),(3,3),(3,4),(3,5)

#0 /workspace/src/includes/libs/rdbms/database/Database.php(1177): Wikimedia\Rdbms\Database->getQueryException(string, int, string, string)
#1 /workspace/src/includes/libs/rdbms/database/Database.php(1151): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, int, string, string)
#2 /workspace/src/includes/libs/rdbms/database/Database.php(645): Wikimedia\Rdbms\Database->reportQueryError(string, int, string, string, bool)
#3 /workspace/src/includes/libs/rdbms/database/Database.php(1471): Wikimedia\Rdbms\Database->query(Wikimedia\Rdbms\Query, string)
#4 /workspace/src/includes/libs/rdbms/database/DBConnRef.php(127): Wikimedia\Rdbms\Database->insert(string, array, string, array)
#5 /workspace/src/includes/libs/rdbms/database/DBConnRef.php(407): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#6 /workspace/src/includes/libs/rdbms/querybuilder/InsertQueryBuilder.php(343): Wikimedia\Rdbms\DBConnRef->insert(string, array, string, array)
#7 /workspace/src/extensions/CampaignEvents/src/Questions/EventQuestionsStore.php(66): Wikimedia\Rdbms\InsertQueryBuilder->execute()
#8 /workspace/src/extensions/CampaignEvents/src/Event/Store/EventStore.php(452): MediaWiki\Extension\CampaignEvents\Questions\EventQuestionsStore->replaceEventQuestions(int, array)
#9 /workspace/src/extensions/CampaignEvents/src/Event/EditEventCommand.php(212): MediaWiki\Extension\CampaignEvents\Event\Store\EventStore->saveRegistration(MediaWiki\Extension\CampaignEvents\Event\EventRegistration)
#10 /workspace/src/extensions/CampaignEvents/src/Event/EditEventCommand.php(111): MediaWiki\Extension\CampaignEvents\Event\EditEventCommand->doEditUnsafe(MediaWiki\Extension\CampaignEvents\Event\EventRegistration, MediaWiki\Extension\CampaignEvents\MWEntity\MWAuthorityProxy, array)
#11 /workspace/src/extensions/CampaignEvents/src/Rest/AbstractEditEventRegistrationHandler.php(108): MediaWiki\Extension\CampaignEvents\Event\EditEventCommand->doEditIfAllowed(MediaWiki\Extension\CampaignEvents\Event\EventRegistration, MediaWiki\Extension\CampaignEvents\MWEntity\MWAuthorityProxy, array)
#12 /workspace/src/includes/Rest/Module/Module.php(386): MediaWiki\Extension\CampaignEvents\Rest\AbstractEditEventRegistrationHandler->execute()
#13 /workspace/src/includes/Rest/Module/Module.php(269): MediaWiki\Rest\Module\Module->executeHandler(MediaWiki\Extension\CampaignEvents\Rest\EnableEventRegistrationHandler)
#14 /workspace/src/includes/Rest/Router.php(487): MediaWiki\Rest\Module\Module->execute(string, MediaWiki\Rest\RequestFromGlobals)
#15 /workspace/src/includes/Rest/Router.php(446): MediaWiki\Rest\Router->doExecute(string, MediaWiki\Rest\RequestFromGlobals)
#16 /workspace/src/includes/Rest/EntryPoint.php(211): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#17 /workspace/src/includes/MediaWikiEntryPoint.php(201): MediaWiki\Rest\EntryPoint->execute()
#18 /workspace/src/rest.php(39): MediaWiki\MediaWikiEntryPoint->run()
#19 {main}
mysql-error.log
2024-12-05 14:49:58 213 [Note] InnoDB: Transactions deadlock detected, dumping detailed information.
2024-12-05 14:49:58 213 [Note] InnoDB: 
*** (1) TRANSACTION:

TRANSACTION 1780, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 8 lock struct(s), heap size 1128, 4 row lock(s), undo log entries 2
MySQL thread id 214, OS thread handle 139981050271488, query id 3406 localhost wikiuser Update
INSERT /* MediaWiki\Extension\CampaignEvents\Questions\EventQuestionsStore::replaceEventQuestions  */ INTO `ce_event_questions` (ceeq_event_id,ceeq_question_id) VALUES (3,1),(3,2),(3,3),(3,4),(3,5)
2024-12-05 14:49:58 213 [Note] InnoDB: *** (1) WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 115 page no 4 n bits 72 index ceeq_event_question of table `wikidb`.`ce_event_questions` trx id 1780 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

2024-12-05 14:49:58 213 [Note] InnoDB: *** (2) TRANSACTION:

TRANSACTION 1779, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
9 lock struct(s), heap size 1128, 5 row lock(s), undo log entries 2
MySQL thread id 213, OS thread handle 139981052729088, query id 3404 localhost wikiuser Update
INSERT /* MediaWiki\Extension\CampaignEvents\Questions\EventQuestionsStore::replaceEventQuestions  */ INTO `ce_event_questions` (ceeq_event_id,ceeq_question_id) VALUES (2,1),(2,2),(2,3),(2,4),(2,5)
2024-12-05 14:49:58 213 [Note] InnoDB: *** (2) HOLDS THE LOCK(S):

RECORD LOCKS space id 115 page no 4 n bits 72 index ceeq_event_question of table `wikidb`.`ce_event_questions` trx id 1779 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

2024-12-05 14:49:58 213 [Note] InnoDB: *** (2) WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 115 page no 4 n bits 72 index ceeq_event_question of table `wikidb`.`ce_event_questions` trx id 1779 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

2024-12-05 14:49:58 213 [Note] InnoDB: *** WE ROLL BACK TRANSACTION (1)

I have already investigated this and will write a report in a comment. But the key takeaway is that the relevant code is prone to lock contention, and this is probably also happening in production (and not just tests).

Event Timeline

The relevant code is in EventQuestionsStore::replaceEventQuestions, which does three things:

  • Get a list of current event questions with a locking read
  • Delete rows for questions that have been removed
  • Add new rows for questions that have been added

For example:

SELECT * FROM `ce_event_questions` WHERE ceeq_event_id=3 FOR UPDATE; -- Always
DELETE FROM `ce_event_questions` WHERE ceeq_id IN ( 10, 20, 30 ); -- Delete by PK if anything must be removed
INSERT INTO `ce_event_questions` (ceeq_event_id,ceeq_question_id) VALUES (3,1),(3,2),(3,3),(3,4),(3,5); -- Add new, if anything to add

Ironically, this three-step update is done to avoid lock contention, and yet we're seeing the opposite! Taking a closer look at the table definition:

CREATE TABLE /*_*/ce_event_questions (
  ceeq_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
  ceeq_event_id BIGINT UNSIGNED NOT NULL,
  ceeq_question_id INT UNSIGNED NOT NULL,
  UNIQUE INDEX ceeq_event_question (ceeq_event_id, ceeq_question_id),
  PRIMARY KEY(ceeq_id)
) /*$wgDBTableOptions*/;

We can see that it has a unique index on the inserted values. Conveniently, just a few months ago a deadlock on the user table, also involving a unique index, has been investigated in T199393 (see T199393#10089626 and following comments). That one used LOCK IN SHARE MODE, not FOR UPDATE, but the underlying cause is basically the same. I strongly suggest reading the detailed report in that task, but here's a summary of what's happening.

  • The table starts off as empty
  • Thread 1 wants to insert data for event 1, so it runs a locking select:
SELECT * FROM `ce_event_questions` WHERE ceeq_event_id=1 FOR UPDATE;
  • Because the table is empty, no actual records are being locked. However, MariaDB will acquire an insert intention lock. This is effectively a gap lock, but we have no gap and the requested record is non-existent. So, MariaDB simply locks the next empty position in the index. In this case, there's no "next" free position, and so it locks the "supremum" pseudo-entry, essentially an open interval from the last existing record to +infinity.
  • Then, thread 2 wants to insert data for event 2 and it runs its own locking select. This also results in an exclusive lock on the supremum entry.
SELECT * FROM `ce_event_questions` WHERE ceeq_event_id=2 FOR UPDATE;
  • Next, thread 1 wants to insert its own data. But it can't, because thread 2 locked the same "gap", so it hangs.
INSERT INTO `ce_event_questions` (ceeq_event_id,ceeq_question_id) VALUES (1,1),(1,2),(1,3),(1,4),(1,5);
  • Next, thread 2 tries to insert its data, but thread 1's insert is still pending, and so we get a deadlock
INSERT INTO `ce_event_questions` (ceeq_event_id,ceeq_question_id) VALUES (2,1),(2,2),(2,3),(2,4),(2,5);

This can be reproduced easily:

$ mariadb --user root -proot --prompt 'A> '
$ mariadb --user root -proot --prompt 'B> '

A> create table t( t_pk BIGINT UNSIGNED AUTO_INCREMENT NOT NULL, t_val BIGINT UNSIGNED NOT NULL, UNIQUE INDEX t_val_i (t_val), PRIMARY KEY(t_pk) );

A> begin;
B> begin;

A> select * from t where t_val=1 FOR UPDATE;
--ok
B> select * from t where t_val=2 FOR UPDATE;
--ok

A> insert into t (t_val) values (1);
-- hangs
B> insert into t (t_val) values (2);
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

In fact, it turns out the index doesn't even need to be unique. All that matters is that FOR UPDATE is used in a query that is not matching any rows. This ends up locking the supremum record, and therefore preventing any concurrent operations. There would be no deadlock if any of the SELECTs actually matched something. Once I figured this out, a quick search on the internet returned plenty of questions about this behaviour. For example, see https://mysqlquicksand.wordpress.com/2019/12/20/select-for-update-on-non-existent-rows/.

As I mentioned in the task description, this also suggests that the same deadlock would be observable in production, if for some reason we get high concurrency. And it can happen on any table on which we use a similar pattern, including ce_question_answers, ce_participants, and ce_tracking_tools. For some of these, high concurrency would at times be possible (for example, ce_participants if lots of people register for an event at the same time).

And as for fixing: we need to avoid FOR UPDATE on rows that may not exist. There is just no way to avoid the next-key lock if the SELECT does not find any rows. I'll look into the specific queries to see what change would make the most sense. In the majority (or all) of the usages, I suspect that we could simply drop the FOR UPDATE and let the last update win. We could make sure that we're reading from the master DB (and not a replica) to avoid incurring in replication lag issues, but other than that, simultaneous updates shouldn't be too bad.

Change #1101533 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[mediawiki/extensions/CampaignEvents@master] Avoid deadlock-prone `SELECT ... FOR UPDATE` queries

https://gerrit.wikimedia.org/r/1101533

Daimona renamed this task from Avoid deadlocks in CampaignEvents tables with unique indexes to CampaignEvents: Avoid deadlocks due to FOR UPDATE queries that may find no records.Mon, Dec 9, 3:17 PM

Change #1101533 merged by jenkins-bot:

[mediawiki/extensions/CampaignEvents@master] Avoid deadlock-prone `SELECT ... FOR UPDATE` queries

https://gerrit.wikimedia.org/r/1101533