Page MenuHomePhabricator

Reproducible deadlock in User::addToDatabase() when api.php?action=createaccount is called simultaneously by several users
Open, MediumPublic

Description

I can see reproducible (testcase is attached) DBQueryError exceptions when multiple users try to CreateAccount via API simultaneously.

Function: User::addToDatabase\nError: 1213 Deadlock found when trying to get lock; try restarting transaction

I've been running a Selenium test in 5 browsers. The test starts with creating an account via api.php?action=createaccount, which sometimes fails with DBQueryError failures.
So I wrote a small testcase "send 10 async requests to api.php?action=createaccount". It fails 100% of the time (both for latest MediaWiki 1.31 and legacy MediaWiki 1.27).

Reproducible with 10.2.16-MariaDB, 10.3.8-MariaDB, MySQL 5.6.

Testcase to reproduce

npm install
API=http://somewiki/w/api.php node ReproduceBug.js

Event Timeline

DB error details:

2018-07-12 01:17:45 localhost.localdomain modwiki31: DB error: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading?
Query: INSERT IGNORE INTO user (user_name,user_password,user_newpassword,user_email,user_email_authenticated,user_real_name,user_token,user_registration,user_editcount,user_touched) VALUES ('Test User 1531358264587 0.8544155829717672','','','',NULL,'','94505107f7a1a7a3f08348f5e494bec0','20180712011745','0','20180712011750')
Function: User::addToDatabase
Error: 1213 Deadlock found when trying to get lock; try restarting transaction (127.0.0.1)

Stacktrace (MediaWiki 1.31):

Wikimedia\Rdbms\DBQueryError at /var/www/html/modwiki31/includes/libs/rdbms/database/Database.php(1457)
#0 /var/www/html/modwiki31/includes/libs/rdbms/database/Database.php(1427): Wikimedia\Rdbms\Database->makeQueryException(string, integer, string, string)
#1 /var/www/html/modwiki31/includes/libs/rdbms/database/Database.php(1200): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /var/www/html/modwiki31/includes/libs/rdbms/database/Database.php(1966): Wikimedia\Rdbms\Database->query(string, string)
#3 /var/www/html/modwiki31/includes/user/User.php(4322): Wikimedia\Rdbms\Database->insert(string, array, string, string)
#4 [internal function]: User->{closure}(Wikimedia\Rdbms\DatabaseMysqli, string)
#5 /var/www/html/modwiki31/includes/libs/rdbms/database/Database.php(3664): call_user_func_array(Closure, array)
#6 /var/www/html/modwiki31/includes/user/User.php(4350): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure)
#7 /var/www/html/modwiki31/includes/auth/AuthManager.php(1404): User->addToDatabase()
#8 /var/www/html/modwiki31/includes/auth/AuthManager.php(1131): MediaWiki\Auth\AuthManager->continueAccountCreation(array)
#9 /var/www/html/modwiki31/includes/api/ApiAMCreateAccount.php(93): MediaWiki\Auth\AuthManager->beginAccountCreation(User, array, string)
#10 /var/www/html/modwiki31/includes/api/ApiMain.php(1579): ApiAMCreateAccount->execute()
#11 /var/www/html/modwiki31/includes/api/ApiMain.php(535): ApiMain->executeAction()
#12 /var/www/html/modwiki31/includes/api/ApiMain.php(506): ApiMain->executeActionWithErrorHandling()
#13 /var/www/html/modwiki31/api.php(94): ApiMain->execute()
#14 {main}

Does the deadlock also happen if multiple users use the web UI Special:CreateAccount?

Yes, the deadlock was reproduced on Special:CreateAccount too. Testcase is attached.

$ API=http://modwiki31.example.com/modwiki31/api.php CREATEACCOUNT_URL=http://modwiki31.example.com/wiki/Special:CreateAccount node ReproduceBugNonApi.js

HTTP error 500
.\nThis may indicate a bug in the software.</p><div class=\"errorbox mw-content-ltr\"><p>[W0dtGZIf@DUkZomt06SRTwAAAAU] /wiki/Special:CreateAccount Wikimedia\\Rdbms\\DBQueryError from line 1457 of /var/www/html/modwiki31/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? <br />\nQuery: INSERT IGNORE INTO user [...]

Using MariaDB 10.1.29 locally, I narrowed it down to the following series of queries being executed (intermixed with many other queries) as part of the account creation process:

A: BEGIN;
A: SELECT user_id  FROM `user` WHERE user_name = 'T199393.153140698912290'  LIMIT 1   LOCK IN SHARE MODE;

B: BEGIN;
B: SELECT user_id  FROM `user` WHERE user_name = 'T199393.153140698912291'  LIMIT 1   LOCK IN SHARE MODE;

A: INSERT IGNORE INTO `user` (user_name,user_password,user_newpassword,user_email,user_email_authenticated,user_real_name,user_token,user_registration,user_editcount,user_touched) VALUES ('T199393.153140698912290','','','',NULL,'','b0ad88565a69596f779b0eb41607cc12','20180712144949','0','20180712144954');
-- A is now blocking

B: INSERT IGNORE INTO `user` (user_name,user_password,user_newpassword,user_email,user_email_authenticated,user_real_name,user_token,user_registration,user_editcount,user_touched) VALUES ('T199393.153140698912291','','','',NULL,'','b0ad88565a69596f779b0eb41607cc12','20180712144949','0','20180712144954');
-- B reports a deadlock, A is now able to continue

You can replace the "LOCK IN SHARE MODE" with "FOR UPDATE" without changing the behavior.

If I change the usernames to be less similar the deadlock doesn't occur.

Running MariaDB with innodb_print_all_deadlocks, the deadlock reports:

2018-07-12 11:17:06 7f60b4c98700InnoDB: transactions deadlock detected, dumping detailed information.
2018-07-12 11:17:06 7f60b4c98700
*** (1) TRANSACTION:
TRANSACTION 36589276, ACTIVE 21 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 205, OS thread handle 0x7f60b79d0700, query id 2571 localhost root update
INSERT IGNORE INTO `user` (user_name,user_password,user_newpassword,user_email,user_email_authenticated,user_real_name,user_token,user_registration,user_editcount,user_touched) VALUES ('T199393.153140698912290','','','',NULL,'','b0ad88565a69596f779b0eb41607cc12','20180712144949','0','20180712144954')
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1189014 page no 4 n bits 144 index `user_name` of table `mywiki`.`user` trx table locks 2 total table locks 4  trx id 36589276 lock_mode X locks gap before rec insert intention waiting lock hold time 4 wait time before grant 0
*** (2) TRANSACTION:
TRANSACTION 36589277, ACTIVE 10 sec inserting
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 206, OS thread handle 0x7f60b4c98700, query id 2572 localhost root update
INSERT IGNORE INTO `user` (user_name,user_password,user_newpassword,user_email,user_email_authenticated,user_real_name,user_token,user_registration,user_editcount,user_touched) VALUES ('T199393.153140698912291','','','',NULL,'','b0ad88565a69596f779b0eb41607cc12','20180712144949','0','20180712144954')
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 1189014 page no 4 n bits 144 index `user_name` of table `mywiki`.`user` trx table locks 2 total table locks 4  trx id 36589277 lock mode S locks gap before rec lock hold time 10 wait time before grant 0
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1189014 page no 4 n bits 144 index `user_name` of table `mywiki`.`user` trx table locks 2 total table locks 4  trx id 36589277 lock_mode X locks gap before rec insert intention waiting lock hold time 0 wait time before grant 0
*** WE ROLL BACK TRANSACTION (2)

(using FOR UPDATE changes that "lock mode S" to "lock mode X").

It seems the locking selects for nonexistent usernames both lock the "gap" between two existing usernames rather than the specific username queried, and they can both lock the same gap without blocking each other. Then the inserts each block on the other transaction's gap lock.

Tagged for CPT to look at as part of "clinic duty". This is causing some of the new API integration tests to fail.

Change 530343 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/tools/api-testing@master] Work around T199393.

https://gerrit.wikimedia.org/r/530343

You can replace the "LOCK IN SHARE MODE" with "FOR UPDATE" without changing the behavior.

It's surprising at first glance that SELECT ... FOR UPDATE does not block, but it is supported by the MariaDB manual:

  • An X lock is not granted if any other lock (X, S, IX, IS) is held.
  • An S lock is not granted if an X or IX lock is held. It is granted if an S or IS lock is held.
  • An IX lock is not granted if in X or S lock is held. It is granted if an IX or IS lock is held.
  • An IS lock is not granted if an X lock is held. It is granted if an S, IX or IS lock is held.

With LOCK IN SHARE MODE, the locks are both IS. With FOR UPDATE, the locks are both IX. Either way, they don't block each other, but an attempted upgrade to an X lock will block. This is reproducible with empty tables:

A> CREATE TABLE t (f INT PRIMARY KEY);
Query OK, 0 rows affected (0.04 sec)

A> BEGIN;
Query OK, 0 rows affected (0.00 sec)

A> SELECT * FROM t WHERE f=1 FOR UPDATE;
Empty set (0.00 sec)

B> SELECT * FROM t WHERE f=1 FOR UPDATE;
Empty set (0.00 sec)

A> INSERT INTO t VALUES (1);
-- waits
B> INSERT INTO t VALUES (1);
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

The existing workaround (e.g. documented in LocalFile.php line 1492) is to speculatively use INSERT IGNORE to lock the row, being careful to avoid SELECT FOR UPDATE. Both User::addToDatabase() and User::createNew() are similarly structured, starting with a speculative INSERT IGNORE and checking the number of affected rows, but it seems that policy has been lost along the way. I'm not sure of the exact call stack leading to the locking select -- maybe it is AuthManager::beginAccountCreation() -> canCreateAccount(READ_LOCKING) -> LocalPasswordPrimaryAuthenticationProvider::testUserExists(READ_LOCKING)?

Basically, I would get rid of all the locking reads. I don't understand what they are there for. Someone (@aaron?) had better explain it to me before I start randomly deleting READ_LOCKING options.

On CI, Quibble exposes MediaWiki with the php builtin webserver which handles the requests serially. So I guess that kind of forces the users to be created serially. We want to switch to Apache (T225218) and the experimental job we have keeps failing on the UserContribs.js tests.

I guess @daniel workaround patch should be rebased / merged https://gerrit.wikimedia.org/r/c/mediawiki/tools/api-testing/+/530343/ . An alternative is to mark the tests broken / skip them for now?

Change 530343 abandoned by Awight:
[mediawiki/tools/api-testing@master] Work around T199393.

Reason:
These files have moved to mediawiki-core

https://gerrit.wikimedia.org/r/530343

Change 654417 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/core@master] Avoid Promise.all when creating users to prevent deadlock

https://gerrit.wikimedia.org/r/654417

Change 654421 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/tools/api-testing@master] [WIP] Prevent concurrent account creation

https://gerrit.wikimedia.org/r/654421

Change 654417 merged by jenkins-bot:
[mediawiki/core@master] Avoid Promise.all when creating users to prevent deadlock

https://gerrit.wikimedia.org/r/654417

Krinkle added a subscriber: Krinkle.

[…]

Basically, I would get rid of all the locking reads. I don't understand what they are there for. Someone (@aaron?) had better explain it to me before I start randomly deleting READ_LOCKING options.

This sounds like a good oppertunity for our teams to collaborate and help each other improve stability and code health. Perhaps we can schedule this for next time you're on clinic duty for you and Aaron to touch base on this?