Page MenuHomePhabricator

Reproducible deadlock in User::addToDatabase() when api.php?action=createaccount is called simultaneously by several users
Open, MediumPublic

Description

I can see reproducible (testcase is attached) DBQueryError exceptions when multiple users try to CreateAccount via API simultaneously.

Function: User::addToDatabase\nError: 1213 Deadlock found when trying to get lock; try restarting transaction

I've been running a Selenium test in 5 browsers. The test starts with creating an account via api.php?action=createaccount, which sometimes fails with DBQueryError failures.
So I wrote a small testcase "send 10 async requests to api.php?action=createaccount". It fails 100% of the time (both for latest MediaWiki 1.31 and legacy MediaWiki 1.27).

Reproducible with 10.2.16-MariaDB, 10.3.8-MariaDB, MySQL 5.6.

Testcase to reproduce

npm install
API=http://somewiki/w/api.php node ReproduceBug.js

Related Objects

Event Timeline

DB error details:

2018-07-12 01:17:45 localhost.localdomain modwiki31: DB error: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading?
Query: INSERT IGNORE INTO user (user_name,user_password,user_newpassword,user_email,user_email_authenticated,user_real_name,user_token,user_registration,user_editcount,user_touched) VALUES ('Test User 1531358264587 0.8544155829717672','','','',NULL,'','94505107f7a1a7a3f08348f5e494bec0','20180712011745','0','20180712011750')
Function: User::addToDatabase
Error: 1213 Deadlock found when trying to get lock; try restarting transaction (127.0.0.1)

Stacktrace (MediaWiki 1.31):

Wikimedia\Rdbms\DBQueryError at /var/www/html/modwiki31/includes/libs/rdbms/database/Database.php(1457)
#0 /var/www/html/modwiki31/includes/libs/rdbms/database/Database.php(1427): Wikimedia\Rdbms\Database->makeQueryException(string, integer, string, string)
#1 /var/www/html/modwiki31/includes/libs/rdbms/database/Database.php(1200): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /var/www/html/modwiki31/includes/libs/rdbms/database/Database.php(1966): Wikimedia\Rdbms\Database->query(string, string)
#3 /var/www/html/modwiki31/includes/user/User.php(4322): Wikimedia\Rdbms\Database->insert(string, array, string, string)
#4 [internal function]: User->{closure}(Wikimedia\Rdbms\DatabaseMysqli, string)
#5 /var/www/html/modwiki31/includes/libs/rdbms/database/Database.php(3664): call_user_func_array(Closure, array)
#6 /var/www/html/modwiki31/includes/user/User.php(4350): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure)
#7 /var/www/html/modwiki31/includes/auth/AuthManager.php(1404): User->addToDatabase()
#8 /var/www/html/modwiki31/includes/auth/AuthManager.php(1131): MediaWiki\Auth\AuthManager->continueAccountCreation(array)
#9 /var/www/html/modwiki31/includes/api/ApiAMCreateAccount.php(93): MediaWiki\Auth\AuthManager->beginAccountCreation(User, array, string)
#10 /var/www/html/modwiki31/includes/api/ApiMain.php(1579): ApiAMCreateAccount->execute()
#11 /var/www/html/modwiki31/includes/api/ApiMain.php(535): ApiMain->executeAction()
#12 /var/www/html/modwiki31/includes/api/ApiMain.php(506): ApiMain->executeActionWithErrorHandling()
#13 /var/www/html/modwiki31/api.php(94): ApiMain->execute()
#14 {main}

Does the deadlock also happen if multiple users use the web UI Special:CreateAccount?

Yes, the deadlock was reproduced on Special:CreateAccount too. Testcase is attached.

$ API=http://modwiki31.example.com/modwiki31/api.php CREATEACCOUNT_URL=http://modwiki31.example.com/wiki/Special:CreateAccount node ReproduceBugNonApi.js

HTTP error 500
.\nThis may indicate a bug in the software.</p><div class=\"errorbox mw-content-ltr\"><p>[W0dtGZIf@DUkZomt06SRTwAAAAU] /wiki/Special:CreateAccount Wikimedia\\Rdbms\\DBQueryError from line 1457 of /var/www/html/modwiki31/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? <br />\nQuery: INSERT IGNORE INTO user [...]

Using MariaDB 10.1.29 locally, I narrowed it down to the following series of queries being executed (intermixed with many other queries) as part of the account creation process:

A: BEGIN;
A: SELECT user_id  FROM `user` WHERE user_name = 'T199393.153140698912290'  LIMIT 1   LOCK IN SHARE MODE;

B: BEGIN;
B: SELECT user_id  FROM `user` WHERE user_name = 'T199393.153140698912291'  LIMIT 1   LOCK IN SHARE MODE;

A: INSERT IGNORE INTO `user` (user_name,user_password,user_newpassword,user_email,user_email_authenticated,user_real_name,user_token,user_registration,user_editcount,user_touched) VALUES ('T199393.153140698912290','','','',NULL,'','b0ad88565a69596f779b0eb41607cc12','20180712144949','0','20180712144954');
-- A is now blocking

B: INSERT IGNORE INTO `user` (user_name,user_password,user_newpassword,user_email,user_email_authenticated,user_real_name,user_token,user_registration,user_editcount,user_touched) VALUES ('T199393.153140698912291','','','',NULL,'','b0ad88565a69596f779b0eb41607cc12','20180712144949','0','20180712144954');
-- B reports a deadlock, A is now able to continue

You can replace the "LOCK IN SHARE MODE" with "FOR UPDATE" without changing the behavior.

If I change the usernames to be less similar the deadlock doesn't occur.

Running MariaDB with innodb_print_all_deadlocks, the deadlock reports:

2018-07-12 11:17:06 7f60b4c98700InnoDB: transactions deadlock detected, dumping detailed information.
2018-07-12 11:17:06 7f60b4c98700
*** (1) TRANSACTION:
TRANSACTION 36589276, ACTIVE 21 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 205, OS thread handle 0x7f60b79d0700, query id 2571 localhost root update
INSERT IGNORE INTO `user` (user_name,user_password,user_newpassword,user_email,user_email_authenticated,user_real_name,user_token,user_registration,user_editcount,user_touched) VALUES ('T199393.153140698912290','','','',NULL,'','b0ad88565a69596f779b0eb41607cc12','20180712144949','0','20180712144954')
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1189014 page no 4 n bits 144 index `user_name` of table `mywiki`.`user` trx table locks 2 total table locks 4  trx id 36589276 lock_mode X locks gap before rec insert intention waiting lock hold time 4 wait time before grant 0
*** (2) TRANSACTION:
TRANSACTION 36589277, ACTIVE 10 sec inserting
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 206, OS thread handle 0x7f60b4c98700, query id 2572 localhost root update
INSERT IGNORE INTO `user` (user_name,user_password,user_newpassword,user_email,user_email_authenticated,user_real_name,user_token,user_registration,user_editcount,user_touched) VALUES ('T199393.153140698912291','','','',NULL,'','b0ad88565a69596f779b0eb41607cc12','20180712144949','0','20180712144954')
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 1189014 page no 4 n bits 144 index `user_name` of table `mywiki`.`user` trx table locks 2 total table locks 4  trx id 36589277 lock mode S locks gap before rec lock hold time 10 wait time before grant 0
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1189014 page no 4 n bits 144 index `user_name` of table `mywiki`.`user` trx table locks 2 total table locks 4  trx id 36589277 lock_mode X locks gap before rec insert intention waiting lock hold time 0 wait time before grant 0
*** WE ROLL BACK TRANSACTION (2)

(using FOR UPDATE changes that "lock mode S" to "lock mode X").

It seems the locking selects for nonexistent usernames both lock the "gap" between two existing usernames rather than the specific username queried, and they can both lock the same gap without blocking each other. Then the inserts each block on the other transaction's gap lock.

Tagged for CPT to look at as part of "clinic duty". This is causing some of the new API integration tests to fail.

Change 530343 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/tools/api-testing@master] Work around T199393.

https://gerrit.wikimedia.org/r/530343

You can replace the "LOCK IN SHARE MODE" with "FOR UPDATE" without changing the behavior.

It's surprising at first glance that SELECT ... FOR UPDATE does not block, but it is supported by the MariaDB manual:

  • An X lock is not granted if any other lock (X, S, IX, IS) is held.
  • An S lock is not granted if an X or IX lock is held. It is granted if an S or IS lock is held.
  • An IX lock is not granted if in X or S lock is held. It is granted if an IX or IS lock is held.
  • An IS lock is not granted if an X lock is held. It is granted if an S, IX or IS lock is held.

With LOCK IN SHARE MODE, the locks are both IS. With FOR UPDATE, the locks are both IX. Either way, they don't block each other, but an attempted upgrade to an X lock will block. This is reproducible with empty tables:

A> CREATE TABLE t (f INT PRIMARY KEY);
Query OK, 0 rows affected (0.04 sec)

A> BEGIN;
Query OK, 0 rows affected (0.00 sec)

A> SELECT * FROM t WHERE f=1 FOR UPDATE;
Empty set (0.00 sec)

B> SELECT * FROM t WHERE f=1 FOR UPDATE;
Empty set (0.00 sec)

A> INSERT INTO t VALUES (1);
-- waits
B> INSERT INTO t VALUES (1);
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

The existing workaround (e.g. documented in LocalFile.php line 1492) is to speculatively use INSERT IGNORE to lock the row, being careful to avoid SELECT FOR UPDATE. Both User::addToDatabase() and User::createNew() are similarly structured, starting with a speculative INSERT IGNORE and checking the number of affected rows, but it seems that policy has been lost along the way. I'm not sure of the exact call stack leading to the locking select -- maybe it is AuthManager::beginAccountCreation() -> canCreateAccount(READ_LOCKING) -> LocalPasswordPrimaryAuthenticationProvider::testUserExists(READ_LOCKING)?

Basically, I would get rid of all the locking reads. I don't understand what they are there for. Someone (@aaron?) had better explain it to me before I start randomly deleting READ_LOCKING options.

On CI, Quibble exposes MediaWiki with the php builtin webserver which handles the requests serially. So I guess that kind of forces the users to be created serially. We want to switch to Apache (T225218) and the experimental job we have keeps failing on the UserContribs.js tests.

I guess @daniel workaround patch should be rebased / merged https://gerrit.wikimedia.org/r/c/mediawiki/tools/api-testing/+/530343/ . An alternative is to mark the tests broken / skip them for now?

Change 530343 abandoned by Awight:
[mediawiki/tools/api-testing@master] Work around T199393.

Reason:
These files have moved to mediawiki-core

https://gerrit.wikimedia.org/r/530343

Change 654417 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/core@master] Avoid Promise.all when creating users to prevent deadlock

https://gerrit.wikimedia.org/r/654417

Change 654421 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/tools/api-testing@master] [WIP] Prevent concurrent account creation

https://gerrit.wikimedia.org/r/654421

Change 654417 merged by jenkins-bot:
[mediawiki/core@master] Avoid Promise.all when creating users to prevent deadlock

https://gerrit.wikimedia.org/r/654417

Krinkle added a subscriber: Krinkle.

[…]

Basically, I would get rid of all the locking reads. I don't understand what they are there for. Someone (@aaron?) had better explain it to me before I start randomly deleting READ_LOCKING options.

This sounds like a good oppertunity for our teams to collaborate and help each other improve stability and code health. Perhaps we can schedule this for next time you're on clinic duty for you and Aaron to touch base on this?

This is also seen in api-testing tests when run in parallel (https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php72-docker/25089/console):

00:01:19.071   9) Categories
00:01:19.071        "before all" hook for "can be added using wikitext syntax":
00:01:19.072      AssertionError: User "<anon>": Action "createaccount" returned error code "internal_api_error_DBQueryError": [952d1585f8f327b14dca4e02] Exception caught: A database query error has occurred. This may indicate a bug in the software.!
00:01:19.072       at Client.action (node_modules/api-testing/lib/actionapi.js:125:20)
00:01:19.072       at processTicksAndRejections (internal/process/task_queues.js:95:5)
00:01:19.072       at async Client.createAccount (node_modules/api-testing/lib/actionapi.js:423:24)
00:01:19.072       at async Object.user (node_modules/api-testing/lib/action_clients.js:79:25)
00:01:19.072       at async Context.<anonymous> (tests/api-testing/action/Categories.js:18:11)
00:01:19.072

Change 752143 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/core@master] Speculative fix for deadlock in User::addToDatabase() when api.php?action=createaccount

https://gerrit.wikimedia.org/r/752143

Change 752149 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/tools/api-testing@master] Prefix titles/usernames with random strings

https://gerrit.wikimedia.org/r/752149

In T199393#4420174 @Anomie got more detailed via MariaDB innodb_print_all_deadlocks and the RECORD LOCKS entries have:

n bits144
indexuser_name

The test used T199393.153140698912290 and T199393.153140698912291 and Anomie mentions they end up locking the same gap. It might be the n bits and tentatively the lock would be held solely against the first 144 bits (or 18 bytes) of the field. Both usernames share the same first 18 bytes: T199393.1531406989

That leads me to https://mariadb.com/kb/en/innodb-lock-modes/#gap-locks or https://dev.mysql.com/doc/refman/8.0/en/innodb-locking.html#innodb-gap-locks

The doc states:

If a statement uses all the columns of a unique index to search for unique row, gap locking is not used.

The user_name field as a unique index, MySQL states at https://dev.mysql.com/doc/refman/8.0/en/innodb-locks-set.html that there is no gap lock for INSERT but there is still an "intent lock".

The usernames are made unique using a timestamp, maybe we can just reverse the username or we could provide a functionality in wdio-mediawiki which would provide a unique username which first bytes vary enough that we would not be subject to the gap locking. Then looking at the api testing database errors:

From User::addToDatabase, Deadlock found when trying to get lock; try restarting transaction
INSERT IGNORE INTO user (user_name,user_password,user_newpassword,user_email,user_email_authenticated,user_real_name,user_token,user_registration,user_editcount,user_touched)

It had the following values:

VALUES ('Mindyoj759hBix9','','','',NULL,'','ce1a2dc893f4e96a910874933bfba74e','20220106212220',0,'20220106212220')
VALUES ('SamBSfS8YKS4c','','','',NULL,'','f95af29cde6caa2f166e1f9f0f03b5f3','20220106212220',0,'20220106212220')
VALUES ('MindykQGNKOhn9Y','','','',NULL,'','81279df004949914ef099cd4045cd231','20220106212220',0,'20220106212220')
VALUES ('AliceIvuoxLxXnE','','','',NULL,'','0f61b5f2fcd9d0cecad9aa76e724d43f','20220106212220',0,'20220106212220')
VALUES ('Alice3CJMPIaK17','','','',NULL,'','1874ed0a7917d4f159ecccc796b900d9','20220106212220',0,'20220106212220')
VALUES ('AliceV6V9Sv3C5O','','','',NULL,'','edb918b4b33d9c58e713ddcaf7dbaa73','20220106212220',0,'20220106212220')
VALUES ('63VHHrv1yMPFmpVyAt','','','',NULL,'','6457e8e0ab304a5ff0ae8ee3aadcb779','20220106212220',0,'20220106212220')
VALUES ('Kam r3DFlznNch','','','',NULL,'','fe1fd4bd8efd4ad23e0decf585096c07','20220106212221',0,'20220106212221')
VALUES ('BobntBsLlTnG8','','','',NULL,'','be3b457e09b49fe5232dec7d03a97a52','20220106212223',0,'20220106212223')
VALUES ('Fiona ikexEGwES6','','','',NULL,'','0e2f1e984e61c23aefa9486d949f0339','20220106212223',0,'20220106212223')

The user_name entries are largely different, but maybe those are due to a lock on user_registration and or user_touched :\

Change 752149 abandoned by Kosta Harlan:

[mediawiki/tools/api-testing@master] Prefix titles/usernames with random strings

Reason:

This doesn't work

https://gerrit.wikimedia.org/r/752149

Change 752356 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/tools/api-testing@master] Retry account creation to work around deadlock

https://gerrit.wikimedia.org/r/752356

Change 752143 abandoned by Kosta Harlan:

[mediawiki/core@master] Speculative fix for deadlock in User::addToDatabase() when api.php?action=createaccount

Reason:

This doesn't work. I think it's going to be more reasonable to deal with this in the testing tool by retrying creation attempts. Id309705ead69026ccc26bdf377c36bbf66dda372

https://gerrit.wikimedia.org/r/752143

Change 752356 merged by jenkins-bot:

[mediawiki/tools/api-testing@master] Retry account creation to work around deadlock

https://gerrit.wikimedia.org/r/752356

@kostajh indeed looks like blindly retrying the account creation is the easiest path forward. Well at least for api-testing / CI Thx!

Change 753533 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/DiscussionTools@master] PreferenceHooks: Don't call saveSettings in LocalUserCreated

https://gerrit.wikimedia.org/r/753533

Change 753534 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/Echo@master] EchoHooks: Don't call saveSettings in LocalUserCreated

https://gerrit.wikimedia.org/r/753534

Change 753522 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/core@master] [WIP] Run saveOptionsInternal in DeferredUpdate

https://gerrit.wikimedia.org/r/753522

Change 753534 merged by jenkins-bot:

[mediawiki/extensions/Echo@master] EchoHooks: Don't call saveSettings in LocalUserCreated

https://gerrit.wikimedia.org/r/753534

Change 753533 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] PreferenceHooks: Don't call saveSettings in LocalUserCreated

https://gerrit.wikimedia.org/r/753533

Change 802189 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] auth: Improve documentation on LocalUserCreatedHook

https://gerrit.wikimedia.org/r/802189

Change 802189 merged by jenkins-bot:

[mediawiki/core@master] auth: Improve documentation on LocalUserCreatedHook

https://gerrit.wikimedia.org/r/802189