Page MenuHomePhabricator

UserOptionsManager: DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction ([db])Function: MediaWiki\User\UserOptionsManager::saveOptionsInternalQuery
Open, HighPublicPRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   Wikimedia\Rdbms\DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction (db1181)
Function: MediaWiki\User\UserOptionsManager::saveOptionsInternal
Query: INSERT IGNORE INTO `user_properties`
exception.trace
from /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/Database.php(1779)
#0 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/Database.php(1763): Wikimedia\Rdbms\Database->getQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/Database.php(1737): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, integer, string, string)
#2 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/Database.php(1217): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#3 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/Database.php(2543): Wikimedia\Rdbms\Database->query(string, string, integer)
#4 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/Database.php(2503): Wikimedia\Rdbms\Database->doInsertNonConflicting(string, array, string)
#5 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->insert(string, array, string, array)
#6 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/DBConnRef.php(335): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#7 /srv/mediawiki/php-1.39.0-wmf.8/includes/user/UserOptionsManager.php(465): Wikimedia\Rdbms\DBConnRef->insert(string, array, string, array)
#8 /srv/mediawiki/php-1.39.0-wmf.8/includes/user/User.php(2853): MediaWiki\User\UserOptionsManager->saveOptionsInternal(User, Wikimedia\Rdbms\DBConnRef)
#9 /srv/mediawiki/php-1.39.0-wmf.8/includes/auth/AuthManager.php(1846): User->saveSettings()
#10 /srv/mediawiki/php-1.39.0-wmf.8/extensions/CentralAuth/includes/CentralAuthUtilityService.php(164): MediaWiki\Auth\AuthManager->autoCreateUser(User, string, boolean, boolean)
#11 /srv/mediawiki/php-1.39.0-wmf.8/extensions/CentralAuth/includes/User/CentralAuthCreateLocalAccountJob.php(109): MediaWiki\Extension\CentralAuth\CentralAuthUtilityService->autoCreateUser(User)
#12 /srv/mediawiki/php-1.39.0-wmf.8/extensions/EventBus/includes/JobExecutor.php(79): MediaWiki\Extension\CentralAuth\User\CentralAuthCreateLocalAccountJob->run()
#13 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#14 {main}
Impact
Notes

A handful of these seen in 1.39.0-wmf.8 (T305214).

Similar:

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
taavi added a subscriber: taavi.

This doesn't look like a CentralAuth issue. CA manages user creation in production so you're going to see it in production account creation issue stack traces, but preferences management is a core feature that CA does not handle.

Change 801813 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/PageTriage@master] Move saving user options to a POSTSEND DeferredUpdate

https://gerrit.wikimedia.org/r/801813

Funnily enough, documentation of the hook literally says this: https://www.mediawiki.org/wiki/Manual:Hooks/LocalUserCreated

Changes which are not immediately required for database consistency should probably be deferred; see Database transactions.

cc @Krinkle

Ladsgroup triaged this task as Medium priority.May 31 2022, 8:52 PM
Ladsgroup added a project: DBA.
Ladsgroup moved this task from Triage to In progress on the DBA board.

Change 801813 merged by jenkins-bot:

[mediawiki/extensions/PageTriage@master] Don't call saveOptions in LocalUserCreated

https://gerrit.wikimedia.org/r/801813

Change 802106 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/PageTriage@wmf/1.39.0-wmf.14] Don't call saveOptions in LocalUserCreated

https://gerrit.wikimedia.org/r/802106

Change 802107 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/PageTriage@wmf/1.39.0-wmf.13] Don't call saveOptions in LocalUserCreated

https://gerrit.wikimedia.org/r/802107

Change 802106 merged by jenkins-bot:

[mediawiki/extensions/PageTriage@wmf/1.39.0-wmf.14] Don't call saveOptions in LocalUserCreated

https://gerrit.wikimedia.org/r/802106

Change 802107 merged by jenkins-bot:

[mediawiki/extensions/PageTriage@wmf/1.39.0-wmf.13] Don't call saveOptions in LocalUserCreated

https://gerrit.wikimedia.org/r/802107

Mentioned in SAL (#wikimedia-operations) [2022-06-01T11:15:41Z] <ladsgroup@deploy1002> Synchronized php-1.39.0-wmf.14/extensions/PageTriage/includes/Hooks.php: Backport: [[gerrit:802106|Don't call saveOptions in LocalUserCreated (T306636)]] (duration: 03m 01s)

Mentioned in SAL (#wikimedia-operations) [2022-06-01T11:21:02Z] <ladsgroup@deploy1002> Synchronized php-1.39.0-wmf.13/extensions/PageTriage/includes/Hooks.php: Backport: [[gerrit:802107|Don't call saveOptions in LocalUserCreated (T306636)]] (duration: 03m 16s)

Change 802128 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/Thanks@master] Don't call saveOptions in Hooks::onAccountCreated

https://gerrit.wikimedia.org/r/802128

Change 802128 merged by jenkins-bot:

[mediawiki/extensions/Thanks@master] Don't call saveOptions in Hooks::onAccountCreated

https://gerrit.wikimedia.org/r/802128

Change 802109 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/Thanks@wmf/1.39.0-wmf.13] Don't call saveOptions in Hooks::onAccountCreated

https://gerrit.wikimedia.org/r/802109

Change 802110 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/Thanks@wmf/1.39.0-wmf.14] Don't call saveOptions in Hooks::onAccountCreated

https://gerrit.wikimedia.org/r/802110

Change 802109 merged by jenkins-bot:

[mediawiki/extensions/Thanks@wmf/1.39.0-wmf.13] Don't call saveOptions in Hooks::onAccountCreated

https://gerrit.wikimedia.org/r/802109

Change 802110 merged by jenkins-bot:

[mediawiki/extensions/Thanks@wmf/1.39.0-wmf.14] Don't call saveOptions in Hooks::onAccountCreated

https://gerrit.wikimedia.org/r/802110

Mentioned in SAL (#wikimedia-operations) [2022-06-01T14:55:44Z] <ladsgroup@deploy1002> Synchronized php-1.39.0-wmf.13/extensions/Thanks/includes/Hooks.php: Backport: [[gerrit:802109|Don't call saveOptions in Hooks::onAccountCreated (T306636)]] (duration: 03m 10s)

Mentioned in SAL (#wikimedia-operations) [2022-06-01T15:02:15Z] <ladsgroup@deploy1002> Synchronized php-1.39.0-wmf.14/extensions/Thanks/includes/Hooks.php: Backport: [[gerrit:802110|Don't call saveOptions in Hooks::onAccountCreated (T306636)]] (duration: 03m 10s)

Change 802172 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/Wikibase@master] Don't call saveSettings in EchoNotificationsHandlers::doLocalUserCreated

https://gerrit.wikimedia.org/r/802172

Change 802114 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/Wikibase@wmf/1.39.0-wmf.14] Don't call saveSettings in EchoNotificationsHandlers::doLocalUserCreated

https://gerrit.wikimedia.org/r/802114

Change 802172 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Don't call saveSettings in EchoNotificationsHandlers::doLocalUserCreated

https://gerrit.wikimedia.org/r/802172

Change 802114 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@wmf/1.39.0-wmf.14] Don't call saveSettings in EchoNotificationsHandlers::doLocalUserCreated

https://gerrit.wikimedia.org/r/802114

Mentioned in SAL (#wikimedia-operations) [2022-06-01T18:05:10Z] <ladsgroup@deploy1002> Synchronized php-1.39.0-wmf.14/extensions/Wikibase/client: Backport: [[gerrit:802114|Don't call saveSettings in EchoNotificationsHandlers::doLocalUserCreated (T306636)]] (duration: 03m 11s)

deadlocks have gone to basically zero.

image.png (136×950 px, 12 KB)

It's around 10% of its original flow, some left and I will check that once I'm back

Change 810000 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/LanguageSelector@master] Hooks: Avoid calling saveOptions in onLocalUserCreated

https://gerrit.wikimedia.org/r/810000

Ladsgroup lowered the priority of this task from Medium to Low.
Ladsgroup edited projects, added Data-Persistence (Consultation); removed DBA.

So I have reduced its volume to basically zero (found 4 cases in the past two days) and more work wouldn't be worth it. I leave it open for now.

Change 810000 merged by jenkins-bot:

[mediawiki/extensions/LanguageSelector@master] Hooks: Avoid calling saveOptions in onLocalUserCreated

https://gerrit.wikimedia.org/r/810000

So I have reduced its volume to basically zero (found 4 cases in the past two days) and more work wouldn't be worth it. I leave it open for now.

I took a look - only three reports for a day - logstash link for June 30

Zabe raised the priority of this task from Low to Medium.Aug 2 2022, 10:26 PM
Zabe added a subscriber: Zabe.

There were 119 of these in the last hour. All of them showed up in an automatic account creation of a local account in metawiki or loginwiki from CentralAuth.

Zabe raised the priority of this task from Medium to High.Aug 2 2022, 10:51 PM

There were 119 of these in the last hour. All of them showed up in an automatic account creation of a local account in metawiki or loginwiki from CentralAuth.

This rate actually only started rapidly increasing a few hours ago.

I have a feeling someone added saving user options again...

The error rate actually went down again. (/me is a bit confused)

T306636.png (241×674 px, 8 KB)

(error rate in the last 24 hours)

There are also some unneeded deletion on the preferences, which could hold locks as well, see T301506

There are also some unneeded deletion on the preferences, which could hold locks as well, see T301506

It's a misconception that deadlocks happen for holding locks too long. Holding a lock too long usually leads to lock timeout. It happens when to competing threads lock different things that depend on each other (one waits for the lock on A to be lifted while thread that holds the lock on A is waiting for the first thread to unlock something). That being said given that mediawiki wraps everything into a transaction, holding a lock for too long in some cases can lead to deadlocks.

Debugging deadlocks is a bit hard, they come in pairs but the error shows only one leg of the issue. Thankfully you can check the last deadlock in master (by SHOW ENGINE INNODB STATUS \G) and get more information. The last deadlock seen on s7 master is this:

2022-08-08 *redacted* *redacted*
*** (1) TRANSACTION:
TRANSACTION *redacted*, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 7 lock struct(s), heap size 1128, 2 row lock(s), undo log entries 5
MySQL thread id *redacted*, OS thread handle *redacted*, query id *redacted* 10.64.16.199 wikiuser202206 Update
INSERT /* MediaWiki\User\UserOptionsManager::saveOptionsInternal  */ IGNORE INTO `user_properties` (up_user,up_property,up_value) VALUES (3******9,'rcenhancedfilters-seen-tour',1),(3******9,'wlenhancedf>
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id *redacted* page no *redacted* n bits 208 index PRIMARY of table `metawiki`.`user_properties` trx id *redacted* lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex *redacted*; asc supremum;;

*** (2) TRANSACTION:
TRANSACTION *redacted*, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
7 lock struct(s), heap size 1128, 2 row lock(s), undo log entries 5
MySQL thread id *redacted*, OS thread handle *redacted*, query id *redacted* 10.64.48.79 wikiuser202206 Update
INSERT /* MediaWiki\User\UserOptionsManager::saveOptionsInternal  */ IGNORE INTO `user_properties` (up_user,up_property,up_value) VALUES (3******0,'rcenhancedfilters-seen-tour',1),(3******0,'wlenhancedf>
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id *redacted* page no *redacted* n bits *redacted* index PRIMARY of table `metawiki`.`user_properties` trx id *redacted* lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex *redacted*; asc supremum;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id *redacted* page no *redacted* n bits *redacted* index PRIMARY of table `metawiki`.`user_properties` trx id *redacted* lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex *redacted*; asc supremum;;

*** WE ROLL BACK TRANSACTION (2)

So it's because two users have been created too close to each other. Now we need to figure out why and how the insert ignore is holding the lock for different user (one up and one down), is it gap lock? I check it in depth soon.

This doesn't make any sense to me. These are insert intention locks on a non-autoincrement table. They should only be exclusive for the specific index record they are inserting; they are not supposed to prevent other transations from writing in the same gap. The primary key is (up_user, up_property) so unless there is something weird we don't see because of the query text being truncated, it should not be possible for these two queries to deadlock.

The inserted options rcenhancedfilters-seen-tour and wlenhancedfilters-seen-tour are set in LocalUserCreated hook (from the WikimediaMessages extensions, but there are more extensions as well).

But with the other fixes in this tasks the hooks provided the settings and there is only one save to the database per user id, so just the high load seems to result in the deadlock (maybe look up the user_registration value for that user id to show if there are very similiar)

The locks are called "eXclusive insert intentation"

It seems there is no explict documentation about locks on composite primary key and use of multi-row insert on that (with the extra that the value of the first column of that two-column primary key is always the same and avoid a big spread across the primary key index)

To make the gaps more reliable on inserts the rows-to-insert could be sorted descending on the primary key, so lower keys are in the gap between the highest key of the newly-inserted user id and the max user id on the database. That means there is never an insert of a new row into the gap between the current highest of the one down and the lowest of the one up where a conflict between two threads are possible (the code knows that both inserts are not overlapping, but the database does not know that)

This assumed the locks on multipe-row-insert are handled after each row, not after the whole statement. (But the highest key could be inserted with a single insert before doing the multi-row insert, if that assumption is wrong - with a little performance impact).

This seems not to happen on auto-increment primary key, because there is special lock handling. Thats why both users are inserted without deadlock in the user table.


To explain it with an example

Current situation with unordered values:
transaction 1 has to insert

42 C
42 D
42 A

transaction 2 has to insert

43 B
43 A
43 D

If both transaction running simultaneous the insert in the database could be

42 C
43 B

next step needs to insert between both rows

42 C
42 D <!-- new position transaction 1
43 A <!-- new position transaction 2
43 B

Possible resulting in the deadlock?

Better:
First step:

42 D
43 D

Next step

42 C <!-- new position transaction 1
42 D
43 B <!-- new position transaction 2
43 D

Change 821798 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] Sort user properties on insert

https://gerrit.wikimedia.org/r/821798

I don't think that is a thing.

session1> insert into user_properties VALUES (42, 'C', 'x'), (43, 'B', 'x');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

session1> begin;
Query OK, 0 rows affected (0.00 sec)

session1> begin;
Query OK, 0 rows affected (0.00 sec)

session1> insert into user_properties VALUES (42, 'D', 'x');
Query OK, 1 row affected (0.00 sec)

session2> insert into user_properties VALUES (43, 'A', 'x');
Query OK, 1 row affected (0.00 sec)

session1> commit;
Query OK, 0 rows affected (0.00 sec)

session2> commit;
Query OK, 0 rows affected (0.00 sec)

After reading docs and logs and checking the deadlock report and comparing it with other deadlock reports, I think I have an idea what this might be coming from.

The second lock is an insert intention lock. Seen as lock_mode X insert intention waiting but the first lock is not. It doesn't have intention waiting in it and more importantly the value of the lock has "supremum" in it, instead of the actual value. Compare this lock report with another deadlock I'm checking:

Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex *redacted*; asc supremum;;

vs.:

Record lock, heap no *redacted* PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 15; hex  *redacted*; asc Pages_with_maps;;
 1: len 4; hex  *redacted*; asc  pl ;;

(That's a lock on category table: T300303)

MySQL docs say it's a different type of lock called Next-Key Lock: https://docs.oracle.com/cd/E17952_01/mysql-8.0-en/innodb-locking.html

And it happens when there is an index with the integer value (it would be a simple record lock if the value is string it seems). And it locks basically all values above the largest it has seen if there is an insert. And it would explain everything.

I couldn't reproduce the deadlock locally though, mostly because I can't locally make the insert go on next-key lock. The lock report in my localhost is completely different from what I'm seeing in production so I'm not sure our localhost experiments would work here.

I have no idea how we can fix this yet. Maybe do the insertion out of transaction to bypass the repeatable read and the highest value seen race condition. Or make it a separate job or a deferred update?

If it's a next-key lock, how do you get a deadlock? That's set on the gap before the row, so the process saving the options for the user with the larger ID could block the other process, if it locks the gap between the smaller user's last and the larger user's first record, but the smaller user's process has no way of blocking the other process.

In any case, can we decrease the isolation level? I don't think we need repeatable read here. Or would that interfere with replication?