This is a generalized version of T363284: Deadlock upon logging in in selenium tests with CentralAuth and SQLite, to address the underlying problem in CentralAuth as opposed to fixing the immediate problem in selenium. I wrote a detailed step-by-step investigation of a similar deadlock in T363284#10366655 and won't repeat much of it here, so I suggest reading that comment first.
Context
A typical configuration for CentralAuth is to use a separate centralauth database. This is also the recommendation in https://www.mediawiki.org/wiki/Extension:CentralAuth. Therefore, a typical request will likely need to query two databases: the local wiki database, and the centralauth database.
SQLite is the DB type used by MediaWiki-Docker and quickstart. For MW-Docker in particular, I made a guide on how to set up a wiki farm, and updated the CentralAuth instructions for SQLite. You can use these to reproduce the issue locally, if you don't already have a working multi-wiki environment.
Next, keep in mind that SQLite does not have record-level locking: as soon as we start a transaction with BEGIN IMMEDIATE, the whole database is locked. Other threads can't even start their own transactions, because IMMEDIATE makes SQLite treat the transaction as a write transaction from the very beginning, regardless of whether any writes will actually be performed. This was done to mitigate other deadlocks in T89180 / T93097.
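To make the locking behaviour concrete, here is a minimal, self-contained sketch (plain Python `sqlite3`, not MediaWiki code; all names are made up for illustration). It shows that BEGIN IMMEDIATE takes a write lock on the whole database file up front, so a second connection cannot start its own immediate transaction until the first one ends:

```python
import os
import sqlite3
import tempfile

# Hypothetical demo database; nothing MediaWiki-specific here.
path = os.path.join(tempfile.mkdtemp(), "wiki.sqlite")

a = sqlite3.connect(path, isolation_level=None)
# timeout=0: fail immediately instead of waiting for the busy timeout.
b = sqlite3.connect(path, isolation_level=None, timeout=0)

a.execute("BEGIN IMMEDIATE")  # whole-file write lock, even with no writes
try:
    b.execute("BEGIN IMMEDIATE")  # rejected while the first transaction is open
    locked = False
except sqlite3.OperationalError as e:
    locked = "locked" in str(e)

a.execute("COMMIT")
b.execute("BEGIN IMMEDIATE")  # succeeds once the first transaction ends
b.execute("COMMIT")
print(locked)  # True
```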
The problem
The above works just fine if you're not making concurrent requests. However, if you do make concurrent requests, you will occasionally see deadlocks. The reason is that some requests query the local wiki database first, while others query the centralauth database first. That alone wouldn't be a problem, but it becomes one as soon as each request tries to query the "other" database (centralauth and local, respectively) while the other request is still running: at that point, each request is waiting for a lock the other one holds, and we have a deadlock.
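The lock-ordering problem can be reproduced outside MediaWiki entirely. Here is a hedged sketch (plain Python `sqlite3`, hypothetical file names) where two "requests" lock two separate databases in opposite orders; since SQLite cannot detect deadlocks across database files, both sides simply wait until their busy timeout expires:

```python
import os
import sqlite3
import tempfile
import threading

d = tempfile.mkdtemp()
local = os.path.join(d, "wiki.sqlite")          # stand-in for the local wiki DB
central = os.path.join(d, "centralauth.sqlite")  # stand-in for centralauth

barrier = threading.Barrier(2)  # make sure both sides hold one lock first
errors = []

def request(first, second):
    # Short busy timeout so the demo fails fast instead of hanging.
    c1 = sqlite3.connect(first, isolation_level=None, timeout=0.2)
    c2 = sqlite3.connect(second, isolation_level=None, timeout=0.2)
    c1.execute("BEGIN IMMEDIATE")  # lock the first database
    barrier.wait()                 # both requests now hold one lock each
    try:
        c2.execute("BEGIN IMMEDIATE")  # the other request holds this lock
    except sqlite3.OperationalError as e:
        errors.append(str(e))
    c1.execute("COMMIT")

t1 = threading.Thread(target=request, args=(local, central))
t2 = threading.Thread(target=request, args=(central, local))
t1.start(); t2.start(); t1.join(); t2.join()
print(errors)  # both requests time out with "database is locked"
```

In a real MediaWiki setup the busy timeout is much longer, which is why the hang described below lasts 60 seconds before failing.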
Note that this isn't even specific to CentralAuth. However, CentralAuth is perhaps the best example of a separate, shared database that can cause this.
Test case
Reproducing this theoretical bug in the real world is doable, but not trivial. For starters, it requires that the two requests be sent at just the right time. And then, they need to access the two databases in reverse order. You can find more details in T363284#10366655, but I found that it can easily be reproduced by turning on the debug toolbar and disabling the cache. So, make sure you have the following in your LocalSettings.php:
```php
$wgMainCacheType = CACHE_NONE;
$wgMessageCacheType = CACHE_NONE;
$wgParserCacheType = CACHE_NONE;
$wgMainStash = CACHE_NONE;
$wgResourceLoaderMaxage = [ 'versioned' => 0, 'unversioned' => 0 ];
$wgCachePages = false;
ObjectCacheFactory::$localServerCacheClass = EmptyBagOStuff::class;
$wgDebugToolbar = true;
```
This should be enough to reproduce the issue with concurrent API requests. Again, this is just an example of something that reproduces the bug for me. Your mileage may vary, and the underlying problem goes beyond specific triggers.
We will use selenium in MW core to reproduce this bug more consistently. Note that this is only done for convenience; there's nothing selenium-specific here. Create two test files, first.js and second.js, with the following contents:
```js
'use strict';

const Api = require( 'wdio-mediawiki/Api' );

describe( 'First', () => {
	it( 'makes an API request', async () => {
		await Api.bot();
	} );
} );
```
```js
'use strict';

const Api = require( 'wdio-mediawiki/Api' );
const BlankPage = require( 'wdio-mediawiki/BlankPage' );

describe( 'Second', () => {
	it( 'also makes an API request', async () => {
		await Api.bot();
		await BlankPage.open();
	} );
} );
```
The BlankPage.open() call in the second test shouldn't be necessary in theory; yet, in my local environment, it makes the issue happen much more consistently.
Now we run these tests concurrently until they fail. It might take a while. For me, with the BlankPage line, it sometimes takes 10+ attempts to get a deadlock, and sometimes it's on the first try. Again, YMMV.
```
for i in $(seq 1 100); do
  echo "Attempt #$i"
  DISPLAY= npm run selenium-test -- --specFileRetries 0 --maxInstances 2 --spec first.js --spec second.js || break
done
```
Eventually, the test will hang for 60 seconds and fail with a timeout. If you reduce SQLite's busy timeout as described in T363284#10366655, it will instead fail with:
```
[0-1] Error in "Second.also makes an API request"
Error: invalidjson: No valid JSON response:
    at /home/emanuele/Git/core/node_modules/mwbot/src/index.js:254:31
    at tryCatcher (/home/emanuele/Git/core/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/home/emanuele/Git/core/node_modules/bluebird/js/release/promise.js:547:31)
    at Promise._settlePromise (/home/emanuele/Git/core/node_modules/bluebird/js/release/promise.js:604:18)
    at Promise._settlePromise0 (/home/emanuele/Git/core/node_modules/bluebird/js/release/promise.js:649:10)
    at Promise._settlePromises (/home/emanuele/Git/core/node_modules/bluebird/js/release/promise.js:729:18)
    at _drainQueueStep (/home/emanuele/Git/core/node_modules/bluebird/js/release/async.js:93:12)
    at _drainQueue (/home/emanuele/Git/core/node_modules/bluebird/js/release/async.js:86:9)
    at Async._drainQueues (/home/emanuele/Git/core/node_modules/bluebird/js/release/async.js:102:5)
    at Async.drainQueues [as _onImmediate] (/home/emanuele/Git/core/node_modules/bluebird/js/release/async.js:15:14)
    at process.processImmediate (node:internal/timers:483:21)
```
Solution
I'm not sure about the best fix, but I think this should be prevented at the rdbms library level. One option would be to enforce a consistent lock-acquisition order across databases when using SQLite, so that two requests can never end up holding locks in opposite orders.
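To illustrate what a consistent acquisition order buys us, here is a hedged sketch (plain Python `sqlite3`; the `begin_all` helper is made up for illustration and does not exist in the rdbms library). Both "requests" ask for the two databases in different orders, but the helper normalizes the order before locking, so one request simply waits for the other instead of deadlocking:

```python
import os
import sqlite3
import tempfile
import threading

def begin_all(paths, timeout=5):
    """Open an immediate transaction on every database, in a canonical order."""
    conns = []
    for p in sorted(paths):  # the fixed, global order is the whole trick
        c = sqlite3.connect(p, isolation_level=None, timeout=timeout)
        c.execute("BEGIN IMMEDIATE")
        conns.append(c)
    return conns

d = tempfile.mkdtemp()
local = os.path.join(d, "wiki.sqlite")
central = os.path.join(d, "centralauth.sqlite")
errors = []

def request(paths):
    try:
        conns = begin_all(paths)
        for c in conns:
            c.execute("COMMIT")
    except sqlite3.OperationalError as e:
        errors.append(str(e))

# Opposite requested orders, same acquisition order after sorting.
t1 = threading.Thread(target=request, args=([local, central],))
t2 = threading.Thread(target=request, args=([central, local],))
t1.start(); t2.start(); t1.join(); t2.join()
print(errors)  # [] -- no lock timeouts
```

Whether sorting by database name is the right canonical order for the rdbms library is an open question; the point is only that any fixed, global order makes opposite-order lock acquisition impossible.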