Page MenuHomePhabricator

Foreign query for metawiki fails with "Table 'centralauth.page' doesn't exist" (DBConnRef mixup?)
Closed, ResolvedPublicPRODUCTION ERROR

Description

Spotted in the logs. It seems like getCentralTouched() gets a handle for the wrong database (e.g. selectDB). The method itself looks fine, so I'm not sure how that is possible. That should never happen.

A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: SELECT  page_touched,pp_propname  FROM `page` LEFT JOIN `page_props` ON ((page_id=pp_page) AND pp_propname = 'noglobal')   WHERE page_namespace = '2' AND page_title = 'Bujo'  LIMIT 1  
Function: MediaWiki\GlobalUserPage\GlobalUserPage::getCentralTouched
Error: 1146 Table 'centralauth.page' doesn't exist (10.64.0.91)

#0 /srv/mediawiki/php-1.31.0-wmf.30/includes/libs/rdbms/database/Database.php(1422): Wikimedia\Rdbms\Database->makeQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.31.0-wmf.30/includes/libs/rdbms/database/Database.php(1195): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.31.0-wmf.30/includes/libs/rdbms/database/Database.php(1648): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.31.0-wmf.30/includes/libs/rdbms/database/Database.php(1725): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#4 [internal function]: Wikimedia\Rdbms\Database->selectRow(array, array, array, string, array, array)
#5 /srv/mediawiki/php-1.31.0-wmf.30/includes/libs/rdbms/database/DBConnRef.php(49): call_user_func_array(array, array)
#6 /srv/mediawiki/php-1.31.0-wmf.30/includes/libs/rdbms/database/DBConnRef.php(268): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#7 /srv/mediawiki/php-1.31.0-wmf.30/extensions/GlobalUserPage/includes/GlobalUserPage.php(212): Wikimedia\Rdbms\DBConnRef->selectRow(array, array, array, string, array, array)
#8 /srv/mediawiki/php-1.31.0-wmf.30/extensions/GlobalUserPage/includes/GlobalUserPage.php(174): MediaWiki\GlobalUserPage\GlobalUserPage::getCentralTouched(User)
#9 /srv/mediawiki/php-1.31.0-wmf.30/extensions/GlobalUserPage/includes/Hooks.php(57): MediaWiki\GlobalUserPage\GlobalUserPage::shouldDisplayGlobalPage(Title)
#10 /srv/mediawiki/php-1.31.0-wmf.30/includes/Hooks.php(177): MediaWiki\GlobalUserPage\Hooks::onTitleIsAlwaysKnown(Title, NULL)
#11 /srv/mediawiki/php-1.31.0-wmf.30/includes/Hooks.php(205): Hooks::callHook(string, array, array, NULL)
#12 /srv/mediawiki/php-1.31.0-wmf.30/includes/Title.php(4589): Hooks::run(string, array)
#13 /srv/mediawiki/php-1.31.0-wmf.30/includes/Title.php(4630): Title->isAlwaysKnown()
#14 /srv/mediawiki/php-1.31.0-wmf.30/includes/skins/Skin.php(1200): Title->isKnown()
#15 /srv/mediawiki/php-1.31.0-wmf.30/includes/skins/SkinTemplate.php(220): Skin::makeUrlDetails(string)
#16 /srv/mediawiki/php-1.31.0-wmf.30/includes/skins/SkinTemplate.php(303): SkinTemplate->setupTemplateForOutput()
#17 /srv/mediawiki/php-1.31.0-wmf.30/includes/skins/SkinTemplate.php(249): SkinTemplate->prepareQuickTemplate()
#18 /srv/mediawiki/php-1.31.0-wmf.30/includes/OutputPage.php(2388): SkinTemplate->outputPage()
#19 /srv/mediawiki/php-1.31.0-wmf.30/includes/MediaWiki.php(868): OutputPage->output(boolean)
#20 /srv/mediawiki/php-1.31.0-wmf.30/includes/libs/rdbms/lbfactory/LBFactory.php(496): Closure$MediaWiki::main()
#21 /srv/mediawiki/php-1.31.0-wmf.30/includes/libs/rdbms/lbfactory/LBFactory.php(153): Wikimedia\Rdbms\LBFactory->shutdownChronologyProtector(Wikimedia\Rdbms\ChronologyProtector, Closure$MediaWiki::main;1350, string, integer)
#22 /srv/mediawiki/php-1.31.0-wmf.30/includes/MediaWiki.php(634): Wikimedia\Rdbms\LBFactory->shutdown(integer, Closure$MediaWiki::main;1350, integer)
#23 /srv/mediawiki/php-1.31.0-wmf.30/includes/MediaWiki.php(575): MediaWiki::preOutputCommit(RequestContext, Closure$MediaWiki::main;1350)
#24 /srv/mediawiki/php-1.31.0-wmf.30/includes/MediaWiki.php(877): MediaWiki->doPreOutputCommit(Closure$MediaWiki::main;1350)
#25 /srv/mediawiki/php-1.31.0-wmf.30/includes/MediaWiki.php(524): MediaWiki->main()
#26 /srv/mediawiki/php-1.31.0-wmf.30/index.php(42): MediaWiki->run()
#27 /srv/mediawiki/w/index.php(3): include(string)

Details

SubjectRepoBranchLines +/-
mediawiki/coremaster+21 -22
mediawiki/extensions/OAuthRateLimitermaster+27 -14
mediawiki/extensions/WikimediaMaintenancemaster+1 -1
mediawiki/coreREL1_37+1 -1
mediawiki/coreREL1_38+1 -1
mediawiki/coremaster+1 -1
mediawiki/coremaster+3 -13
mediawiki/coremaster+2 -2
mediawiki/coremaster+11 -11
mediawiki/extensions/BounceHandlermaster+1 -2
mediawiki/extensions/Newslettermaster+16 -44
mediawiki/coremaster+661 -334
mediawiki/libs/RequestTimeoutmaster+165 -20
mediawiki/coreREL1_31+140 -4
mediawiki/corewmf/1.32.0-wmf.26+150 -4
mediawiki/coreREL1_32+150 -4
mediawiki/coremaster+150 -4
mediawiki/coremaster+18 -1
mediawiki/coremaster+342 -198
mediawiki/coremaster+2 -1
Show related patches Customize query in gerrit

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 667751 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/libs/RequestTimeout@master] Allow emergency timeouts to be overridden when creating the critical section

https://gerrit.wikimedia.org/r/667751

Change 667751 merged by jenkins-bot:
[mediawiki/libs/RequestTimeout@master] Allow emergency timeouts to be overridden when creating the critical section

https://gerrit.wikimedia.org/r/667751

Change 644636 merged by jenkins-bot:

[mediawiki/core@master] rdbms: detect corrupt Database instances due to critical section failure

https://gerrit.wikimedia.org/r/644636

Change 644636 merged by jenkins-bot:

[mediawiki/core@master] rdbms: detect corrupt Database instances due to critical section failure

https://gerrit.wikimedia.org/r/644636

Very rough analysis so far: Seemingly no difference. Is this expected?

This patch first went out in this weeks of 1.37.0-wmf.1. It does not appear to have solved or significantly changed the rate of Error: 1146 so far. With this week's wmf branch version having experience roughly the same number of these errors as the three wmf branches before it (quarter of the MW version pie chart).

https://logstash.wikimedia.org/goto/dccb5069d5f7b0efb3043a8dffe9d73e

Screenshot 2021-04-18 at 23.24.32.png (1×2 px, 310 KB)

The new edge case we're able to detect with this code seems to be hit:

[{reqId}] {exception_url} Wikimedia\Rdbms\DBUnexpectedError: Cannot execute Wikimedia\Rdbms\Database::runOnTransactionIdleCallbacks critical section while session state is out of sync. A critical section from Wikimedia\Rdbms\Database::executeQuery has

Couple of hits every few hours. Not much, but worth looking into perhaps:

https://logstash.wikimedia.org/goto/8f33a466461010cfd4cbd3d06df76063

It seems the mitigation of using a "critical section" did not resolve this issue. I vaguely recall that the "Clinic Duty: Inbox" column should not be used, so I've moved it to the main Inbox as reminder that this needs another round of analysis.

Understood that this needs to be solved. Putting back into Clinic Duty (Later) and leaving at High priority.

Change 519305 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: improve LoadBalancer connection pool reuse

https://gerrit.wikimedia.org/r/519305

Krinkle lowered the priority of this task from High to Medium.

I suspect it was to do with combining:

  • getConnection()
  • onTransactionCommitOrIdle() and friends
  • reuseConnection()

When using DBConnRef, this isn't an issue since the callback maintains a reference (delaying the automatic reuseConnection() call).

In shell.php with mysql:

$lbf = MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
$lb = $lbf->getMainLB( 'my_wiki' );
$c1 = $lb->getConnection( DB_PRIMARY, [], 'remote1_wiki' );
$lbf->beginPrimaryChanges( 'test' );
$c1->onTransactionCommitOrIdle( function () use ( $c1 ) { var_dump( $c1->getDomainID() ); } );
$c1->getDomainID(); # remote1_wiki
$lb->reuseConnection( $c1 );
$c2 = $lb->getConnection( DB_PRIMARY, [], 'remote2_wiki' );
$c2->getDomainID(); # remote2_wiki
$c1->getDomainID(); # remote2_wiki
$lbf->commitPrimaryChanges( 'test' ); # should be remote1_wiki

I get:

>>> 
>>> $lbf = MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
=> Wikimedia\Rdbms\LBFactoryMulti {#230}
>>> $lb = $lbf->getMainLB( 'my_wiki' );
=> Wikimedia\Rdbms\LoadBalancer {#252}
>>> $c1 = $lb->getConnection( DB_PRIMARY, [], 'remote1_wiki' );
=> Wikimedia\Rdbms\DatabaseMysqli {#2826}
>>> $lbf->beginPrimaryChanges( 'test' );
=> null
>>> $c1->onTransactionCommitOrIdle( function () use ( $c1 ) { var_dump( $c1->getDomainID() ); } );
=> null
>>> $c1->getDomainID(); # remote1_wiki
>>> $lb->reuseConnection( $c1 );
=> null
>>> $c2 = $lb->getConnection( DB_PRIMARY, [], 'remote2_wiki' );
=> Wikimedia\Rdbms\DatabaseMysqli {#2826}
>>> $c2->getDomainID(); # remote2_wiki
>>> $c1->getDomainID(); # remote2_wiki
>>> $lbf->commitPrimaryChanges( 'test' ); # should be remote1_wiki
string(12) "remote2_wiki"

@aaron If understand correctly, this scenario requires the failing code itself to be scheduled from onTransactionCommitOrIdle. The stack traces reported so far, do not involve onTransactionCommitOrIdle from what I can tell. They also all make use of DBConnRef, but maybe you see a way that code using DBConnRef could fail when something else used a connection without that earlier in the process?

From T293867:

normalized_message
[{reqId}] {exception_url}   UnexpectedValueException: Got connection to 'metawiki', but expected 'centralauth'
exception.trace
from /srv/mediawiki/php-1.38.0-wmf.4/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1316)
#0 /srv/mediawiki/php-1.38.0-wmf.4/includes/libs/rdbms/loadbalancer/LoadBalancer.php(987): Wikimedia\Rdbms\LoadBalancer->getForeignConnection(integer, string, integer)
#1 /srv/mediawiki/php-1.38.0-wmf.4/includes/libs/rdbms/loadbalancer/LoadBalancer.php(963): Wikimedia\Rdbms\LoadBalancer->getServerConnection(integer, string, integer)
#2 /srv/mediawiki/php-1.38.0-wmf.4/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1096): Wikimedia\Rdbms\LoadBalancer->getConnection(integer, array, string, integer)
#3 /srv/mediawiki/php-1.38.0-wmf.4/extensions/CentralAuth/includes/CentralAuthDatabaseManager.php(110): Wikimedia\Rdbms\LoadBalancer->getConnectionRef(integer, array, string)
#4 /srv/mediawiki/php-1.38.0-wmf.4/extensions/CentralAuth/includes/CentralAuthUtils.php(49): MediaWiki\Extension\CentralAuth\CentralAuthDatabaseManager->getCentralDB(integer)
#5 /srv/mediawiki/php-1.38.0-wmf.4/extensions/CentralAuth/includes/CentralAuthUser.php(555): CentralAuthUtils::getCentralReplicaDB()
#6 /srv/mediawiki/php-1.38.0-wmf.4/includes/libs/objectcache/wancache/WANObjectCache.php(1771): CentralAuthUser->{closure}(array, integer, array, double, array)
#7 /srv/mediawiki/php-1.38.0-wmf.4/includes/libs/objectcache/wancache/WANObjectCache.php(2766): WANObjectCache->fetchOrRegenerate(string, integer, Closure, array, array)
#8 /srv/mediawiki/php-1.38.0-wmf.4/includes/deferred/MWCallableUpdate.php(38): WANObjectCache->{closure}()
#9 /srv/mediawiki/php-1.38.0-wmf.4/includes/deferred/DeferredUpdates.php(515): MWCallableUpdate->doUpdate()
#10 /srv/mediawiki/php-1.38.0-wmf.4/includes/deferred/DeferredUpdates.php(391): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#11 /srv/mediawiki/php-1.38.0-wmf.4/includes/deferred/DeferredUpdates.php(221): DeferredUpdates::run(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, Monolog\Logger, BufferingStatsdDataFactory, string)
#12 /srv/mediawiki/php-1.38.0-wmf.4/includes/deferred/DeferredUpdatesScope.php(267): DeferredUpdates::{closure}(MWCallableUpdate, integer)
#13 /srv/mediawiki/php-1.38.0-wmf.4/includes/deferred/DeferredUpdatesScope.php(196): DeferredUpdatesScope->processStageQueue(integer, integer, Closure)
#14 /srv/mediawiki/php-1.38.0-wmf.4/includes/deferred/DeferredUpdates.php(242): DeferredUpdatesScope->processUpdates(integer, Closure)
#15 /srv/mediawiki/php-1.38.0-wmf.4/includes/MediaWiki.php(1136): DeferredUpdates::doUpdates()
#16 /srv/mediawiki/php-1.38.0-wmf.4/includes/MediaWiki.php(846): MediaWiki->restInPeace()
#17 /srv/mediawiki/php-1.38.0-wmf.4/api.php(125): MediaWiki->doPostOutputShutdown()
#18 /srv/mediawiki/php-1.38.0-wmf.4/api.php(45): wfApiMain()
#19 /srv/mediawiki/w/api.php(3): require(string)
#20 {main}

In the last 30 days, all instances of this have come from the same code path: A post-send deferred update that pre-emptively regenerates the cache from a CentralAuthUser::loadFromCache(), specifically the CentralAuthUtils::getCentralReplicaDB() call (source code).

	protected function loadFromCache() {
		$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
		$data = $cache->getWithSetCallback(
			$this->getCacheKey( $cache ),
			$cache::TTL_DAY,
			function ( $oldValue, &$ttl, array &$setOpts ) {
				$dbr = CentralAuthUtils::getCentralReplicaDB();

The requests are:

  • POST api.php request (47x)
  • GET rest.php request like /w/rest.php/en.wikipedia.org/v3/page/pagebundle/../..., always with v3/page/pagebundle.
  • GET index.php to view a very slow-to-parse user sub page on ukwiki (link)

But the traces are all the same:

Trace
Got connection to 'metawiki', but expected 'centralauth'
	
#0 /srv/mediawiki/php-1.38.0-wmf.17/includes/libs/rdbms/loadbalancer/LoadBalancer.php(987): LoadBalancer->getForeignConnection
#1 /srv/mediawiki/php-1.38.0-wmf.17/includes/libs/rdbms/loadbalancer/LoadBalancer.php(963): LoadBalancer->getServerConnection
#2 /srv/mediawiki/php-1.38.0-wmf.17/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1096): LoadBalancer->getConnection
#3 /srv/mediawiki/php-1.38.0-wmf.17/extensions/CentralAuth/includes/CentralAuthDatabaseManager.php(110): LoadBalancer->getConnectionRef
#4 /srv/mediawiki/php-1.38.0-wmf.17/extensions/CentralAuth/includes/CentralAuthUtils.php(56): CentralAuthDatabaseManager->getCentralDB(integer)
#5 /srv/mediawiki/php-1.38.0-wmf.17/extensions/CentralAuth/includes/User/CentralAuthUser.php(636): CentralAuthUtils::getCentralReplicaDB()
#6 /srv/mediawiki/php-1.38.0-wmf.17/includes/libs/objectcache/wancache/WANObjectCache.php(1689): CentralAuthUser->\{closure}
#7 /srv/mediawiki/php-1.38.0-wmf.17/includes/libs/objectcache/wancache/WANObjectCache.php(2708): WANObjectCache->fetchOrRegenerate
..
#10 /srv/mediawiki/php-1.38.0-wmf.17/includes/deferred/DeferredUpdates.php(392): DeferredUpdates::attemptUpdate
..
#15 /srv/mediawiki/php-1.38.0-wmf.17/includes/MediaWiki.php(1114): DeferredUpdates::doUpdates()

I recall seeing that it looked possible from DeferredUpdates too (similar kind of pattern to onTransaction*), though I didn't post a simplest repro case for that. Here it is:

$lbf = MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
$lb = $lbf->getMainLB( 'my_wiki' );
$c1 = $lb->getConnection( DB_PRIMARY, [], 'remote1_wiki' );
$lbf->beginPrimaryChanges( 'test' );
DeferredUpdates::addCallableUpdate( function () use ( $c1 ) { var_dump( $c1->getDomainID() ); } );
# Should be "remote1_wiki"
$c1->getDomainID();
$lb->reuseConnection( $c1 );
$c2 = $lb->getConnection( DB_PRIMARY, [], 'remote2_wiki' );
# Should be "remote2_wiki"
$c2->getDomainID();
# Should be "remote2_wiki"
$c1->getDomainID();
# Should print "remote1_wiki"
$lbf->commitPrimaryChanges( 'test' );
DeferredUpdates::tryOpportunisticExecute();

Yielding:

>>> 
>>> $lbf = MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
=> Wikimedia\Rdbms\LBFactoryMulti {#251}
>>> $lb = $lbf->getMainLB( 'my_wiki' );
=> Wikimedia\Rdbms\LoadBalancer {#266}
>>> $c1 = $lb->getConnection( DB_PRIMARY, [], 'remote1_wiki' );
=> Wikimedia\Rdbms\DatabaseMysqli {#2938}
>>> $lbf->beginPrimaryChanges( 'test' );
=> null
>>> DeferredUpdates::addCallableUpdate( function () use ( $c1 ) { var_dump( $c1->getDomainID() ); } );
=> null
>>> # Should be "remote1_wiki"
>>> $c1->getDomainID();
=> "remote1_wiki"
>>> $lb->reuseConnection( $c1 );
=> null
>>> $c2 = $lb->getConnection( DB_PRIMARY, [], 'remote2_wiki' );
=> Wikimedia\Rdbms\DatabaseMysqli {#2938}
>>> # Should be "remote2_wiki"
>>> $c2->getDomainID();
=> "remote2_wiki"
>>> # Should be "remote2_wiki"
>>> $c1->getDomainID();
=> "remote2_wiki"
>>> # Should print "remote1_wiki"
>>> $lbf->commitPrimaryChanges( 'test' );
=> null
>>> DeferredUpdates::tryOpportunisticExecute();
string(12) "remote2_wiki"
=> true

Change 755833 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/Newsletter@master] NewsletterDb: Replace manual connection management with DBConnRef

https://gerrit.wikimedia.org/r/755833

Change 755833 merged by jenkins-bot:

[mediawiki/extensions/Newsletter@master] NewsletterDb: Replace manual connection management with DBConnRef

https://gerrit.wikimedia.org/r/755833

@aaron I understand this repro. If I understand correctly, this repro requires three mistakes (or anti-patterns) to be made simultanously: 1) Code is not using getConnectionRef or DBConnRef, but manually getConnection. 2) Code is also manually using reuseConnection, 3) Code calls reuseConnection and then continues to use the same connection via a deferred closure.

The implicated code paths from GlobalUserPage and CentralAuth both use DBConnRef, whereas in your repro the code where the wrong db is used would itself not be using DBConnRef. Though, I suppose if they share a connection pool in LB, then the inverse of your repro could still happen and make the DBConnRef-user fail instead. I think I understand now why you were refactoring the connection pool logic in the rdbms patches two years ago :)

Are you aware of code doing the above in production currently? I agree we should detect the issue at the source so that the error will point us to the cause rather than the victim; but it also sounds like it we'd still have an error to fix. I can imagine it being hard to find, but just wanted to confirm whether you found the repro based on real code or not.

Code that uses getConnection() is expected to use reuseConnection(), the only problem is having the deferred update scheduled in the middle with a direct Database handle be given to it (rather than the deferred update calling openConnection). So it's really just two anti-patterns.

In https://gerrit.wikimedia.org/r/c/mediawiki/core/+/519305 , DBConnRef will re-select the DB as needed.

There are only six files (excluding CLI and tests) in deployed MW code that contain calls to reuseConnection: Codesearch. These seem straight-forward from what I could see, e.g. get a IDatabase handle, not stored in objects or passed to outside code, make a few queries, then call reuseConnection.

Change 756748 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/extensions/BounceHandler@master] Switch to using LoadBalancer::getConnectionRef()

https://gerrit.wikimedia.org/r/756748

Change 756748 merged by jenkins-bot:

[mediawiki/extensions/BounceHandler@master] Switch to using LoadBalancer::getConnectionRef()

https://gerrit.wikimedia.org/r/756748

Change 756898 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] Convert various callers to LoadBalancer::getConnectionRef()

https://gerrit.wikimedia.org/r/756898

Change 756925 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: deprecate non-DBConnRef methods in ConnectionManager

https://gerrit.wikimedia.org/r/756925

Change 756898 merged by jenkins-bot:

[mediawiki/core@master] Convert various callers to LoadBalancer::getConnectionRef()

https://gerrit.wikimedia.org/r/756898

Change 759634 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] Convert various callers to LoadBalancer::getConnectionRef()

https://gerrit.wikimedia.org/r/759634

Change 759642 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] Convert WikiPage callers to LoadBalancer::getConnectionRef()

https://gerrit.wikimedia.org/r/759642

Change 759634 merged by jenkins-bot:

[mediawiki/core@master] Convert various callers to LoadBalancer::getConnectionRef()

https://gerrit.wikimedia.org/r/759634

Change 759642 merged by jenkins-bot:

[mediawiki/core@master] Convert WikiPage callers to LoadBalancer::getConnectionRef()

https://gerrit.wikimedia.org/r/759642

Change 756925 merged by jenkins-bot:

[mediawiki/core@master] rdbms: deprecate non-DBConnRef methods in ConnectionManager

https://gerrit.wikimedia.org/r/756925

Logstash query message:("metawiki" OR "centralauth") on the mediawik-errors dashboard for the past 30 days:

CountMessage
565Wikimedia\Rdbms\DBConnectionError: Cannot access the database: Unknown database 'metawiki'
470UnexpectedValueException: Got connection to 'metawiki', but expected 'centralauth'

The first one, are from /wiki/Special:UserRights on meta.wikimedia.org:

#3 includes/user/UserGroupManager.php(812): LoadBalancer->getConnectionRef(integer)
#4 includes/deferred/MWCallableUpdate.php(38): UserGroupManager->MediaWiki\User\{closure}()
#5 includes/deferred/DeferredUpdates.php(536): MWCallableUpdate->doUpdate()
#6 includes/deferred/DeferredUpdates.php(419): DeferredUpdates::attemptUpdate(MWCallableUpdate, LBFactoryMulti)
#7 includes/deferred/DeferredUpdates.php(229): DeferredUpdates::run(MWCallableUpdate, ..)

The second one is more diverse but they seem to have in common that they are WANObjectCache closures for pre-emptive regeneration.

From T255493:

Change 767769 merged:

[mediawiki/core@master] rdbms: Replace getConnection with getLazyConnectionRef

https://gerrit.wikimedia.org/r/767769

With the above merged and deployed to all wikis, I checked Logstash again. And unfortunately it is still happening.

This confirms my previous comments T193565#7647058, in which I suggested it can't be due to (mis)use of DBConnRef, because the code in question is not using closures to retain db handles, is not using raw/non-ref db handles, and more generall not doing unusual that I could see.

Taking a closer look at a another sample in Logstash:

server_name: meta.wikimedia.org
POST /wiki/Special:UserRights
referer: https://meta.wikimedia.org/wiki/Special:UserRights/ **** @zhwiki

Cannot access the database: Unknown database 'metawiki' (db1122)
trace
…
#4 /srv/mediawiki/php-1.39.0-wmf.12/includes/user/UserGroupManager.php(832): Wikimedia\Rdbms\DBConnRef->timestamp()
#5 /srv/mediawiki/php-1.39.0-wmf.12/includes/deferred/MWCallableUpdate.php(38): …
#6 /srv/mediawiki/php-1.39.0-wmf.12/includes/deferred/DeferredUpdates.php(535): MWCallableUpdate->doUpdate()
…
#12 /srv/mediawiki/php-1.39.0-wmf.12/includes/MediaWiki.php(1123): DeferredUpdates::doUpdates()
#13 /srv/mediawiki/php-1.39.0-wmf.12/includes/MediaWiki.php(848): MediaWiki->restInPeace()
#14 /srv/mediawiki/php-1.39.0-wmf.12/includes/MediaWiki.php(591): MediaWiki->doPostOutputShutdown()
#15 /srv/mediawiki/php-1.39.0-wmf.12/index.php(50): MediaWiki->run()
#16 /srv/mediawiki/php-1.39.0-wmf.12/index.php(46): wfIndexMain()

And looking at the associated code in UserGroupManager.php:

UserGroupManager
public function addUserToGroup(  ) {
  
  $dbr = $this->loadBalancer->getConnectionRef( DB_REPLICA );
   $dbr->timestamp() 

Let's take this at face value first. The error states that metawiki does not exist on the database host. This is an error reported by the mysql client from the db server-side — not out own Rdbms assertion logic. Thus:

  • our own assertion logic is either bypassed or passing fine,
  • we are querying the actual db host,
  • the dbname genuinely doesn't there.

There is no such thing in Rbdms as changing which physical database section (or even host in a section) a db handle is referring to, only the selected dbname within a given dbhost is mutable state for the connection.

So, what options does that leave us with?

  1. misconfiguration - unlikely given low trickle and metawiki generally reachable.
  2. mysql/mariadb server bug - unlikely given only very specific code paths are affected. If there was a mariadb bug that somehow caused an entire database to very rarely be reported as non-existent, then I'd expet a more randomly distributed ocurrance, not just metawiki and not just this code path. Perhaps some connection attributes we use could cause it for this code path, but I'm not considering this possiblity right now, mainly because there's no precedent for mysql having such bug to my knowledge, and because "assume/rule out your own mistakes" first.
  3. bug in our Rdbms library or invalid use of the Rdbms API - this seemed the most likely to me and has been what we have explored on this task for two years, e.g. we're accidentally mixing up a connection object somehow and passing it out. However, we found no evidence of this in the Rbms library, and also did not find any invalid use. E.g. wrongly calling selectDomain() or retaining a dbhandle across a closure or some such.
  4. Actual and deterministic mistake in what database we are selecting in the affected feature (e.g. UserGroupManager.php).

Number 4 seems unlikely gvien the low trickle of incidents. If it was deterministic and the feature simply broken, we'd see it more often. Right?

Well, turns out not. Let's take another look at UserGroupManager.php with more context:

UserGroupManager.php
class UserGroupManager {
	public function __construct() {
		$this->loadBalancer = $loadBalancerFactory->getMainLB( $dbDomain );
		
	}

	public function addUserToGroup() {
		
		$dbw = $this->loadBalancer->getConnectionRef( DB_PRIMARY, [], $this->dbDomain );
		$dbw->insert();

		DeferredUpdates 
			$dbr = $this->loadBalancer->getConnectionRef( DB_REPLICA );
			$dbr->timestamp();
	}

When we're acting from Meta-Wiki on a foreign wiki, such in the above Logstash sample – which acts as zhwiki, this LoadBalancer instance will be for that database section. As all LB instances, the default domain is based on service wiring and site configuration, and so without passing it, this means $dbr will always be for metawiki, which will never exist on this LoadBalancer object, unless the target dbdomain happens to be on the same section (e.g. when operating on the local wiki, or cross-wiki to another s7-hosted wiki).

And yes, that means this code is nearly deterministically broken, and the low trickle of events is merely a reflection of global user right changes being an important but not continously used feature. We see this automatically reported about 1-5x a day in Logstash, which is nearly all cross-wiki user rights changes.

So why haven't we received reports about this feature being broken? Well, the $dbrobject in question is used soley within a post-send deferred update which means the critical portion of the userrights change does succeed first, but this update indeed is broken.

And the data supports that, e.g. channel:DeferredUpdates AND exception.trace:UserGroupManager:

From "mediawiki" dashboard in Logstash:
  • Deferred update 'MWCallableUpdate_MediaWiki\User\UserGroupManager->addUserToGroup' failed to run.

The responsibility of this deferred update, is to to queue a UserGroupExpiryJob job. So the concrete impact is that the push for this job is not reached, and thus does not happen. At first I figured this doens't matter much because there are plenty of non-cross wiki user right changes to make sure this maintenance job still runs often enough. The job itself is stateless after all.

But.. the problem gets worse. While the job is stateless, jobs are specific to a wiki. Given that for small wikis most if not all user right changes happen via Meta-Wiki, this probably means they never get cleaned up.

At least the, currently unreachable, code for pushing the job is written correctly. That is, it is injected a JobQueueGroup for the destination wiki (not the current wiki per-se), and would indeed correctly be submitting a job to run there — if the DB query didn't cause a fatal error first.

Change 793508 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] UserGroupManager: Fix dbDomain in addUserToGroup() deferred update

https://gerrit.wikimedia.org/r/793508

Change 793508 merged by jenkins-bot:

[mediawiki/core@master] UserGroupManager: Fix dbDomain in addUserToGroup() deferred update

https://gerrit.wikimedia.org/r/793508

Change 793164 had a related patch set uploaded (by Zabe; author: Krinkle):

[mediawiki/core@REL1_38] UserGroupManager: Fix dbDomain in addUserToGroup() deferred update

https://gerrit.wikimedia.org/r/793164

Change 793165 had a related patch set uploaded (by Zabe; author: Krinkle):

[mediawiki/core@REL1_37] UserGroupManager: Fix dbDomain in addUserToGroup() deferred update

https://gerrit.wikimedia.org/r/793165

Change 793164 merged by jenkins-bot:

[mediawiki/core@REL1_38] UserGroupManager: Fix dbDomain in addUserToGroup() deferred update

https://gerrit.wikimedia.org/r/793164

Change 793165 merged by jenkins-bot:

[mediawiki/core@REL1_37] UserGroupManager: Fix dbDomain in addUserToGroup() deferred update

https://gerrit.wikimedia.org/r/793165

Thank you so much for the through investigation @Krinkle!
Would it make sense to deprecate taking domain in LB or getConnection (but not both obviously)? This would simplify our rdbms code a lot and makes it easier to stop such issues.

There are still a few left... I guess this has now turned into a tracking task but I think this is the last one, given the samples remaining are small enough in number over the last 7 days that I checked each's strack trace and they're the same. Looks like this one is actually an Rdbms bug (again).

From the Logstash "mediawiki" dashboard.

[{reqId}] {exception_url} UnexpectedValueException: Got connection to 'metawiki', but expected 'centralauth' 

from /srv/mediawiki/php-1.39.0-wmf.14/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1269)
#0 /srv/mediawiki/php-1.39.0-wmf.14/includes/libs/rdbms/loadbalancer/LoadBalancer.php(954): LoadBalancer->getForeignConnection()
…
#7 /srv/mediawiki/php-1.39.0-wmf.14/includes/libs/objectcache/wancache/WANObjectCache.php(2727): WANObjectCache->fetchOrRegenerate()
#8 /srv/mediawiki/php-1.39.0-wmf.14/includes/deferred/MWCallableUpdate.php(38): WANObjectCache->{closure}()
…
#17 /srv/mediawiki/php-1.39.0-wmf.14/api.php(125): MediaWiki->doPostOutputShutdown()
#18 /srv/mediawiki/php-1.39.0-wmf.14/api.php(45): wfApiMain()

Then searching by any of these identical traces by reqId, shows the same request previous encountered another exception such as this one:

from /srv/mediawiki/php-1.39.0-wmf.14/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(59)
…
#4 /srv/mediawiki/php-1.39.0-wmf.14/includes/libs/rdbms/database/Database.php(2772): Database->completeCriticalSection()
#5 /srv/mediawiki/php-1.39.0-wmf.14/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1226): Database->selectDomain()
#6 /srv/mediawiki/php-1.39.0-wmf.14/includes/libs/rdbms/loadbalancer/LoadBalancer.php(954): LoadBalancer->getForeignConnection()
…
#10 /srv/mediawiki/php-1.39.0-wmf.14/extensions/GlobalUserPage/includes/GlobalUserPage.php(234): DBConnRef->selectRow()
…
#13 /srv/mediawiki/php-1.39.0-wmf.14/includes/HookContainer/HookContainer.php(338): MediaWiki\GlobalUserPage\Hooks::onTitleIsAlwaysKnown(Title, NULL)
…
#20 /srv/mediawiki/php-1.39.0-wmf.14/includes/parser/Parser.php(723): Parser->internalParse(string)
…
#38 /srv/mediawiki/php-1.39.0-wmf.14/includes/api/ApiEditPage.php(516): EditPage->attemptSave(NULL)
…
#43 /srv/mediawiki/php-1.39.0-wmf.14/api.php(45): wfApiMain()
#45 {main}

As I understand it, the critical section applied to the Rdbms/Database class last year was basically insuffient.

The critical section as it stands ensures (in theory) that the state of the Database object remains consistent - when we send a selectDomain query we also get to update our internal state reflecting that context change.

But, it does not wrap around the state change in the LoadBalancer class, which in most cases is the initiator of the Database-selectDomain call via LoadBalancer-getForeignConnection. This leaves the load balancer in a bad state, and thus makes it able to then issue the unused connection as if it was already connected to a certain domain.

I previously scoped out the refactor of T226595 from this task, focussing on the imediate issue first. Now that we know the critical section was insufficient, we need to either expand it to cover this as well, or to get rid of this state.

Aaron's refactor of T226595 is, while fairly large, fairly minimal within the goal of removing the need for this state. I'll take as my assignment for this task to review the code proposed for T226595.

Change 813337 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/extensions/OAuthRateLimiter@master] Get db connection to correct domain for central wiki

https://gerrit.wikimedia.org/r/813337

Change 813341 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/extensions/WikimediaMaintenance@master] deleteWiki: Get db connection to correct domain

https://gerrit.wikimedia.org/r/813341

Change 813341 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMaintenance@master] deleteWiki: Get db connection to correct domain

https://gerrit.wikimedia.org/r/813341

Change 813337 merged by jenkins-bot:

[mediawiki/extensions/OAuthRateLimiter@master] Get db connection to correct domain for central wiki

https://gerrit.wikimedia.org/r/813337

Mentioned in SAL (#wikimedia-releng) [2022-07-27T03:48:09Z] <Krinkle> Click "Disable publishing" for a dozen repos created recently, including OAuthRateLimiter, ref T143162, T193565