Page MenuHomePhabricator

Wikimedia\Rdbms\DBUnexpectedError: Cannot execute Wikimedia\Rdbms\Database::selectDomain critical section while session state is out of sync.
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   Wikimedia\Rdbms\DBUnexpectedError: Cannot execute Wikimedia\Rdbms\Database::selectDomain critical section while session state is out of sync.

A critical section from Wikimedia\Rdbms\Database::executeQuery has failed
#0 /srv/me
exception.trace
from /srv/mediawiki/php-1.40.0-wmf.19/includes/libs/rdbms/database/Database.php(3390)
#0 /srv/mediawiki/php-1.40.0-wmf.19/includes/libs/rdbms/database/Database.php(1779): Wikimedia\Rdbms\Database->commenceCriticalSection(string)
#1 /srv/mediawiki/php-1.40.0-wmf.19/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1077): Wikimedia\Rdbms\Database->selectDomain(Wikimedia\Rdbms\DatabaseDomain)
#2 /srv/mediawiki/php-1.40.0-wmf.19/includes/libs/rdbms/loadbalancer/LoadBalancer.php(935): Wikimedia\Rdbms\LoadBalancer->reuseOrOpenConnectionForNewRef(integer, Wikimedia\Rdbms\DatabaseDomain, integer)
#3 /srv/mediawiki/php-1.40.0-wmf.19/includes/libs/rdbms/loadbalancer/LoadBalancer.php(912): Wikimedia\Rdbms\LoadBalancer->getServerConnection(integer, string, integer)
#4 /srv/mediawiki/php-1.40.0-wmf.19/includes/libs/rdbms/database/DBConnRef.php(103): Wikimedia\Rdbms\LoadBalancer->getConnectionInternal(integer, array, string, integer)
#5 /srv/mediawiki/php-1.40.0-wmf.19/includes/libs/rdbms/database/DBConnRef.php(117): Wikimedia\Rdbms\DBConnRef->ensureConnection()
#6 /srv/mediawiki/php-1.40.0-wmf.19/includes/libs/rdbms/database/DBConnRef.php(698): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#7 /srv/mediawiki/php-1.40.0-wmf.19/includes/Revision/RevisionStore.php(2699): Wikimedia\Rdbms\DBConnRef->timestamp(string)
#8 /srv/mediawiki/php-1.40.0-wmf.19/includes/Revision/RevisionStore.php(2732): MediaWiki\Revision\RevisionStore->getRelativeRevision(MediaWiki\Revision\RevisionStoreRecord, integer, string)
#9 /srv/mediawiki/php-1.40.0-wmf.19/extensions/Echo/includes/DiscussionParser.php(34): MediaWiki\Revision\RevisionStore->getPreviousRevision(MediaWiki\Revision\RevisionStoreRecord)
#10 /srv/mediawiki/php-1.40.0-wmf.19/extensions/Echo/includes/Hooks.php(659): EchoDiscussionParser::generateEventsForRevision(MediaWiki\Revision\RevisionStoreRecord, boolean)
#11 /srv/mediawiki/php-1.40.0-wmf.19/includes/deferred/MWCallableUpdate.php(38): MediaWiki\Extension\Notifications\Hooks::MediaWiki\Extension\Notifications\{closure}()
#12 /srv/mediawiki/php-1.40.0-wmf.19/includes/deferred/DeferredUpdates.php(473): MWCallableUpdate->doUpdate()
#13 /srv/mediawiki/php-1.40.0-wmf.19/includes/deferred/DeferredUpdates.php(398): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#14 /srv/mediawiki/php-1.40.0-wmf.19/includes/deferred/DeferredUpdates.php(213): DeferredUpdates::run(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, Monolog\Logger, BufferingStatsdDataFactory, MediaWiki\JobQueue\JobQueueGroupFactory, string)
#15 /srv/mediawiki/php-1.40.0-wmf.19/includes/deferred/DeferredUpdatesScope.php(267): DeferredUpdates::{closure}(MWCallableUpdate, integer)
#16 /srv/mediawiki/php-1.40.0-wmf.19/includes/deferred/DeferredUpdatesScope.php(196): DeferredUpdatesScope->processStageQueue(integer, integer, Closure)
#17 /srv/mediawiki/php-1.40.0-wmf.19/includes/deferred/DeferredUpdates.php(234): DeferredUpdatesScope->processUpdates(integer, Closure)
#18 /srv/mediawiki/php-1.40.0-wmf.19/includes/MediaWiki.php(1115): DeferredUpdates::doUpdates()
#19 /srv/mediawiki/php-1.40.0-wmf.19/includes/MediaWiki.php(846): MediaWiki->restInPeace()
#20 /srv/mediawiki/php-1.40.0-wmf.19/api.php(125): MediaWiki->doPostOutputShutdown()
#21 /srv/mediawiki/php-1.40.0-wmf.19/api.php(45): wfApiMain()
#22 /srv/mediawiki/w/api.php(3): require(string)
#23 {main}
Impact

Unclear.

Notes

~160 of these in the last 2 weeks, seems to have increased somewhat in 1.40.0-wmf.19, although that may be coincidental.

Details

Request URL
https://en.wikipedia.org/w/api.php

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
brennen renamed this task from Wikimedia\Rdbms\DBUnexpectedError: Cannot execute Wikimedia\Rdbms\Database::selectDomain critical section while session state is out of sync.A critical section from Wikimedia\Rdbms\Database::executeQuery has failed to Wikimedia\Rdbms\DBUnexpectedError: Cannot execute Wikimedia\Rdbms\Database::selectDomain critical section while session state is out of sync..Jan 26 2023, 4:42 PM
Krinkle triaged this task as High priority.Jan 30 2023, 7:44 PM

I don't see a problem with selectDomain() itself. It looks like a query in EchoUserLocator::locateArticleCreator was really slow and triggered an excimer timeout.

This happened within DeferredUpdates, which tries to catch errors and move on to other updates, which included more database updates. TimeoutExceptions should probably make DeferredUpdates abort the remaining updates so things can shutdown. The main point of the excimer timeouts is to limit run time, give a chance to log, and maybe give a clean error to the client. Catching/ignoring them undermines some of that and runs the risk of hitting uglier fpm/apache timeouts.

I don't see a problem with selectDomain() itself. It looks like a query in EchoUserLocator::locateArticleCreator was really slow and triggered an excimer timeout.

This happened within DeferredUpdates, which tries to catch errors and move on to other updates, […]. The main point of the excimer timeouts is to limit run time, give a chance to log, and maybe give a clean error to the client. Catching/ignoring them undermines some of that and runs the risk of hitting uglier fpm/apache timeouts.

I think if the main response portion, pre-send update, or earlier post-send update times out, we should still let try to let other updates run, especially ones that might run hooks and/or queue jobs. However, I could be convinced otherwise if that's the only benefit left.

As it stands though, I lean toward not changing it because it seems orthogonal to the observed problem.

  • Even if we stop forego most updates after a timeout in the main portion, there may still be a trivial Rdbms query triggered by the skin or as part of processing a late hook or Job creation that needs some basic data for its parameters. This data is typically cachable in Memcached, but can just as well cache-miss.
  • The problem here seems first and foremost that the Rdbms object is corrupting its state and thus not fulfilling its contract. As part of a related production error task we developed (not insignificant) complexity in the form of RequestTimeout and critical sections, to prevent this. Do we want to make that actually work or should we consider removing that if it isn't working? If remove it, how do we mitigate the production error?

The advantage of excimer timers is that they can happen at anytime in PHP code. It's tricky to avoid corrupting the state of objects though. We really should just be cleaning up and shutting down. If we want to handle it more gracefully in Database, one could:

  • Rely on the fact that Zend C wrapper methods do not get preempted by excimer. DatabaseMysli::doSingleStatementQuery could catch the error, stage it, and have it rethrown (this gets more complicated with nested critical sections and/or when a section include several queries). It seems too messy.
  • DeferredUpdates could use some LBFactory method to discard all DB connections when a timeout exceptions are caught.
  • MWExceptionHandler::rollbackPrimaryChanges could discard such DB connections.

It could also help to:

  • set max_statement_time on the session level (different for web vs job vs cli).
  • set MYSQLI_OPT_READ_TIMEOUT for regular DB connections (different for web vs job vs cli). This would match the idea in CommonSettings.php for $wgHTTPMaxTimeout /$wgHTTPMaxConnectTimeout and $wgRequestTimeLimit;

MWExceptionHandler::rollbackPrimaryChanges could discard such DB connections.

Per todays triage meeting, let's go with this one.

This turns out to be a separate issue.

Change 906451 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: make Database::flushSession() handle critical section errors

https://gerrit.wikimedia.org/r/906451

Change 906451 merged by jenkins-bot:

[mediawiki/core@master] rdbms: make Database::flushSession() handle critical section errors

https://gerrit.wikimedia.org/r/906451