Page MenuHomePhabricator

DBSessionStateError: Cannot execute query from MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock while session status is ERROR
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   Wikimedia\Rdbms\DBSessionStateError: Cannot execute query from MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock while session status is ERROR
exception.trace
from /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/database/TransactionManager.php(224)
#0 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/database/Database.php(1532): Wikimedia\Rdbms\TransactionManager->assertSessionStatus(Wikimedia\Rdbms\DatabaseMysqli, string)
#1 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/database/Database.php(1205): Wikimedia\Rdbms\Database->assertQueryIsCurrentlyAllowed(string, string)
#2 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/database/DatabaseMysqlBase.php(1076): Wikimedia\Rdbms\Database->query(string, string, integer)
#3 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/database/Database.php(5376): Wikimedia\Rdbms\DatabaseMysqlBase->doUnlock(string, string)
#4 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/database/Database.php(5421): Wikimedia\Rdbms\Database->unlock(string, string)
#5 [internal function]: Wikimedia\Rdbms\Database->Wikimedia\Rdbms\{closure}()
#6 /srv/mediawiki/php-1.39.0-wmf.9/vendor/wikimedia/scoped-callback/src/ScopedCallback.php(96): call_user_func_array(Closure, array)
#7 /srv/mediawiki/php-1.39.0-wmf.9/includes/jobqueue/jobs/RefreshLinksJob.php(230): Wikimedia\ScopedCallback->__destruct()
#8 /srv/mediawiki/php-1.39.0-wmf.9/includes/jobqueue/jobs/RefreshLinksJob.php(163): RefreshLinksJob->runForTitle(Title)
#9 /srv/mediawiki/php-1.39.0-wmf.9/extensions/EventBus/includes/JobExecutor.php(79): RefreshLinksJob->run()
#10 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#11 {main}
exception.previous.trace
from /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/database/Database.php(1777)
#0 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/database/Database.php(1384): Wikimedia\Rdbms\Database->getQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/database/Database.php(1296): Wikimedia\Rdbms\Database->executeQueryAttempt(string, string, boolean, string, integer)
#2 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/database/Database.php(1208): Wikimedia\Rdbms\Database->executeQuery(string, string, integer)
#3 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/database/Database.php(4712): Wikimedia\Rdbms\Database->query(string, string, integer)
#4 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/database/Database.php(4684): Wikimedia\Rdbms\Database->doCommit(string)
#5 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1780): Wikimedia\Rdbms\Database->commit(string, string)
#6 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/loadbalancer/LoadBalancer.php(2240): Wikimedia\Rdbms\LoadBalancer->Wikimedia\Rdbms\{closure}(Wikimedia\Rdbms\DatabaseMysqli)
#7 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1785): Wikimedia\Rdbms\LoadBalancer->forEachOpenPrimaryConnection(Closure)
#8 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/lbfactory/LBFactory.php(250): Wikimedia\Rdbms\LoadBalancer->commitPrimaryChanges(string, integer)
#9 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(230): Wikimedia\Rdbms\LBFactory::Wikimedia\Rdbms\{closure}(Wikimedia\Rdbms\LoadBalancer, string, array)
#10 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/lbfactory/LBFactory.php(252): Wikimedia\Rdbms\LBFactoryMulti->forEachLB(Closure, array)
#11 /srv/mediawiki/php-1.39.0-wmf.9/includes/libs/rdbms/lbfactory/LBFactory.php(315): Wikimedia\Rdbms\LBFactory->forEachLBCallMethod(string, array)
#12 /srv/mediawiki/php-1.39.0-wmf.9/includes/jobqueue/jobs/RefreshLinksJob.php(230): Wikimedia\Rdbms\LBFactory->commitPrimaryChanges(string)
#13 /srv/mediawiki/php-1.39.0-wmf.9/includes/jobqueue/jobs/RefreshLinksJob.php(163): RefreshLinksJob->runForTitle(Title)
#14 /srv/mediawiki/php-1.39.0-wmf.9/extensions/EventBus/includes/JobExecutor.php(79): RefreshLinksJob->run()
#15 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#16 {main}
Impact
Notes

These seem to show up in 1.39.0-wmf.8 (T305214). 64 instances since 21st April.

It's possible we discussed/filtered this one from mediawiki-new-errors in the triage meeting this morning; but I don't specifically remember and can't now find anything.

@Ladsgroup may have thoughts?

Event Timeline

This is an issue inside rdbms lib in the sense that the query errors for whatever reason (and we do have a log for that) but rdbms can't properly clean up connections and unlock tables, etc.

The actual error is simply connection going away: https://logstash.wikimedia.org/goto/1b85deab10c11c5bb5e130eb711cc9de

DBQueryDisconnectedError: A connection error occurred during a query. Query: COMMIT Function: RefreshLinksJob::runForTitle Error: 2006 MySQL server has gone away (db1104)

Interesting that the ping() call in approvePrimaryChanges() didn't prevent this. Probably due to the $conn->writesOrCallbacksPending() check, whereas commitPrimaryChanges() doesn't not have such a check (for safety and to flush point-in-time reads for primaries). I wonder why read-only master handle is being obtained. Perhaps it is only used for the getScopedLockAndFlush() call or there were no link table changes to make. GET_LOCK does not count as a "write query", so the ping() would not happen pre-commit. In any case, a disconnect is not considered silently recoverable since a lock was supposed to be held in LinksUpdate::acquirePageLock(). The server connection is dropping due to slow parsing in RefreshLinksJob::runForTitle. Some of these parses are extremely slow:

Parsing Wikidata:Database reports/Constraint violations/Mandatory constraints/Violations was slow, took 148.47 seconds

Also, I see that WRITE_ALLOW_SEGMENTS does get used for the preprocessor cache entry. Hence:

Memcached error for key "WANCache:wikidatawiki:preprocess-hash:a1585af71dae7a2edd292696464af34be4f4b09e:0|#|v" on server "127.0.0.1:11213": ITEM TOO BIG

The multiple parses are consistent with these being wikibase refreshlinks jobs for client wiki pages linking to repo entities (the only RefreshLinksJobs that batch titles AFAIK). The batching is not an issue here.

Making RefreshLinksJob boost the net timeouts could work:

$dbw->setSessionOptions( [ 'connTimeout' => 300 ] );

This probably could be done on all jobs. Refreshlinks is showing up because it has higher frequency of runs.

Change 798889 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: make approvePrimaryChanges() trigger ping() in more cases

https://gerrit.wikimedia.org/r/798889

Change 798892 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: set idle timeouts for CLI/job mode more uniformly

https://gerrit.wikimedia.org/r/798892

Change 798889 merged by jenkins-bot:

[mediawiki/core@master] rdbms: make approvePrimaryChanges() trigger ping() in more cases

https://gerrit.wikimedia.org/r/798889

I suspect there may be a bug in Rdbms unrelated to the fact that an idle connection is lost. Connection loss is rare, but when it happens, the failure I would expect would be something like DBQueryDisconnectedError or something else that we can trace to either caller cause or backend cause.

Instead the exception we see is DBSessionStateError which is, to my knowledge, an error that is meant to indidate a bug in calling code. However given the calling code looks fine here, I guess it is a case where Rdbms is mismanaging its own state and thus throwing DBSessionStateError.

Below is a example for an instance of this issue.

[ffce20f9-4636-4702-90eb-0658d83abed4] /rpc/RunSingleJob.php   Wikimedia\Rdbms\DBSessionStateError: Cannot execute query from CategoryMembershipChangeJob::run while session status is ERROR

Messages received in the "mediawiki" Logstash dashboard for reqId:ffce20f9-4636-4702-90eb-0658d83abed4

mw1425:/w/api.phpSep 8, 2022 @ 21:32:05.934DBConnectionWikimedia\Rdbms\Database::attemptQuery: lost connection to db1107 with error 2006; reconnected
mw1425:/w/api.phpSep 8, 2022 @ 21:32:42.224DBConnectionWikimedia\Rdbms\Database::attemptQuery: lost connection to db1174 with error 2006; reconnected
mw1425:/w/api.phpSep 8, 2022 @ 21:34:01.401DBPerformanceTransaction spent 0.377ms in writes, under the 3s limit
mw1310: jobrunnerSep 8, 2022 @ 21:35:38.759DBConnectionWikimedia\Rdbms\Database::attemptQuery: lost connection to db1107 with error 2006; reconnected
mw1310: jobrunnerSep 8, 2022 @ 21:37:13.832DBConnectionWikimedia\Rdbms\Database::attemptQuery: lost connection to db1107 with error 2006; reconnected
mw1310: jobrunnerSep 8, 2022 @ 21:40:42.114DBConnectionWikimedia\Rdbms\Database::attemptQuery: lost connection to db1107 with error 2006; reconnected`
mw1310: jobrunnerSep 8, 2022 @ 21:42:32.668DBConnectionWikimedia\Rdbms\Database::attemptQuery: lost connection to db1107 with error 2006; reconnected
mw1310: jobrunnerSep 8, 2022 @ 21:45:10.906DBConnectionWikimedia\Rdbms\Database::attemptQuery: lost connection to db1101:3317 with error 2006; reconnected
mw1310: jobrunnerSep 8, 2022 @ 21:45:12.666DBConnectionWikimedia\Rdbms\Database::attemptQuery: lost connection to db1107 with error 2006; reconnected
mw1310: jobrunnerSep 8, 2022 @ 21:47:50.914DBConnectionWikimedia\Rdbms\Database::attemptQuery: lost connection to db1107 with error 2006; reconnected
mw1310: jobrunnerSep 8, 2022 @ 21:47:57.787DBConnectionSilent reconnection to db1163 could not be attempted: session state loss (named lock 'enwiki:CategoryMembershipChange:36395484')
mw1310: jobrunnerSep 8, 2022 @ 21:47:57.799DBConnectionWikimedia\Rdbms\Database::attemptQuery: lost connection to db1163 with error 2006; reconnected
mw1310: jobrunnerSep 8, 2022 @ 21:47:57.802DBQueryError 2006 from Wikimedia\Rdbms\Database::ping, MySQL server has gone away SELECT 1 AS ping db1163
mw1310: jobrunnerSep 8, 2022 @ 21:47:57.804exceptionWikimedia\Rdbms\DBSessionStateError: Cannot execute query from CategoryMembershipChangeJob::run while session status is ERROR
mw1310: jobrunnerSep 8, 2022 @ 21:47:57.805exceptionWikimedia\Rdbms\DBQueryDisconnectedError: A connection error occurred during a query. Query: SELECT 1 AS ping Function: Wikimedia\Rdbms\Database::ping Error: 2006 MySQL server has gone away
mw1310: jobrunnerSep 8, 2022 @ 21:47:57.826DBConnectionWikimedia\Rdbms\Database::attemptQuery: lost connection to db1101:3317 with error 2006; reconnected

Still seeing this error as of 2022-09-29; happens about 3× per hour

Error
normalized_message
[{reqId}] {exception_url}   Wikimedia\Rdbms\DBSessionStateError: Cannot execute query from CategoryMembershipChangeJob::run while session status is ERROR
exception.trace
from /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/database/TransactionManager.php(221)
#0 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/database/Database.php(1387): Wikimedia\Rdbms\TransactionManager->assertSessionStatus(Wikimedia\Rdbms\DatabaseMysqli, string)
#1 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/database/Database.php(936): Wikimedia\Rdbms\Database->assertQueryIsCurrentlyAllowed(string, string)
#2 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/database/DatabaseMysqlBase.php(997): Wikimedia\Rdbms\Database->query(string, string, integer)
#3 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/database/Database.php(3415): Wikimedia\Rdbms\DatabaseMysqlBase->doUnlock(string, string)
#4 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/database/Database.php(3455): Wikimedia\Rdbms\Database->unlock(string, string)
#5 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/database/Database.php(2498): Wikimedia\Rdbms\Database->Wikimedia\Rdbms\{closure}(integer, Wikimedia\Rdbms\DatabaseMysqli)
#6 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1879): Wikimedia\Rdbms\Database->runOnTransactionIdleCallbacks(integer, array)
#7 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/lbfactory/LBFactory.php(469): Wikimedia\Rdbms\LoadBalancer->runPrimaryTransactionIdleCallbacks(string)
#8 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/lbfactory/LBFactory.php(444): Wikimedia\Rdbms\LBFactory->executePostTransactionCallbacks()
#9 /srv/mediawiki/php-1.40.0-wmf.2/includes/exception/MWExceptionHandler.php(161): Wikimedia\Rdbms\LBFactory->rollbackPrimaryChanges(string)
#10 /srv/mediawiki/php-1.40.0-wmf.2/extensions/EventBus/includes/JobExecutor.php(112): MWExceptionHandler::rollbackPrimaryChangesAndLog(Wikimedia\Rdbms\DBQueryDisconnectedError)
#11 /srv/mediawiki/rpc/RunSingleJob.php(77): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#12 {main}
exception.previous.trace
from /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/database/Database.php(1616)
#0 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/database/Database.php(1235): Wikimedia\Rdbms\Database->getQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/database/Database.php(1103): Wikimedia\Rdbms\Database->attemptQuery(array, array, string, string, boolean, boolean)
#2 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/database/Database.php(940): Wikimedia\Rdbms\Database->executeQuery(string, string, integer, string)
#3 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/database/Database.php(3014): Wikimedia\Rdbms\Database->query(string, string, integer)
#4 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1775): Wikimedia\Rdbms\Database->ping()
#5 /srv/mediawiki/php-1.40.0-wmf.2/includes/libs/rdbms/lbfactory/LBFactory.php(414): Wikimedia\Rdbms\LoadBalancer->approvePrimaryChanges(array, string)
#6 /srv/mediawiki/php-1.40.0-wmf.2/extensions/EventBus/includes/JobExecutor.php(287): Wikimedia\Rdbms\LBFactory->commitPrimaryChanges(string, array)
#7 /srv/mediawiki/php-1.40.0-wmf.2/extensions/EventBus/includes/JobExecutor.php(81): MediaWiki\Extension\EventBus\JobExecutor->commitPrimaryChanges(Wikimedia\Rdbms\LBFactoryMulti, string)
#8 /srv/mediawiki/rpc/RunSingleJob.php(77): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#9 {main}

Change 798892 abandoned by Aaron Schulz:

[mediawiki/core@master] rdbms: set idle timeouts for CLI/job mode more uniformly

Reason:

https://gerrit.wikimedia.org/r/798892

Seems to be fixed judging from logs this week.

I see two unrelated DBSessionStateError instances from DatabaseMysqlBase::serverIsReadOnly.