Page MenuHomePhabricator

Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-rollback-callbacks')
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error
  • mwversion: 1.37.0-wmf.3
  • reqId: 3c75e13d-35df-4caa-9d8e-a5ab1df5146a
normalized_message
[{reqId}] {exception_url}   Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-rollback-callbacks')
exception.trace
from /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactory.php(819)
#0 /srv/mediawiki/php-1.37.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactory.php(290): Wikimedia\Rdbms\LBFactory->assertTransactionRoundStage(string)
#1 /srv/mediawiki/php-1.37.0-wmf.1/includes/MediaWiki.php(1120): Wikimedia\Rdbms\LBFactory->commitMasterChanges(string)
#2 /srv/mediawiki/rpc/RunSingleJob.php(93): MediaWiki->restInPeace()
#3 {main}
Impact

About 30 in the last 24 hours. Not a train blocker.

Notes

Event Timeline

This was marked as a train blocker because I'm a klutz. Removing it as a train blocker.

Krinkle added a subscriber: Krinkle.

Still seen.

Sample from reqId 20b32cae-c28d-40b8-aeca-8c33abff406b (30 Sep 2021 09:32):

  • Sep 30, 2021 @ 09:32:30.124
message
 /rpc/RunSingleJob.php   Wikimedia\Rdbms\DBQueryDisconnectedError: A connection error occurred during a query. 
Query: ROLLBACK
Function: MWExceptionHandler::rollbackPrimaryChangesAndLog
Error: 2006 MySQL server has gone away (db1122)
trace
from /srv/mediawiki/php-1.38.0-wmf.1/includes/libs/rdbms/database/Database.php(1807)
#0 /srv/mediawiki/php-1.38.0-wmf.1/includes/libs/rdbms/database/Database.php(1793): Wikimedia\Rdbms\Database->getQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.38.0-wmf.1/includes/libs/rdbms/database/Database.php(1768): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, integer, string, string)
#2 /srv/mediawiki/php-1.38.0-wmf.1/includes/libs/rdbms/database/Database.php(1327): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#3 /srv/mediawiki/php-1.38.0-wmf.1/includes/libs/rdbms/database/Database.php(5016): Wikimedia\Rdbms\Database->query(string, string, integer)
#4 /srv/mediawiki/php-1.38.0-wmf.1/includes/libs/rdbms/database/Database.php(4978): Wikimedia\Rdbms\Database->doRollback(string)
#5 /srv/mediawiki/php-1.38.0-wmf.1/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1991): Wikimedia\Rdbms\Database->rollback(string, string)
#6 /srv/mediawiki/php-1.38.0-wmf.1/includes/libs/rdbms/loadbalancer/LoadBalancer.php(2344): Wikimedia\Rdbms\LoadBalancer::Wikimedia\Rdbms\{closure}(Wikimedia\Rdbms\DatabaseMysqli)
#7 /srv/mediawiki/php-1.38.0-wmf.1/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1992): Wikimedia\Rdbms\LoadBalancer->forEachOpenPrimaryConnection(Closure)
#8 /srv/mediawiki/php-1.38.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactory.php(249): Wikimedia\Rdbms\LoadBalancer->rollbackPrimaryChanges(string, integer)
#9 /srv/mediawiki/php-1.38.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(236): Wikimedia\Rdbms\LBFactory::Wikimedia\Rdbms\{closure}(Wikimedia\Rdbms\LoadBalancer, string, array)
#10 /srv/mediawiki/php-1.38.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactory.php(251): Wikimedia\Rdbms\LBFactoryMulti->forEachLB(Closure, array)
#11 /srv/mediawiki/php-1.38.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactory.php(342): Wikimedia\Rdbms\LBFactory->forEachLBCallMethod(string, array)
#12 /srv/mediawiki/php-1.38.0-wmf.1/includes/exception/MWExceptionHandler.php(136): Wikimedia\Rdbms\LBFactory->rollbackPrimaryChanges(string)
#13 /srv/mediawiki/php-1.38.0-wmf.1/extensions/EventBus/includes/JobExecutor.php(112): MWExceptionHandler::rollbackPrimaryChangesAndLog(Wikibase\DataModel\Services\Lookup\EntityLookupException)
#14 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#15 {main}
  • Sep 30, 2021 @ 09:32:30.125
message
/rpc/RunSingleJob.php   Wikibase\DataModel\Services\Lookup\EntityLookupException: The critical section "Wikimedia\Rdbms\Database::executeQuery" timed out after 180 seconds
  • Sep 30, 2021 @ 09:32:30.128
/rpc/RunSingleJob.php   Wikimedia\Rdbms\DBUnexpectedError: Cannot execute Wikimedia\Rdbms\Database::runOnTransactionIdleCallbacks critical section while session state is out of sync.

A critical section from Wikimedia\Rdbms\Database::rollback has failed
#0 /srv/mediawiki/php-1.38.0-wmf.1/vendor/wikimedia/request-timeout/src/CriticalSectionScope.php(44): Wikimedia\Rdbms\Database->Wikimedia\Rdbms\{closure}(44)
  • Sep 30, 2021 @ 09:32:30.129
Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-rollback')
  • Sep 30, 2021 @ 09:32:30.237
PHP Fatal error: Uncaught Wikimedia\Rdbms\DBUnexpectedError: Cannot execute Wikimedia\Rdbms\Database::rollback critical section while session state is out of sync.

A critical section from Wikimedia\Rdbms\Database::executeQuery has failed
#0 /srv/mediawiki/php-1.38.0-wmf.1/vendor/wikimedia/request-timeout/src/CriticalSectionScope.php(44):

Still seen this week in production. The samples I looked at all issued Database::rollback critical section while session state is out of sync. after a critical section timeout.

Change 735703 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: improve clarity of Database::canRecoverFromDisconnect and add logging

https://gerrit.wikimedia.org/r/735703

Change 735703 merged by jenkins-bot:

[mediawiki/core@master] rdbms: improve clarity of Database::canRecoverFromDisconnect and add logging

https://gerrit.wikimedia.org/r/735703

One problem is that RevisionBasedEntityLookup is catching generic Exceptions and throwing a different exception, so any entrypoint logic (e.g. jobrunner, runSingleJob, MediaWiki class, MWExceptionHandler class) won't know that the original error was a timeout (via an interrupt, leaving a bunch of function calls unfinished).

Even if it did bubble up, we'd still need to improve those entry points to handle timeouts better. Perhaps rollbackPrimaryChanges() could also:

It would be nice to have some method to handle this for replica connections too, since they can also timeout on queries.

Also, the Database subclasses could catch timeout exceptions that occur right after the driver call (the pecl extension C methods are not interrupted) and treat it as a non-state-corrupting DBError, while bubbling the error up at the same time. This would avoid a lot of these problem cases.

Using mysql-level query timeouts would also be an option to limit this from occurring (though it could still happen with multiple slow queries).

Change 737824 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] [WIP] rdbms: make automatic connection recovery apply to more cases

https://gerrit.wikimedia.org/r/737824

Change 764359 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Increase logging of DBConnection messages from 'error' to 'warning'

https://gerrit.wikimedia.org/r/764359

Change 764359 merged by jenkins-bot:

[operations/mediawiki-config@master] Expand log level of DBConnection messages from 'error' to 'warning'

https://gerrit.wikimedia.org/r/764359

Change 737824 merged by jenkins-bot:

[mediawiki/core@master] rdbms: make automatic connection recovery apply to more cases

https://gerrit.wikimedia.org/r/737824

I'm seeing this same error locally when testing specific behavior of some REST API endpoint changes for T303352, but having a hard time debugging the path from ParameterAssertionException being thrown here to this error being thrown. I think @aaron's comment above might explain the problem i'm seeing, any chance of helping me step through this particular use case @aaron ? :)

I'm seeing this same error locally when testing specific behavior of some REST API endpoint changes for T303352, but having a hard time debugging the path from ParameterAssertionException being thrown here to this error being thrown. I think @aaron's comment above might explain the problem i'm seeing, any chance of helping me step through this particular use case @aaron ? :)

That might just be T303887.

Related to T303885 ?

Caused by the same thing https://gerrit.wikimedia.org/r/c/mediawiki/core/+/737824/

I'm inclined to revert that patch and follow ups

Change 771081 merged by jenkins-bot:

[mediawiki/core@master] rdbms: fix owner id and RELEASE_ALL_LOCKS query in session flushing methods

https://gerrit.wikimedia.org/r/771081

hashar added a subscriber: hashar.

We had the error again on mediawikiwiki when promoting to 1.39.0-wmf.1:

Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-rollback-callbacks') from /rpc/RunSingleJob.php

Error
normalized_message
[{reqId}] {exception_url}   Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-rollback-callbacks')
exception.trace
from /srv/mediawiki/php-1.39.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactory.php(818)
#0 /srv/mediawiki/php-1.39.0-wmf.1/includes/libs/rdbms/lbfactory/LBFactory.php(290): Wikimedia\Rdbms\LBFactory->assertTransactionRoundStage(string)
#1 /srv/mediawiki/php-1.39.0-wmf.1/includes/MediaWiki.php(1110): Wikimedia\Rdbms\LBFactory->commitPrimaryChanges(string)
#2 /srv/mediawiki/rpc/RunSingleJob.php(93): MediaWiki->restInPeace()
#3 {main}

Fun thing with f01738ac-a46e-4e0f-a8a8-039b77b5ed41: it is an user creating an account on mediawiki.org (1.39.0-wmf.1) with a global account created on meta (1.38.0-wmf.26). The two mediawiki versions surprised me.

[f01738ac-a46e-4e0f-a8a8-039b77b5ed41] /rpc/RunSingleJob.php Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-rollback-callbacks')

Before that there is:

[f01738ac-a46e-4e0f-a8a8-039b77b5ed41] /rpc/RunSingleJob.php Wikimedia\Rdbms\DBUnexpectedError: Cannot execute Wikimedia\Rdbms\Database::runOnTransactionIdleCallbacks critical section while session state is out of sync.

And after:

Error 1062 from EchoNotificationMapper::insert, Duplicate entry '17604325-1515299' for key 'PRIMARY' (db1157) INSERT INTO echo_notification (notification_event,notification_user,notification_timestamp,notification_read_timestamp,notification_bundle_hash) VALUES (1515299,17604325,'20220321124005',NULL,'') db1157

Zabe added a subscriber: Zabe.

I am marking this one resolved again. The error message Transaction round stage must be 'cursory' (not 'within-rollback-callbacks') pointed me there but after looking at exceptions and errors for the requests the issues were due to a duplicate primary key for echo notifications. I highly suspect the root cause was T304307.

Change 771755 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: add mysql timeout error codes to wasKnownStatementRollbackError()

https://gerrit.wikimedia.org/r/771755

Change 771755 merged by jenkins-bot:

[mediawiki/core@master] rdbms: add mysql timeout error codes to wasKnownStatementRollbackError()

https://gerrit.wikimedia.org/r/771755