Page MenuHomePhabricator

Wikimedia\\Rdbms\\DBTransactionError(code: 0): A database transaction round is pending.
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error message
[2020-03-18 16:24:51] exception.ERROR: [85bdd9b31365fadef455fbc9] /w/api.php   Wikimedia\Rdbms\DBTransactionError from line 393 of /srv/mediawiki/tags/2020-03-18_15:30:30/includes/deferred/DeferredUpdates.php: A database transaction round is pending. {"exception":"[object] (Wikimedia\\Rdbms\\DBTransactionError(code: 0): A database transaction round is pending. at /srv/mediawiki/tags/2020-03-18_15:30:30/includes/deferred/DeferredUpdates.php:393)
Impact
  • Huge amounts of log spam.
  • Some post-edit updates are not done??
  • Updating translatewiki.net codebase is hard due to frequent merge conflicts with reverts.

Details

Request URL
/w/api.php
Stack Trace
[2020-03-18 16:24:51] exception.ERROR: [85bdd9b31365fadef455fbc9] /w/api.php   Wikimedia\Rdbms\DBTransactionError from line 393 of /srv/mediawiki/tags/2020-03-18_15:30:30/includes/deferred/DeferredUpdates.php: A database transaction round is pending. {"exception":"[object] (Wikimedia\\Rdbms\\DBTransactionError(code: 0): A database transaction round is pending. at /srv/mediawiki/tags/2020-03-18_15:30:30/includes/deferred/DeferredUpdates.php:393)
[stacktrace]
#0 /srv/mediawiki/tags/2020-03-18_15:30:30/includes/deferred/RefreshSecondaryDataUpdate.php(97): DeferredUpdates::attemptUpdate(Object(LinksUpdate), Object(Wikimedia\\Rdbms\\LBFactorySimple))
#1 /srv/mediawiki/tags/2020-03-18_15:30:30/includes/deferred/DeferredUpdates.php(417): RefreshSecondaryDataUpdate->doUpdate()
#2 /srv/mediawiki/tags/2020-03-18_15:30:30/includes/deferred/DeferredUpdates.php(296): DeferredUpdates::attemptUpdate(Object(RefreshSecondaryDataUpdate), Object(Wikimedia\\Rdbms\\LBFactorySimple))
#3 /srv/mediawiki/tags/2020-03-18_15:30:30/includes/deferred/DeferredUpdates.php(233): DeferredUpdates::run(Object(RefreshSecondaryDataUpdate), Object(Wikimedia\\Rdbms\\LBFactorySimple), Object(Monolog\\Logger), Object(BufferingStatsdDataFactory), 'post')
#4 /srv/mediawiki/tags/2020-03-18_15:30:30/includes/deferred/DeferredUpdates.php(150): DeferredUpdates::handleUpdateQueue(Array, 'run', 2)
#5 /srv/mediawiki/tags/2020-03-18_15:30:30/includes/MediaWiki.php(1059): DeferredUpdates::doUpdates('run')
#6 /srv/mediawiki/tags/2020-03-18_15:30:30/includes/MediaWiki.php(839): MediaWiki->restInPeace()
#7 /srv/mediawiki/tags/2020-03-18_15:30:30/includes/MediaWiki.php(848): MediaWiki->{closure}()
#8 /srv/mediawiki/tags/2020-03-18_15:30:30/api.php(119): MediaWiki->doPostOutputShutdown()
#9 {main}

Event Timeline

It seems to be happening after translation edits. I'll try to debug a bit more.

So DeferredUpdates has this code:

$ticket = $lbFactory->getEmptyTransactionTicket( __METHOD__ );
if ( !$ticket || $lbFactory->hasTransactionRound() ) {
        throw new DBTransactionError( null, "A database transaction round is pending." );
}

Based on my debugging $lbFactory->hasTransactionRound() is falsy, and so is $ticket. I stopped digging further why ticket is falsy.

Also at the time the error happens:

[id:Wikimedia\Rdbms\LBFactory:private] => 2106802960
[ticket:Wikimedia\Rdbms\LBFactory:private] => 2028367884
[trxRoundId:Wikimedia\Rdbms\LBFactory:private] => 
[trxRoundStage:Wikimedia\Rdbms\LBFactory:private] => cursory

In any case, reverting 1f4efc6 makes the exception go away.

Let me know if you need further debugging from my side.

For sanity, adding as blocker for the train:

git branch --contains 1f4efc6 --all
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/master
  remotes/origin/wmf/1.35.0-wmf.24
Jdforrester-WMF triaged this task as Unbreak Now! priority.Mar 18 2020, 6:30 PM

I tried to reproduce this on test.wikipedia.org and check Logstash, but did not see anything in Logstash. But Translate seems to be somewhat broken there (T157997) so cannot still say whether this will affect production.

Doesn't seem to be obviously Translate related, as editing a normal page (such as Project:News) makes the exception show up. Could be a difference in extensions or configuration that is causing this on translatewiki.net but not on testwiki.

Is that a train blocker for Wikimedia, it seems to be for translatewiki. Though we can dig in logstash to check whether it is happening there as well.

So far, we've been unable to reproduce this on testwiki. Unless this error does start showing up in Wikimedia production, I don't think this needs to be a train blocker.


@Nikerabbit did a little debugging via IRC (#mediawiki).

  • $ticket is falsey. That's either because mt_rand() is reliably returning 0, or some DB connection is returning true from writesOrCallbacksPending()
    • LBFactory::getEmptyTransactionTicket() returns null when ::hasMasterChanges() returns true
    • LBFactory::hasMasterChanges() returns true when any LoadBalancer::hasMasterChanges() returns true.
    • LoadBalancer::hasMasterChanges() returns true when any master connection Database::writesOrCallbacksPending() returns true.
    • Database::writesOrCallbacksPending() returns true if there's an open transaction and at least one of several fields is truthy.

We looked at a log of SQL statements (P10724) to see if there were any obvious open transactions, but the transactions in there seem balanced. At this point we ran out of time and ideas to look at further.

Thanks @Anomie for the check.

The error appears on translatewiki which runs out of master. So maybe that is an error that has been introduced after I have branched 1.35.0-wmf.24 on Tuesday.

For now, I am removing it from the list of blockers, but we can keep it in mind as we promote to group0/group1 etc.

It seems it is up to three patches now, which is not sustainable. I'll give another shot at debugging this, but will probably need more help.

I made this trace:

["BSTART"]
["D",1,false,[],[],[[[{},"cancelOnRollback"],"SMW\\MediaWiki\\Deferred\\TransactionalCallableUpdate::__construct",null],[[{},"cancelOnRollback"],"SMW\\MediaWiki\\Deferred\\TransactionalCallableUpdate::__construct",null]],[]]
["D",0,false,[],[],[],[]]
["C",true,{}]
["D",1,false,[],[],[[[{},"cancelOnRollback"],"SMW\\MediaWiki\\Deferred\\TransactionalCallableUpdate::__construct",null],[[{},"cancelOnRollback"],"SMW\\MediaWiki\\Deferred\\TransactionalCallableUpdate::__construct",null]],[]]
["D",0,false,[],[],[],[]]
["D",1,false,[],[],[[[{},"cancelOnRollback"],"SMW\\MediaWiki\\Deferred\\TransactionalCallableUpdate::__construct",null],[[{},"cancelOnRollback"],"SMW\\MediaWiki\\Deferred\\TransactionalCallableUpdate::__construct",null]],[]]
["D",0,false,[],[],[],[]]
["C",true,{}]
["D",1,false,[],[],[[[{},"cancelOnRollback"],"SMW\\MediaWiki\\Deferred\\TransactionalCallableUpdate::__construct",null],[[{},"cancelOnRollback"],"SMW\\MediaWiki\\Deferred\\TransactionalCallableUpdate::__construct",null]],[]]
["D",0,false,[],[],[],[]]
["A",1872741175,true]
["BEND",null]

Using this patch:

diff --git a/includes/deferred/DeferredUpdates.php b/includes/deferred/DeferredUpdates.php
index 42086395bc..5338ca80a7 100644
--- a/includes/deferred/DeferredUpdates.php
+++ b/includes/deferred/DeferredUpdates.php
@@ -388,8 +388,11 @@ class DeferredUpdates {
         * @since 1.34
         */
        public static function attemptUpdate( DeferrableUpdate $update, ILBFactory $lbFactory ) {
+               file_put_contents( '/home/nike/debug', json_encode( [ 'BSTART' ] ), FILE_APPEND );
+
                $ticket = $lbFactory->getEmptyTransactionTicket( __METHOD__ );
                if ( !$ticket || $lbFactory->hasTransactionRound() ) {
+                       file_put_contents( '/home/nike/debug', json_encode( [ 'BEND', $ticket ] ), FILE_APPEND );
                        throw new DBTransactionError( null, "A database transaction round is pending." );
                }
 
diff --git a/includes/libs/rdbms/database/Database.php b/includes/libs/rdbms/database/Database.php
index bfbcbd8f80..3b9e26d814 100644
--- a/includes/libs/rdbms/database/Database.php
+++ b/includes/libs/rdbms/database/Database.php
@@ -647,6 +647,9 @@ abstract class Database implements IDatabase, IMaintainableDatabase, LoggerAware
        }
 
        public function writesOrCallbacksPending() {
+               file_put_contents( '/home/nike/debug', json_encode( [ 'D', $this->trxLevel(), 
+                       $this->trxDoneWrites, $this->trxIdleCallbacks, $this->trxPreCommitCallbacks, 
+                               $this->trxEndCallbacks, $this->trxSectionCancelCallbacks ] ),FILE_APPEND );
                return $this->trxLevel() && (
                        $this->trxDoneWrites ||
                        $this->trxIdleCallbacks ||
diff --git a/includes/libs/rdbms/lbfactory/LBFactory.php b/includes/libs/rdbms/lbfactory/LBFactory.php
index 291a50a1f2..1465e17824 100644
--- a/includes/libs/rdbms/lbfactory/LBFactory.php
+++ b/includes/libs/rdbms/lbfactory/LBFactory.php
@@ -387,6 +387,7 @@ abstract class LBFactory implements ILBFactory {
        public function hasMasterChanges() {
                $ret = false;
                $this->forEachLB( function ( ILoadBalancer $lb ) use ( &$ret ) {
+                       file_put_contents( '/home/nike/debug', json_encode( [ 'C', $lb->hasMasterChanges(), $lb ] ), FILE_APPEND );
                        $ret = $ret || $lb->hasMasterChanges();
                } );
 
@@ -489,6 +490,8 @@ abstract class LBFactory implements ILBFactory {
 
        public function getEmptyTransactionTicket( $fname ) {
                if ( $this->hasMasterChanges() ) {
+                       file_put_contents( '/home/nike/debug', json_encode( [ 'A', $this->ticket, $this->hasMasterChanges() ] ), FILE_APPEND );
+
                        $this->queryLogger->error(
                                __METHOD__ . ": $fname does not have outer scope",
                                [ 'trace' => ( new RuntimeException() )->getTraceAsString() ]

Basically this means that trxEndCallbacks is not empty.

The docs for TransactionalCallableUpdate class say:

* Extends DeferredCallableUpdate to allow handling of transaction related tasks
* or isolations to ensure an undisturbed update process before and after
* MediaWiki::preOutputCommit.

Searching for cancelOnRollback results multiple hits in MediaWiki core and in SMW, but I assume it would be the one in TransactionalCallableUpdate. Here are traces for TransactionalCallableUpdate constructor:

MediaWiki->run
MediaWiki->doPostOutputShutdown
MediaWiki->{closure}
MediaWiki->restInPeace
DeferredUpdates::doUpdates
DeferredUpdates::handleUpdateQueue
DeferredUpdates::run
DeferredUpdates::attemptUpdate
RefreshSecondaryDataUpdate->doUpdate
MediaWiki\\Storage\\DerivedPageDataUpdater->getSecondaryDataUpdates
LinksUpdate->__construct
Hooks::run
Hooks::callHook
SMW\\MediaWiki\\Hooks->onLinksUpdateConstructed
SMW\\MediaWiki\\Hooks\\LinksUpdateConstructed->process
SMW\\ParserData->updateStore
SMW\\DataUpdater->doUpdate
SMW\\MediaWiki\\Deferred\\TransactionalCallableUpdate::newUpdate
SMW\\MediaWiki\\Deferred\\TransactionalCallableUpdate->__construct
MediaWiki->run
MediaWiki->doPostOutputShutdown
MediaWiki->{closure}
MediaWiki->restInPeace
DeferredUpdates::doUpdates
DeferredUpdates::handleUpdateQueue
DeferredUpdates::run
DeferredUpdates::attemptUpdate
RefreshSecondaryDataUpdate->doUpdate
MediaWiki\\Storage\\DerivedPageDataUpdater->getSecondaryDataUpdates
AbstractContent->getSecondaryDataUpdates # <----------------------------- This line is not present in the earlier trace (not sure if it has any significance)
LinksUpdate->__construct
Hooks::run
Hooks::callHook
SMW\\MediaWiki\\Hooks->onLinksUpdateConstructed
SMW\\MediaWiki\\Hooks\\LinksUpdateConstructed->process
SMW\\ParserData->updateStore
SMW\\DataUpdater->doUpdate
SMW\\MediaWiki\\Deferred\\TransactionalCallableUpdate::newUpdate
SMW\\MediaWiki\\Deferred\\TransactionalCallableUpdate->__construct"

Someone more knowledgeable on this system than me needs to figure out how this can be solved.

["D",1,false,[],[],[[[{},"cancelOnRollback"],"SMW\\MediaWiki\\Deferred\\TransactionalCallableUpdate::__construct",null],[[{},"cancelOnRollback"],"SMW\\MediaWiki\\Deferred\\TransactionalCallableUpdate::__construct",null]],[]]

[...]
Searching for cancelOnRollback results multiple hits in MediaWiki core and in SMW, but I assume it would be the one in TransactionalCallableUpdate.

Yes, that's a good assumption. It's helpful that the DB methods take an $fname and that it's included in the log you added, making it easy to identify the source.

AbstractContent->getSecondaryDataUpdates # <----------------------------- This line is not present in the earlier trace (not sure if it has any significance)

Probably not significant, it's just that there's two code paths that can construct a LinksUpdate object.


My best guess as to what's happening is:

  1. RefreshSecondaryDataUpdate->doUpdate() is called without a transaction active.
  2. Something inside the call to $this->updater->getSecondaryDataUpdates() starts an implicit transaction (P10724 line 561)
  3. SMW's TransactionalCallableUpdate registers its callback, which makes the implicit transaction no longer "empty".
  4. RefreshSecondaryDataUpdate calls DeferredUpdates::attemptUpdate(), which blows up because it can't get the ticket because the transaction isn't empty.

Without SMW, step #3 doesn't happen so the implicit transaction remains empty. And presumably nothing else in Wikimedia production currently does anything similar.

SMW isn't really doing anything really wrong here, it was trying to make sure that its update wouldn't get run if the save got rolled back. OTOH, it's pointless after rMW1f4efc6c34aa: Add RefreshSecondaryDataUpdate and use it in DerivedPageDataUpdater because its update isn't even created now until well after the save has already happened (or been rolled back).

Probably the fix is to have RefreshSecondaryDataUpdate do what DerivedPageDataUpdater used to do in non-defer mode and commit transactions before it starts trying to run the sub-updates.

Change 587788 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] RefreshSecondaryDataUpdate: Commit before running sub-updates

https://gerrit.wikimedia.org/r/587788

Change 587788 merged by jenkins-bot:
[mediawiki/core@master] RefreshSecondaryDataUpdate: Commit before running sub-updates

https://gerrit.wikimedia.org/r/587788

I temporarily reverted our changes and then deployed the changes from the patch, and it seems to be working well. We will deploy the patch on the server next Wednesday.

We've deployed this change on twn, and haven't noticed any issues as of yet. Will resolve after reviewing the logs a little longer.

Logs look clean. Resolving this issue. Thanks for the help.