Wikimedia\Rdbms\DBTransactionError: Explicit transaction still active; a caller might have failed to call endAtomic() or cancelAtomic().
Closed, Resolved | Public | PRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   Wikimedia\Rdbms\DBTransactionError: Explicit transaction still active; a caller might have failed to call endAtomic() or cancelAtomic().
exception.trace
from /srv/mediawiki/php-1.39.0-wmf.23/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1694)
#0 /srv/mediawiki/php-1.39.0-wmf.23/includes/libs/rdbms/lbfactory/LBFactory.php(324): Wikimedia\Rdbms\LoadBalancer->approvePrimaryChanges(array, string)
#1 /srv/mediawiki/php-1.39.0-wmf.23/includes/MediaWiki.php(671): Wikimedia\Rdbms\LBFactory->commitPrimaryChanges(string, array)
#2 /srv/mediawiki/php-1.39.0-wmf.23/includes/api/ApiMain.php(901): MediaWiki::preOutputCommit(DerivativeContext)
#3 /srv/mediawiki/php-1.39.0-wmf.23/includes/api/ApiMain.php(846): ApiMain->executeActionWithErrorHandling()
#4 /srv/mediawiki/php-1.39.0-wmf.23/api.php(90): ApiMain->execute()
#5 /srv/mediawiki/php-1.39.0-wmf.23/api.php(45): wfApiMain()
#6 /srv/mediawiki/w/api.php(3): require(string)
#7 {main}
Impact
Notes

https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Database_error — user report on en.wiki at 2022-08-16T04:10:00.000Z

[c42d457a-4fe1-4b6f-b69f-0cf0ad0ab0b9] 2022-08-16 04:08:56: Fatal exception of type "Wikimedia\Rdbms\DBTransactionError"

Event Timeline

04:46 < cwhite> Explicit transaction still active errors have stopped

colewhite assigned this task to jcrespo.
colewhite subscribed.

Recovery appears to correlate with restoration of X2 primary db in eqiad. Optimistically resolving.

tstarling triaged this task as High priority.
tstarling added subscribers: aaron, tstarling.

Any kind of problem with x2 was supposed to cause graceful degradation in MediaWiki. The remaining issue is the fact that replication failure leads to an uncaught exception from approvePrimaryChanges().

Change 823763 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] objectcache: Add trace to SqlBagOStuff DBError logging

https://gerrit.wikimedia.org/r/823763

Change 823763 merged by jenkins-bot:

[mediawiki/core@master] objectcache: Add trace to SqlBagOStuff DBError logging

https://gerrit.wikimedia.org/r/823763

Change 823791 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] SqlBagOStuff: use cancelAtomic()

https://gerrit.wikimedia.org/r/823791

Change 823791 merged by jenkins-bot:

[mediawiki/core@master] SqlBagOStuff: use cancelAtomic()

https://gerrit.wikimedia.org/r/823791
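
For readers unfamiliar with the failure mode, here is a minimal toy model in Python. It is not MediaWiki code and does not reproduce the change in 823791; `ToyConnection`, `leaky_set`, and `safe_set` are hypothetical names that only echo the endAtomic()/cancelAtomic() wording of the error message. It sketches, under those assumptions, why swallowing a write error without cancelling the atomic section can still fail later at the pre-commit check.

```python
class TransactionError(Exception):
    """Raised when an atomic section is mismatched or still open at commit."""


class ToyConnection:
    """Toy stand-in for a database handle; NOT the MediaWiki rdbms API."""

    def __init__(self):
        self.open_sections = []  # stack of (name, snapshot of rows)
        self.rows = {}           # pending writes

    def start_atomic(self, name):
        self.open_sections.append((name, dict(self.rows)))

    def end_atomic(self, name):
        self._pop(name)  # keep the writes made inside the section

    def cancel_atomic(self, name):
        self.rows = self._pop(name)  # roll back to the snapshot

    def _pop(self, name):
        if not self.open_sections or self.open_sections[-1][0] != name:
            raise TransactionError(f"no matching atomic section for {name}")
        return self.open_sections.pop()[1]

    def write(self, key, value, fail=False):
        if fail:
            raise OSError("simulated query failure")
        self.rows[key] = value

    def approve_commit(self):
        # Analogous in spirit to the check that produced the error above.
        if self.open_sections:
            raise TransactionError("Explicit transaction still active")


def leaky_set(conn, key, value, fail=False):
    """Swallows the error but leaves the atomic section open."""
    conn.start_atomic("set")
    try:
        conn.write(key, value, fail=fail)
        conn.end_atomic("set")
        return True
    except OSError:
        return False  # looks graceful, but section "set" is still open


def safe_set(conn, key, value, fail=False):
    """Cancels the atomic section on failure, so a later commit is clean."""
    conn.start_atomic("set")
    try:
        conn.write(key, value, fail=fail)
    except OSError:
        conn.cancel_atomic("set")
        return False
    conn.end_atomic("set")
    return True
```

In this model, calling `approve_commit()` after a failed `leaky_set()` raises, while the same call after a failed `safe_set()` succeeds with no rows written.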

Preliminary incident report created at https://wikitech.wikimedia.org/wiki/Incidents/2022-08-16_x2_databases_replication_breakage CC @lmata

Two concrete gaps in my knowledge that I left blank:

  • Who called Amir?
  • Detection: I wasn't around at that time. Was it purely page-based, were user reports fundamental, or was the logstash/MW error rate what alerted SREs?

CC @colewhite

@jcrespo thank you for the report! Looking forward to our review later, much appreciated.

I checked operations logs and I have answers to my previous questions:

Who called Amir?

He was woken up by the manual page by @colewhite.

Detection: I wasn't around at that time. Was it purely page-based, were user reports fundamental, or was the logstash/MW error rate what alerted SREs?

The Icinga alert was taken seriously right away and procedure was followed. @TheresNoTime later brought user reports to the attention of SREs (helpful as another way to verify user impact), but SREs were already working on it by then.

Updating report.

Change 824434 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@wmf/1.39.0-wmf.25] SqlBagOStuff: use cancelAtomic()

https://gerrit.wikimedia.org/r/824434

Change 824434 merged by jenkins-bot:

[mediawiki/core@wmf/1.39.0-wmf.25] SqlBagOStuff: use cancelAtomic()

https://gerrit.wikimedia.org/r/824434

Mentioned in SAL (#wikimedia-operations) [2022-08-19T01:30:59Z] <tstarling@deploy1002> Synchronized php-1.39.0-wmf.25/includes/libs/rdbms/database/DBConnRef.php: fix potential mainstash exception file 1 T315274 (duration: 03m 21s)

Mentioned in SAL (#wikimedia-operations) [2022-08-19T01:37:59Z] <tstarling@deploy1002> Synchronized php-1.39.0-wmf.25/includes/objectcache/SqlBagOStuff.php: fix potential mainstash exception file 2 T315274 (duration: 03m 30s)

My reason for reopening this task is resolved. The patch has been deployed, so a similar failure in future should not lead to user-visible exceptions.