DBReadOnlyRoleError: Database is read-only: Server is configured as a read-only static clone database.
Closed, ResolvedPublicPRODUCTION ERROR
Actions

Description

Error

mwversion: 1.39.0-wmf.8
reqId: afb4b359-afe6-4e80-b931-46fef57c0ca6
Find reqId in Logstash

normalized_message

[{reqId}] {exception_url}   Wikimedia\Rdbms\DBReadOnlyRoleError: Database is read-only: Server is configured as a read-only static clone database.

exception.trace

from /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/Database.php(953)
#0 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/Database.php(1259): Wikimedia\Rdbms\Database->assertIsWritablePrimary()
#1 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/DatabaseMysqlBase.php(1125): Wikimedia\Rdbms\Database->executeQuery(string, string, integer)
#2 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/Database.php(4800): Wikimedia\Rdbms\DatabaseMysqlBase->doFlushSession(string)
#3 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1937): Wikimedia\Rdbms\Database->flushSession(string, integer)
#4 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/loadbalancer/LoadBalancer.php(2240): Wikimedia\Rdbms\LoadBalancer->Wikimedia\Rdbms\{closure}(Wikimedia\Rdbms\DatabaseMysqli)
#5 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1938): Wikimedia\Rdbms\LoadBalancer->forEachOpenPrimaryConnection(Closure)
#6 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php(250): Wikimedia\Rdbms\LoadBalancer->flushPrimarySessions(string, integer)
#7 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(233): Wikimedia\Rdbms\LBFactory::Wikimedia\Rdbms\{closure}(Wikimedia\Rdbms\LoadBalancer, string, array)
#8 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php(252): Wikimedia\Rdbms\LBFactoryMulti->forEachLB(Closure, array)
#9 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php(346): Wikimedia\Rdbms\LBFactory->forEachLBCallMethod(string, array)
#10 /srv/mediawiki/php-1.39.0-wmf.8/includes/exception/MWExceptionHandler.php(138): Wikimedia\Rdbms\LBFactory->flushPrimarySessions(string)
#11 /srv/mediawiki/php-1.39.0-wmf.8/includes/exception/MWExceptionHandler.php(203): MWExceptionHandler::rollbackPrimaryChangesAndLog(Flow\Exception\InvalidActionException, string)
#12 /srv/mediawiki/php-1.39.0-wmf.8/includes/MediaWiki.php(585): MWExceptionHandler::handleException(Flow\Exception\InvalidActionException, string)
#13 /srv/mediawiki/php-1.39.0-wmf.8/index.php(50): MediaWiki->run()
#14 /srv/mediawiki/php-1.39.0-wmf.8/index.php(46): wfIndexMain()
#15 /srv/mediawiki/w/index.php(3): require(string)
#16 {main}

Additionally, this seems to trigger:

normalized_message

[{reqId}] {exception_url}   Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-rollback-session')

exception.trace

from /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php(829)
#0 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php(291): Wikimedia\Rdbms\LBFactory->assertTransactionRoundStage(string)
#1 /srv/mediawiki/php-1.39.0-wmf.8/includes/MediaWiki.php(1111): Wikimedia\Rdbms\LBFactory->commitPrimaryChanges(string)
#2 /srv/mediawiki/php-1.39.0-wmf.8/includes/MediaWiki.php(848): MediaWiki->restInPeace()
#3 /srv/mediawiki/php-1.39.0-wmf.8/includes/MediaWiki.php(591): MediaWiki->doPostOutputShutdown()
#4 /srv/mediawiki/php-1.39.0-wmf.8/index.php(50): MediaWiki->run()
#5 /srv/mediawiki/php-1.39.0-wmf.8/index.php(46): wfIndexMain()
#6 /srv/mediawiki/w/index.php(3): require(string)
#7 {main}

Impact

Notes

~64 instances of this in the last day, all for 1.39.0-wmf.8 (T305214).

It seems like this is the result of something erroneously pointing at a permanently read-only database (ExternalStore), so probably a mistake in configuration?

Details

Request URL: https://ca.wikipedia.org/wiki/[page]

	Subject	Repo	Branch	Lines +/-
	rdbms: avoid DBReadOnlyRoleError in Database::doFlushSession()	mediawiki/core	master	+4 -13

Customize query in gerrit

Related Objects

Mentioned In: T310214: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'enwikinews.`categorylinks`' doesn't exist
Mentioned Here: T193565: Foreign query for metawiki fails with "Table 'centralauth.page' doesn't exist" (DBConnRef mixup?)
T305214: 1.39.0-wmf.8 deployment blockers

Event Timeline

brennen created this task.Apr 21 2022, 3:29 PM

Restricted Application added a project: Growth-Team. · View Herald TranscriptApr 21 2022, 3:29 PM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Krinkle moved this task from Untriaged to Apr 2022 on the Wikimedia-production-error board.Apr 21 2022, 3:29 PM

First appearance is last Wednesday (Apr 20) 12:03 UTC (logstash), logged a few hundred times since. The matching SAL record is Repooling after maintenance db1119 by @Marostegui, not sure if it's related. It doesn't match any application code or config change.

Thanks for the ping. I have never seen this error before Server is configured as a read-only static clone database.. db1119 is a s1 slave and it's been never been configured as any other thing.
The SAL thing is probably just a coincidence as normally during the day we do dozen of pooling/repoolings so it is easy that anything can pretty much any of those.

From what I can see on logstash it is complaining about cawiki which doesn't live on db1119 as it lives on s7.
If there is a way to find which database it was specifically complaining about, I could take a look at the problematic host.

From the traceback this issue is obvious. The code is trying to write to a replica and mw's internal locks goes kaboom to avoid that (because yes, people have mistakenly ran write queries against replicas and caused data corruption before). The good thing is that it doesn't seem to be an issue with individual code. It's coming from forEachOpenPrimaryConnection so probably a maint script didn't get the updated config after a master switchover.

For context: In wikimedia/rdbms lingo, static refers to one or more db hosts that are logically neither primary nor replica, e.g. the older locked and frozen ExternalStore shards that are considered done and we no longer append new data to them.

The traceback involves Flow, which afaik indeed uses ExternalStore for its blobs. Hope that helps.

Then ignore my comment, my bad.

brennen moved this task from Backlog to Logs/Train on the User-brennen board.Apr 25 2022, 10:28 PM

Continuing to see this at a low level in 1.39.0-wmf.9.

I don't see a recent change to Flow in the Flow repository or wmf-config that would explain this, uses x1 and es more or less as-is, it doesn't specify individual servers afaik.

At this point I suspect a change or pre-existing race bug through Rdbms to have caused the wrong server to be operated on. (Maybe with distant relation to bug T193565.)

Another theory might be that db-configuration has wrongly changed to add a non-ES host to ES group (ExternalStore). This seems somewhat unlikely as it would presumably be more common and deterministic in that case, but it may be worth passing over to a DBA to evaluate if we ruled out Rdbms as cause.

I might be very wrong here so take it with a grain of salt. MWExceptionHandler::rollbackPrimaryChangesAndLog(Flow\Exception\InvalidActionException, string) starts the issue. I guess Flow throws an exception for whatever reason, the Rdbms tries to rollback and just sending the rollback command causes the second issue as it should not try to write on a read-only host. Probably rdbms doesn't know this doesn't need a rollback specially given the fact that it runs it on all dbs it has connection to (LBFactoryMulti->forEachLB(Closure, array) I guess).

Ack, so this is not an issue with how Flow uses Rdbms, but a problem in Rdbms.

Restricted Application added a project: Performance-Team. · View Herald TranscriptMay 26 2022, 3:18 PM

Krinkle moved this task from Untriaged to Rdbms library on the MediaWiki-libs-Rdbms board.May 26 2022, 3:19 PM

Krinkle claimed this task.May 31 2022, 6:51 PM

Krinkle reassigned this task from Krinkle to aaron.

Krinkle triaged this task as High priority.

Krinkle moved this task from Inbox, needs triage to Doing: Prio Interrupt on the Performance-Team board.

Change 801826 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: avoid DBReadOnlyRoleError in Database::doFlushSession()

https://gerrit.wikimedia.org/r/801826

gerritbot added a project: Patch-For-Review.May 31 2022, 11:01 PM

Change 801826 merged by jenkins-bot:

[mediawiki/core@master] rdbms: avoid DBReadOnlyRoleError in Database::doFlushSession()

https://gerrit.wikimedia.org/r/801826

ReleaseTaggerBot added a project: MW-1.39-notes (1.39.0-wmf.15; 2022-06-06).Jun 1 2022, 12:00 AM

Maintenance_bot removed a project: Patch-For-Review.Jun 1 2022, 12:30 AM

aaron closed this task as Resolved.Jun 7 2022, 9:48 PM

Krinkle mentioned this in T310214: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'enwikinews.`categorylinks`' doesn't exist.Jun 8 2022, 7:24 PM

DBReadOnlyRoleError: Database is read-only: Server is configured as a read-only static clone database.Closed, ResolvedPublicPRODUCTION ERRORActions