Page MenuHomePhabricator

DBReadOnlyRoleError: Database is read-only: Server is configured as a read-only static clone database.
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   Wikimedia\Rdbms\DBReadOnlyRoleError: Database is read-only: Server is configured as a read-only static clone database.
exception.trace
from /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/Database.php(953)
#0 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/Database.php(1259): Wikimedia\Rdbms\Database->assertIsWritablePrimary()
#1 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/DatabaseMysqlBase.php(1125): Wikimedia\Rdbms\Database->executeQuery(string, string, integer)
#2 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/database/Database.php(4800): Wikimedia\Rdbms\DatabaseMysqlBase->doFlushSession(string)
#3 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1937): Wikimedia\Rdbms\Database->flushSession(string, integer)
#4 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/loadbalancer/LoadBalancer.php(2240): Wikimedia\Rdbms\LoadBalancer->Wikimedia\Rdbms\{closure}(Wikimedia\Rdbms\DatabaseMysqli)
#5 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1938): Wikimedia\Rdbms\LoadBalancer->forEachOpenPrimaryConnection(Closure)
#6 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php(250): Wikimedia\Rdbms\LoadBalancer->flushPrimarySessions(string, integer)
#7 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(233): Wikimedia\Rdbms\LBFactory::Wikimedia\Rdbms\{closure}(Wikimedia\Rdbms\LoadBalancer, string, array)
#8 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php(252): Wikimedia\Rdbms\LBFactoryMulti->forEachLB(Closure, array)
#9 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php(346): Wikimedia\Rdbms\LBFactory->forEachLBCallMethod(string, array)
#10 /srv/mediawiki/php-1.39.0-wmf.8/includes/exception/MWExceptionHandler.php(138): Wikimedia\Rdbms\LBFactory->flushPrimarySessions(string)
#11 /srv/mediawiki/php-1.39.0-wmf.8/includes/exception/MWExceptionHandler.php(203): MWExceptionHandler::rollbackPrimaryChangesAndLog(Flow\Exception\InvalidActionException, string)
#12 /srv/mediawiki/php-1.39.0-wmf.8/includes/MediaWiki.php(585): MWExceptionHandler::handleException(Flow\Exception\InvalidActionException, string)
#13 /srv/mediawiki/php-1.39.0-wmf.8/index.php(50): MediaWiki->run()
#14 /srv/mediawiki/php-1.39.0-wmf.8/index.php(46): wfIndexMain()
#15 /srv/mediawiki/w/index.php(3): require(string)
#16 {main}

Additionally, this seems to trigger:

normalized_message
[{reqId}] {exception_url}   Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-rollback-session')
exception.trace
from /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php(829)
#0 /srv/mediawiki/php-1.39.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php(291): Wikimedia\Rdbms\LBFactory->assertTransactionRoundStage(string)
#1 /srv/mediawiki/php-1.39.0-wmf.8/includes/MediaWiki.php(1111): Wikimedia\Rdbms\LBFactory->commitPrimaryChanges(string)
#2 /srv/mediawiki/php-1.39.0-wmf.8/includes/MediaWiki.php(848): MediaWiki->restInPeace()
#3 /srv/mediawiki/php-1.39.0-wmf.8/includes/MediaWiki.php(591): MediaWiki->doPostOutputShutdown()
#4 /srv/mediawiki/php-1.39.0-wmf.8/index.php(50): MediaWiki->run()
#5 /srv/mediawiki/php-1.39.0-wmf.8/index.php(46): wfIndexMain()
#6 /srv/mediawiki/w/index.php(3): require(string)
#7 {main}
Impact
Notes

~64 instances of this in the last day, all for 1.39.0-wmf.8 (T305214).

It seems like this is the result of something erroneously pointing at a permanently read-only database (ExternalStore), so probably a mistake in configuration?

Details

Request URL
https://ca.wikipedia.org/wiki/[page]

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Tgr added subscribers: Marostegui, Tgr.

First appearance is last Wednesday (Apr 20) 12:03 UTC (logstash), logged a few hundred times since. The matching SAL record is Repooling after maintenance db1119 by @Marostegui, not sure if it's related. It doesn't match any application code or config change.

Thanks for the ping. I have never seen this error before Server is configured as a read-only static clone database.. db1119 is a s1 slave and it's been never been configured as any other thing.
The SAL thing is probably just a coincidence as normally during the day we do dozen of pooling/repoolings so it is easy that anything can pretty much any of those.

From what I can see on logstash it is complaining about cawiki which doesn't live on db1119 as it lives on s7.
If there is a way to find which database it was specifically complaining about, I could take a look at the problematic host.

From the traceback this issue is obvious. The code is trying to write to a replica and mw's internal locks goes kaboom to avoid that (because yes, people have mistakenly ran write queries against replicas and caused data corruption before). The good thing is that it doesn't seem to be an issue with individual code. It's coming from forEachOpenPrimaryConnection so probably a maint script didn't get the updated config after a master switchover.

For context: In wikimedia/rdbms lingo, static refers to one or more db hosts that are logically neither primary nor replica, e.g. the older locked and frozen ExternalStore shards that are considered done and we no longer append new data to them.

The traceback involves Flow, which afaik indeed uses ExternalStore for its blobs. Hope that helps.

Then ignore my comment, my bad.

Continuing to see this at a low level in 1.39.0-wmf.9.

I don't see a recent change to Flow in the Flow repository or wmf-config that would explain this, uses x1 and es more or less as-is, it doesn't specify individual servers afaik.

At this point I suspect a change or pre-existing race bug through Rdbms to have caused the wrong server to be operated on. (Maybe with distant relation to bug T193565.)

Another theory might be that db-configuration has wrongly changed to add a non-ES host to ES group (ExternalStore). This seems somewhat unlikely as it would presumably be more common and deterministic in that case, but it may be worth passing over to a DBA to evaluate if we ruled out Rdbms as cause.

I might be very wrong here so take it with a grain of salt. MWExceptionHandler::rollbackPrimaryChangesAndLog(Flow\Exception\InvalidActionException, string) starts the issue. I guess Flow throws an exception for whatever reason, the Rdbms tries to rollback and just sending the rollback command causes the second issue as it should not try to write on a read-only host. Probably rdbms doesn't know this doesn't need a rollback specially given the fact that it runs it on all dbs it has connection to (LBFactoryMulti->forEachLB(Closure, array) I guess).

Ack, so this is not an issue with how Flow uses Rdbms, but a problem in Rdbms.

Krinkle reassigned this task from Krinkle to aaron.
Krinkle triaged this task as High priority.
Krinkle moved this task from Inbox to Doing: Prio Interrupt on the Performance-Team board.

Change 801826 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: avoid DBReadOnlyRoleError in Database::doFlushSession()

https://gerrit.wikimedia.org/r/801826

Change 801826 merged by jenkins-bot:

[mediawiki/core@master] rdbms: avoid DBReadOnlyRoleError in Database::doFlushSession()

https://gerrit.wikimedia.org/r/801826