read only on mediawiki generates "LoadBalancer.php: Cannot access the database: Unknown error"
Closed, Resolved. Public. Production Error.

Description

For the network maintenance (T187960: Rack/cable/configure asw2-a-eqiad switch stack), we had to put s2 in read-only mode for 15 minutes (T217441: 15min read-only on some wikis for network maintenance on 2019-03-19).
During that time we saw a number of errors. That was expected, but the messages are not very meaningful:

[XIzdWgpAIDEAADDJM5AAAACW] /rpc/RunSingleJob.php   Wikimedia\Rdbms\DBConnectionError from line 1193 of /srv/mediawiki/php-1.33.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: Unknown error (10.64.0.110)
#0 /srv/mediawiki/php-1.33.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php(751): Wikimedia\Rdbms\LoadBalancer->reportConnectionError()
#1 /srv/mediawiki/php-1.33.0-wmf.21/includes/page/WikiPage.php(485): Wikimedia\Rdbms\LoadBalancer->getConnection(integer)
#2 /srv/mediawiki/php-1.33.0-wmf.21/includes/jobqueue/jobs/RefreshLinksJob.php(149): WikiPage->loadPageData(integer)
#3 /srv/mediawiki/php-1.33.0-wmf.21/includes/jobqueue/jobs/RefreshLinksJob.php(122): RefreshLinksJob->runForTitle(Title)
#4 /srv/mediawiki/php-1.33.0-wmf.21/extensions/EventBus/includes/JobExecutor.php(65): RefreshLinksJob->run()
#5 /srv/mediawiki/rpc/RunSingleJob.php(77): JobExecutor->execute(array)
#6 {main}


[XIzdWgpAIDEAADDJM5AAAACW] /rpc/RunSingleJob.php   Wikimedia\Rdbms\DBConnectionError from line 1193 of /srv/mediawiki/php-1.33.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: Unknown error (10.64.0.110)
exception.file	       	/srv/mediawiki/php-1.33.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php:1193
exception.message	       	Cannot access the database: Unknown error (10.64.0.110)
host	       	mw1336

Event Timeline

I think all error messages need to be revisited. They're not very descriptive. (See T216496)

What's the request? To have a nicer error message? Or avoid the error?

I think it is a mix of both, if possible :-). If we can investigate what the error was and get it fixed, that'd be nice. If it cannot be avoided or cannot be fixed easily, showing a nicer error message would help with troubleshooting further issues.
Thanks!

So there are two issues here:

  1. When the MediaWiki database is in read-only mode, jobs must not be run. Otherwise, the jobs' writes would fail with a fatal error, the job would run out of retries (if any are allowed), and the scheduled work and its parameters would be lost permanently (e.g. sending a newsletter, deleting a page, clearing a watchlist, etc.)
  2. When in web execution context (outside jobs, assuming point 1 is fixed), and read-only mode is encountered, rdbms should offer a descriptive error that identifies the problem as being read-only.
The first case (jobs must not be run while the database is read-only) is already handled by the JobQueue (cf. T204154: Kafka JobQueue should respect DB readonly mode): when the database layer throws a Wikimedia\Rdbms\DBReadOnlyError, the job runners hold off for 45 seconds and the job is then re-enqueued without affecting the retry count, so jobs are retried indefinitely for as long as the failure is caused by the DB being in read-only mode. The problem here is that the underlying database layer does not signal to the job runners that the DB is in read-only mode.
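
To illustrate why the exception type matters, here is a minimal Python sketch of the retry policy described above. It is not the actual job runner code: the exception classes are stand-ins for the Rdbms ones, and `run_queue`, `max_retries` and the queue layout are hypothetical names made up for this example. Only the 45-second hold-off and the "read-only does not consume a retry" rule come from the comment above.

```python
import time
from collections import deque


# Stand-in for Wikimedia\Rdbms\DBReadOnlyError (assumption: illustration only).
class DBReadOnlyError(Exception):
    pass


# Stand-in for Wikimedia\Rdbms\DBConnectionError (assumption: illustration only).
class DBConnectionError(Exception):
    pass


READ_ONLY_BACKOFF_SECONDS = 45  # hold-off period mentioned above


def run_queue(queue, max_retries=3, sleep=time.sleep):
    """Drain `queue` of (job, retries_used) pairs and return the dropped jobs."""
    dropped = []
    while queue:
        job, retries_used = queue.popleft()
        try:
            job()
        except DBReadOnlyError:
            # Read-only: wait, then re-enqueue with the retry count untouched.
            sleep(READ_ONLY_BACKOFF_SECONDS)
            queue.append((job, retries_used))
        except DBConnectionError:
            # Any other DB failure consumes a retry; once none are left,
            # the job and its parameters are lost.
            if retries_used < max_retries:
                queue.append((job, retries_used + 1))
            else:
                dropped.append(job)
    return dropped
```

With this policy, a job that hits a read-only database is only delayed, while a job that gets a generic connection error (as in the trace in the description) eventually runs out of retries and is dropped.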

Thanks @mobrovac, so next step here is figuring out why we got "Unknown error" instead of DBReadOnlyError. Tagging Performance Team for that.

Krinkle moved this task from Sep2019/1.34.wmf.21+ to Mar 2021 on the Wikimedia-production-error board.

The underlying issue has since subsided (the db is no longer read-only and the message is no longer being logged). Moving to Meta for improving this in the future.

Krinkle triaged this task as Medium priority.
Krinkle moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.

@Marostegui Does this happen every time we go read-only, or only this time? Do you know if MW was also set to read-only from wmf-config for this, or was it a case of detecting read-only at run-time?

I haven't seen this until the last master failover (I could have missed it on earlier ones, not sure).
For a master failover we set both MW and MySQL as read only.
Typically, after deploying MW as read-only, we run set global read_only=ON on the master's mysql prompt to be doubly sure.

Yeah, if not already, this should be a mandatory step because there can be several reasons for the MW config to not apply (yet). Such as:

  • Job runners that are still active with a job from before the deploy (e.g. transcoding jobs may take several hours).
  • Depooled server missing the sync.
  • Rsync failing during the deployment and skipping 1 server.
  • Known bugs such as T221347 and T218005.

Yeah, it is mandatory :)

Not sure what to do with this. It seems like some kind of connectivity problem, not just the server being read-only.

It might be useful to briefly test setting read_only=ON to see if anything happens again. I don't see why that alone would cause this.

Maybe we can do it in codfw and then test via mwdebug?

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:07 PM
Krinkle changed the task status from Open to Stalled. Mar 7 2020, 12:57 AM
Krinkle added projects: Wikimedia-Incident, DBA.

Marking as stalled until we can schedule a time to simulate this issue again and investigate it together.

Sure, we can coordinate and test it together on codfw with mwdebug at some point.

Krinkle closed this task as Resolved. Mar 25 2020, 2:52 PM

This is now confirmed to be fixed. With @Marostegui we confirmed that the Codfw DB master is read_only, and then on mwdebug2001.codfw we set wgReadOnly=false locally so that MediaWiki would have to find out about the read-only state by itself.

When it did, write queries correctly threw a DBReadOnlyError and produced the expected error message in the expected places.

For example,

  • The edit page form shows the DB is read-only and says why (maintenance, master db is read-only).
  • Internal write attempts from deferred updates (such as via ResourceLoader) are reported to Logstash with a "master db is read-only" error message.
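
The run-time detection tested here can be pictured with a small Python sketch. It is a simplified model, not MediaWiki's rdbms code: `FakePrimary`, `read_only_reason` and `guarded_write` are hypothetical names, and the message text simply reuses the wording quoted above. The point is that a write is refused with a descriptive read-only error even when the site configuration says nothing about read-only mode.

```python
# Stand-in for Wikimedia\Rdbms\DBReadOnlyError (assumption: illustration only).
class DBReadOnlyError(Exception):
    pass


class FakePrimary:
    """Hypothetical primary-DB handle used only for this example."""

    def __init__(self, read_only):
        self.read_only = read_only
        self.executed = []

    def is_read_only(self):
        # A real client would ask the server,
        # e.g. SELECT @@global.read_only on MySQL.
        return self.read_only

    def execute(self, sql):
        self.executed.append(sql)


def read_only_reason(config_reason, primary):
    """Return why writes are refused, or None if writes are allowed."""
    if config_reason:
        # Read-only was set in configuration (the wgReadOnly case).
        return config_reason
    if primary.is_read_only():
        # Configuration allows writes, but the server itself does not.
        return "master db is read-only"
    return None


def guarded_write(primary, sql, config_reason=None):
    """Run a write query, or raise a descriptive error if the DB is read-only."""
    reason = read_only_reason(config_reason, primary)
    if reason is not None:
        raise DBReadOnlyError(f"Database is read-only: {reason}")
    primary.execute(sql)
```

In this model, calling `guarded_write(FakePrimary(read_only=True), "UPDATE ...")` with no configured reason raises `DBReadOnlyError` instead of a generic connection error, which mirrors the outcome of the mwdebug test.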

During the investigation we saw another bug, which is reported now as T248481.

Hi guys, did you find a solution for the problem you mentioned?

I'm trying to install Universal Language Selector. Before that I installed Babel, and now I'm trying to run this command:

php update.php


"Wikimedia\Rdbms\DBConnectionError from line 1460 of C:\wamp64\www\polkadotpreprod\includes\libs\rdbms\loadbalancer\LoadBalancer.php: Cannot access the database:"

Mediawiki 1.38
Thanks in advance :)

@bitcoinwikiguy: That's unrelated - different reason but same output message. Please ask on https://www.mediawiki.org/wiki/Project:Support_desk - thanks.