read only on mediawiki generates "LoadBalancer.php: Cannot access the database: Unknown error"
Closed, Resolved · Public · PRODUCTION ERROR

Description

For the network maintenance (T187960: Rack/cable/configure asw2-a-eqiad switch stack), we had to put s2 in read-only mode for 15 minutes (T217441: 15min read-only on some wikis for network maintenance on 2019-03-19).
During that time we saw a bunch of errors, which was more or less expected, but the messages are not very meaningful:

[XIzdWgpAIDEAADDJM5AAAACW] /rpc/RunSingleJob.php   Wikimedia\Rdbms\DBConnectionError from line 1193 of /srv/mediawiki/php-1.33.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: Unknown error (10.64.0.110)

#0 /srv/mediawiki/php-1.33.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php(751): Wikimedia\Rdbms\LoadBalancer->reportConnectionError()
#1 /srv/mediawiki/php-1.33.0-wmf.21/includes/page/WikiPage.php(485): Wikimedia\Rdbms\LoadBalancer->getConnection(integer)
#2 /srv/mediawiki/php-1.33.0-wmf.21/includes/jobqueue/jobs/RefreshLinksJob.php(149): WikiPage->loadPageData(integer)
#3 /srv/mediawiki/php-1.33.0-wmf.21/includes/jobqueue/jobs/RefreshLinksJob.php(122): RefreshLinksJob->runForTitle(Title)
#4 /srv/mediawiki/php-1.33.0-wmf.21/extensions/EventBus/includes/JobExecutor.php(65): RefreshLinksJob->run()
#5 /srv/mediawiki/rpc/RunSingleJob.php(77): JobExecutor->execute(array)
#6 {main}


[XIzdWgpAIDEAADDJM5AAAACW] /rpc/RunSingleJob.php   Wikimedia\Rdbms\DBConnectionError from line 1193 of /srv/mediawiki/php-1.33.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: Unknown error (10.64.0.110)

exception.file	       	/srv/mediawiki/php-1.33.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php:1193
exception.message	       	Cannot access the database: Unknown error (10.64.0.110)
host	       	mw1336

Related Objects

Mentioned In: T248481: Mysterious replication lag observed by MW in Codfw
Blog Post: Production Excellence #9: March 2019
Mentioned Here: T248481: Mysterious replication lag observed by MW in Codfw
T218005: Variable from InitialiseSettings can be undefined (corrupt opcache?)
T221347: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable)
T204154: Kafka JobQueue should respect DB readonly mode
T216496: Misleading "replica catching up" error when master DB is down
T187960: Rack/cable/configure asw2-a-eqiad switch stack
T217441: 15min read-only on some wikis for network maintenance on 2019-03-19

Event Timeline

Marostegui created this task. Mar 19 2019, 3:42 PM

Restricted Application added a subscriber: Aklapper. Mar 19 2019, 3:42 PM

I think all error messages need to be revisited. They're not very descriptive. (See T216496)

Krinkle moved this task from Untriaged to Sep2019/1.34.wmf.21+ on the Wikimedia-production-error board. Mar 21 2019, 8:07 PM

What's the request? To have a nicer error message? Or avoid the error?

• kchapman edited projects, added Platform Team Legacy, WMF-JobQueue; removed Platform Engineering. Mar 26 2019, 2:16 PM

• kchapman moved this task from Inbox to Watching / External on the Platform Team Legacy board.

• kchapman edited projects, added Platform Team Legacy (Watching / External); removed Platform Team Legacy.

In T218692#5057858, @daniel wrote:

What's the request? To have a nicer error message? Or avoid the error?

I think it is a mix of both, if possible :-). If we can investigate what the error was and get it fixed, that'd be nice. But in case it cannot be avoided or cannot be fixed easily, showing a nicer error message would probably help to troubleshoot further issues.
Thanks!

Krinkle moved this task from Untriaged to Rdbms library on the MediaWiki-libs-Rdbms board. Apr 3 2019, 2:04 AM

So there are two issues here:

1. When the MediaWiki database is in read-only mode, jobs must not be run. Otherwise, the jobs' writes fail with a fatal error, the job runs out of retries (if any are allowed), and the scheduled work and its parameters are lost indefinitely (e.g. sending a newsletter, deleting a page, clearing a watchlist, etc.).
2. In a web execution context (outside jobs, assuming point 1 is fixed), when read-only mode is encountered, rdbms should offer a descriptive error that identifies the problem as being read-only.

In T218692#5126035, @Krinkle wrote:

When the MediaWiki database is in read-only mode, jobs must not be run. Otherwise, the jobs' writes fail with a fatal error, the job runs out of retries (if any are allowed), and the scheduled work and its parameters are lost indefinitely (e.g. sending a newsletter, deleting a page, clearing a watchlist, etc.).

This case is already handled by the JobQueue (cf. T204154: Kafka JobQueue should respect DB readonly mode): when the database layer throws a Wikimedia\Rdbms\DBReadOnlyError, the job runners hold off for 45 seconds and the job is then re-enqueued without affecting the retry count, so jobs are retried indefinitely for as long as the failure is caused by the DB being in read-only mode. The problem here, however, is that the underlying database layer did not signal to the job runners that the DB was in read-only mode.
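
To make the re-enqueue behaviour described in this comment concrete, here is a minimal sketch. It is written in Python rather than MediaWiki's PHP and is not the actual job runner code: the `Job` class, `process` function and `MAX_ATTEMPTS` limit are hypothetical names invented for illustration, and only the 45-second hold-off and the "retry count is not affected" rule come from the comment itself.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


class DBReadOnlyError(Exception):
    """Stand-in for MediaWiki's DBReadOnlyError (hypothetical Python analogue)."""


@dataclass
class Job:
    name: str
    run: Callable[[], None]
    attempts: int = 0  # counts only failures that are the job's own fault


READ_ONLY_BACKOFF_SECONDS = 45  # hold-off quoted in the comment above
MAX_ATTEMPTS = 3                # hypothetical retry limit, for illustration only


def process(queue: Deque[Job], sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Drain the queue and return the names of jobs that ran out of retries."""
    dropped: List[str] = []
    while queue:
        job = queue.popleft()
        try:
            job.run()
        except DBReadOnlyError:
            # Read-only mode is not the job's fault: wait, then put the job
            # back without touching its retry counter.
            sleep(READ_ONLY_BACKOFF_SECONDS)
            queue.append(job)
        except Exception:
            # Any other failure consumes one retry.
            job.attempts += 1
            if job.attempts < MAX_ATTEMPTS:
                queue.append(job)
            else:
                dropped.append(job.name)
    return dropped
```

The sketch also shows why the exception type matters: if the database layer raises a generic connection error instead of the read-only error, the job falls into the second branch, burns its retries and is eventually dropped.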

Krinkle mentioned this in Blog Post: Production Excellence #9: March 2019. Apr 21 2019, 6:51 PM

Thanks @mobrovac, so next step here is figuring out why we got "Unknown error" instead of DBReadOnlyError. Tagging Performance Team for that.

The underlying issue has since subsided (the db is no longer read-only and the message is no longer being logged). Moving to Meta for improving this in the future.

Krinkle claimed this task. May 6 2019, 8:00 PM

Krinkle triaged this task as Medium priority.

Krinkle moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.

@Marostegui Does this happen every time we go read-only, or only this time? Do you know if MW was also set to read-only from wmf-config for this, or was it a case of detecting read-only at run-time?

In T218692#5161637, @Krinkle wrote:

@Marostegui Does this happen every time we go read-only, or only this time? Do you know if MW was also set to read-only from wmf-config for this, or was it a case of detecting read-only at run-time?

I haven't seen this until the last master failover (I could have missed it on earlier ones, not sure).
For a master failover we set both MW and MySQL as read only.
Typically after deploying MW as read only, we do a set global read_only=ON on the master's mysql prompt to be double sure

In T218692#5163379, @Marostegui wrote:

Typically after deploying MW as read only, we do a set global read_only=ON on the master's mysql prompt to be double sure.

Yeah, if not already, this should be a mandatory step because there can be several reasons for the MW config to not apply (yet). Such as:

Job runners that are still active with a job from before the deploy (e.g. transcoding jobs may take several hours).
Depooled server missing the sync.
Rsync failing during the deployment and skipping 1 server.
Known bugs such as T221347 and T218005.

In T218692#5164605, @Krinkle wrote:

In T218692#5163379, @Marostegui wrote:

Typically after deploying MW as read only, we do a set global read_only=ON on the master's mysql prompt to be double sure.

Yeah, if not already, this should be a mandatory step because there can be several reasons for the MW config to not apply (yet).

Yeah, it is mandatory :)

Krinkle reassigned this task from Krinkle to aaron. May 7 2019, 9:43 PM

Not sure what to do with this. It seems like some kind of connectivity problem, not just the server being read-only.

Might be useful to briefly test setting read_only=ON for a bit to see if anything happens again. I don't see why that alone would cause this.

Maybe we can do it in codfw and then test via mwdebug?

• mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:07 PM

aaron moved this task from Doing (old) to Backlog: Maintenance, non-prioritized on the Performance-Team board. Sep 30 2019, 8:41 PM

Krinkle removed a project: Wikimedia-production-error. Oct 12 2019, 11:28 PM

Krinkle moved this task from Untriaged to Core on the WMF-JobQueue board. Mar 6 2020, 11:12 PM

In T218692#5290071, @Krinkle wrote:

Maybe we can do it in codfw and then test via mwdebug?

Marking as stalled until we can schedule together to simulate this issue again to investigate it together.

Sure, we can coordinate and test it together on codfw with mwdebug sometime.

Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board. Mar 9 2020, 7:24 AM

Krinkle claimed this task. Mar 17 2020, 7:09 PM

Krinkle added a subscriber: aaron.

Krinkle mentioned this in T248481: Mysterious replication lag observed by MW in Codfw. Mar 25 2020, 2:50 PM

This is now confirmed to be fixed. With @Marostegui we confirmed that the Codfw DB master was read_only, and then on mwdebug2001.codfw we set wgReadOnly=false locally so that MediaWiki would find out about the read-only state by itself.

When it did, write queries correctly threw a DBReadOnlyError and resulted in the expected error message in the expected places.

For example,

The edit page form shows the DB is read-only and says why (maintenance, master db is read-only).
Internal write attempts from deferred updates (such as via ResourceLoader) are reported to Logstash with a "master db is read-only" error message.
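
The behaviour verified above (configuration says "not read-only", but the server itself is, and writes still fail with a descriptive read-only error) can be illustrated with a small sketch. This is Python rather than MediaWiki's PHP and does not reproduce the rdbms library: `read_only_reason`, `run_query`, `WRITE_VERBS` and the callback parameters are all hypothetical names chosen for illustration. Only the "master db is read-only" wording is taken from the report.

```python
from typing import Callable, Optional


class DBReadOnlyError(Exception):
    """Stand-in for MediaWiki's DBReadOnlyError (hypothetical Python analogue)."""


# Simplified assumption: a query counts as a write if it starts with one of these.
WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")


def read_only_reason(
    config_reason: Optional[str],
    server_is_read_only: Callable[[], bool],
) -> Optional[str]:
    """Return why writes are disallowed, or None if they are allowed."""
    if config_reason:
        # Read-only was requested through configuration.
        return config_reason
    if server_is_read_only():
        # Configuration did not say so, but the server itself refuses writes.
        return "master db is read-only"
    return None


def run_query(
    sql: str,
    config_reason: Optional[str],
    server_is_read_only: Callable[[], bool],
    execute: Callable[[str], object],
) -> object:
    """Run a query, rejecting writes with a descriptive read-only error."""
    if sql.lstrip().upper().startswith(WRITE_VERBS):
        reason = read_only_reason(config_reason, server_is_read_only)
        if reason is not None:
            raise DBReadOnlyError(f"Database is read-only: {reason}")
    return execute(sql)
```

In this sketch, passing `None` as `config_reason` mirrors the test setup with wgReadOnly=false: the read-only state is then discovered only by asking the server, and reads still go through.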

During the investigation we saw another bug, which is reported now as T248481.

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Apr 28 2020, 9:50 PM

Hi guys, did you find a solution for the problem you mentioned?

I'm trying to install Universal Language Selector. Before that I installed Babel, and now I'm trying to run this command:

php update.php

"Wikimedia\Rdbms\DBConnectionError from line 1460 of C:\wamp64\www\polkadotpreprod\includes\libs\rdbms\loadbalancer\LoadBalancer.php: Cannot access the database:"

MediaWiki 1.38
Thanks in advance :)

@bitcoinwikiguy: That's unrelated - different reason but same output message. Please ask on https://www.mediawiki.org/wiki/Project:Support_desk - thanks.
