read only on mediawiki generates "LoadBalancer.php: Cannot access the database: Unknown error"
Open, Normal, Public

Description

For the network maintenance (T187960: Rack/cable/configure asw2-a-eqiad switch stack) we had to put s2 in read-only mode for 15 minutes (T217441: 15min read-only on some wikis for network maintenance on 2019-03-19).
During that time we saw a bunch of errors, which was more or less expected, but the messages are not very meaningful:

[XIzdWgpAIDEAADDJM5AAAACW] /rpc/RunSingleJob.php   Wikimedia\Rdbms\DBConnectionError from line 1193 of /srv/mediawiki/php-1.33.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: Unknown error (10.64.0.110)
#0 /srv/mediawiki/php-1.33.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php(751): Wikimedia\Rdbms\LoadBalancer->reportConnectionError()
#1 /srv/mediawiki/php-1.33.0-wmf.21/includes/page/WikiPage.php(485): Wikimedia\Rdbms\LoadBalancer->getConnection(integer)
#2 /srv/mediawiki/php-1.33.0-wmf.21/includes/jobqueue/jobs/RefreshLinksJob.php(149): WikiPage->loadPageData(integer)
#3 /srv/mediawiki/php-1.33.0-wmf.21/includes/jobqueue/jobs/RefreshLinksJob.php(122): RefreshLinksJob->runForTitle(Title)
#4 /srv/mediawiki/php-1.33.0-wmf.21/extensions/EventBus/includes/JobExecutor.php(65): RefreshLinksJob->run()
#5 /srv/mediawiki/rpc/RunSingleJob.php(77): JobExecutor->execute(array)
#6 {main}


[XIzdWgpAIDEAADDJM5AAAACW] /rpc/RunSingleJob.php   Wikimedia\Rdbms\DBConnectionError from line 1193 of /srv/mediawiki/php-1.33.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: Unknown error (10.64.0.110)
exception.file	       	/srv/mediawiki/php-1.33.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php:1193
exception.message	       	Cannot access the database: Unknown error (10.64.0.110)
host	       	mw1336

Event Timeline

Restricted Application added a subscriber: Aklapper. Mar 19 2019, 3:42 PM

I think all error messages need to be revisited. They're not very descriptive. (See T216496)

daniel added a subscriber: daniel. Mar 26 2019, 2:14 PM

What's the request? To have a nicer error message? Or avoid the error?

> What's the request? To have a nicer error message? Or avoid the error?

I think it is a mix of both, if possible :-). If we can investigate what the error was and get it fixed, that'd be nice. But if it cannot be avoided or cannot be fixed easily, showing a nicer error message would probably help when troubleshooting further issues.
Thanks!

So there are two issues here:

  1. When the MediaWiki database is in read-only mode, jobs must not be run. Otherwise, the job's writes fail with a fatal error, the job runs out of retries (if any are allowed), and the scheduled work and its parameters are lost permanently (e.g. sending a newsletter, deleting a page, clearing a watchlist, etc.).
  2. When in web execution context (outside jobs, assuming point 1 is fixed), and read-only mode is encountered, rdbms should offer a descriptive error that identifies the problem as being read-only.
> 1. When the MediaWiki database is in read-only mode, jobs must not be run. Otherwise, the job's writes fail with a fatal error, the job runs out of retries (if any are allowed), and the scheduled work and its parameters are lost permanently (e.g. sending a newsletter, deleting a page, clearing a watchlist, etc.).

This case is already handled by the JobQueue (cf. T204154: Kafka JobQueue should respect DB readonly mode): when the database layer spews a Wikimedia\Rdbms\DBReadOnlyError, the job runners hold off for 45 seconds and the job is then re-enqueued (without affecting the retry count, so all jobs will be retried indefinitely if the failure is caused by the DB being in read-only mode). However, the problem here is that the underlying database layer does not signal to the job runners that the DB is in read-only mode.
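The retry behaviour described above can be modelled roughly as follows. This is a minimal sketch, not the actual JobQueue or EventBus code: the exception classes are plain stand-ins for the Rdbms ones, and `run_queue`, `max_retries`, and the injected `sleep` are hypothetical names chosen for illustration. Only the 45-second hold-off comes from the comment above.

```python
from collections import deque


class DBReadOnlyError(Exception):
    """Stand-in for the Rdbms DBReadOnlyError (not the real class)."""


class DBConnectionError(Exception):
    """Stand-in for the Rdbms DBConnectionError (not the real class)."""


def run_queue(jobs, execute, max_retries=3, read_only_backoff=45,
              sleep=lambda seconds: None):
    """Run jobs until the queue drains; return the jobs that were dropped.

    `execute` is a callable taking one job and possibly raising.
    `sleep` is injected so the sketch does not actually wait 45 seconds.
    """
    queue = deque((job, 0) for job in jobs)
    dropped = []
    while queue:
        job, attempts = queue.popleft()
        try:
            execute(job)
        except DBReadOnlyError:
            # Read-only was signalled explicitly: hold off, then re-enqueue
            # without counting the attempt, so the job is never lost.
            sleep(read_only_backoff)
            queue.append((job, attempts))
        except DBConnectionError:
            # A generic failure ("Unknown error") consumes a retry.
            if attempts + 1 >= max_retries:
                dropped.append(job)
            else:
                queue.append((job, attempts + 1))
    return dropped
```

In this model, a job that keeps hitting `DBReadOnlyError` survives an arbitrarily long read-only window, while the same window reported as `DBConnectionError` drops the job once `max_retries` attempts are used up. That is why the type of exception raised by the database layer matters here.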

Thanks @mobrovac, so next step here is figuring out why we got "Unknown error" instead of DBReadOnlyError. Tagging Performance Team for that.

Krinkle moved this task from Sep 2019 / 1.34wmf21-25 to Meta on the Wikimedia-production-error board.

The underlying issue has since subsided (the db is no longer read-only and the message is no longer being logged). Moving to Meta for improving this in the future.

Krinkle claimed this task. May 6 2019, 8:00 PM
Krinkle triaged this task as Normal priority.
Krinkle moved this task from Inbox to Doing on the Performance-Team board.

@Marostegui Does this happen every time we go read-only, or only this time? Do you know if MW was also set to read-only from wmf-config for this, or was it a case of detecting read-only at run-time?

> @Marostegui Does this happen every time we go read-only, or only this time? Do you know if MW was also set to read-only from wmf-config for this, or was it a case of detecting read-only at run-time?

I haven't seen this until the last master failover (I could have missed it on earlier ones, not sure).
For a master failover we set both MW and MySQL as read only.
Typically after deploying MW as read only, we do a set global read_only=ON on the master's mysql prompt to be double sure

Krinkle added a comment. Edited May 7 2019, 3:02 PM

> Typically after deploying MW as read only, we do a set global read_only=ON on the master's mysql prompt to be double sure.

Yeah, if not already, this should be a mandatory step because there can be several reasons for the MW config to not apply (yet). Such as:

  • Job runners that are still active with a job from before the deploy (e.g. transcoding jobs may take several hours).
  • Depooled server missing the sync.
  • Rsync failing during the deployment and skipping 1 server.
  • Known bugs such as T221347 and T218005.
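To make the point of the list above concrete, here is a small sketch of why the server-side flag acts as a backstop. It is only a toy model: `AppServer`, `writes_reaching_master`, and the host names are hypothetical and do not correspond to MediaWiki or wmf-config code.

```python
from dataclasses import dataclass


@dataclass
class AppServer:
    name: str
    config_read_only: bool  # what this host's deployed MW config currently says


def writes_reaching_master(servers, master_read_only):
    """Return the hosts whose writes the master would still accept."""
    if master_read_only:
        # With read_only=ON on the master, stale hosts are rejected too.
        return []
    # Otherwise only hosts that picked up the read-only config hold back.
    return [s.name for s in servers if not s.config_read_only]


servers = [
    AppServer("mw-synced", config_read_only=True),
    AppServer("mw-stale", config_read_only=False),  # e.g. missed the sync
]
```

With these inputs, `writes_reaching_master(servers, master_read_only=False)` still lists the stale host, whereas setting the flag on the master leaves no host able to write.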

> > Typically after deploying MW as read only, we do a set global read_only=ON on the master's mysql prompt to be double sure.
>
> Yeah, if not already, this should be a mandatory step because there can be several reasons for the MW config to not apply (yet).

Yeah, it is mandatory :)

Krinkle reassigned this task from Krinkle to aaron. May 7 2019, 9:43 PM
aaron added a comment. Jun 27 2019, 8:34 AM

Not sure what to do with this. It seems like some kind of connectivity problem, not just the server being read-only.

Might be useful to briefly test setting read_only=ON for a bit to see if anything happens again. I don't see why that alone would cause this.

Maybe we can do it in codfw and then test via mwdebug?

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:07 PM