Page MenuHomePhabricator

Investigate rare "DBConnectionError: No working replica DB server"
Closed, ResolvedPublic

Description

It seems that sometimes wfGetDB() can throw in a way that doesn't recover by itself.

Request ID: XH6GfwpAMFYAAKFRdRcAAADD
Request URL: /w/load.php?debug=false&lang=zh-hans&modules=...
Log time: 2019-03-05T14:24:08

message
Wikimedia\Rdbms\DBConnectionError: Cannot access the database: No working replica DB server: Unknown error (10.64.16.191)
Stack trace:
#0 /srv/mediawiki/php-1.33.0-wmf.19/includes/libs/rdbms/loadbalancer/LoadBalancer.php(742): Wikimedia\Rdbms\LoadBalancer->reportConnectionError()
#1 /srv/mediawiki/php-1.33.0-wmf.19/includes/GlobalFunctions.php(2654): Wikimedia\Rdbms\LoadBalancer->getConnection()
#2 /srv/mediawiki/php-1.33.0-wmf.19/includes/resourceloader/ResourceLoader.php(120): wfGetDB()
#3 /srv/mediawiki/php-1.33.0-wmf.19/includes/resourceloader/ResourceLoader.php(747): ResourceLoader->preloadModuleInfo()
#4 /srv/mediawiki/php-1.33.0-wmf.19/load.php(46): ResourceLoader->respond()
#5 /srv/mediawiki/w/load.php(3): include()
#6 {main}

Error is thrown from LoadBalancer.php:740.

Impact

The given instance here is from load.php, which does actually catch and recover from error because (unlike most MW code) RL doesn't actually hard-depend on being able to use the database. (It then logs the above as level=WARNING for informational purposes.)

But RL's handling for this is unique to it, and outside the wfGetDB code path. Until proven otherwise I'm going to assume this is affecting all of MediaWiki and not just RL. And for MW requests, this would be fatal. But, until I find a sample of that, not tagging as Wikimedia-production-error yet.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Krinkle renamed this task from Investigate rare "DBConnectionError: No working replica DB server" from wfGetDB (in load.php) to Investigate rare "DBConnectionError: No working replica DB server".Jun 1 2019, 11:27 PM
Krinkle removed a project: MediaWiki-ResourceLoader.

Still seen, though rare (few times per week, sometimes multiple times per day).

  • Request ID: XNHTGgrAIEwAACy47A4AAABY
  • Request URL: enwiki GET /wiki/Special:BlankPage
  • timestamp: 2019-05-07T18:49:23+00:00
DBConnectionError:  Cannot access the database: No working replica DB server: Unknown error (10.192.0.4)

#0 /srv/mediawiki/php-1.34.0-wmf.3/includes/libs/rdbms/loadbalancer/LoadBalancer.php(749): Wikimedia\Rdbms\LoadBalancer->reportConnectionError()
#1 /srv/mediawiki/php-1.34.0-wmf.3/includes/GlobalFunctions.php(2600): Wikimedia\Rdbms\LoadBalancer->getConnection(boolean, array, boolean)
#2 /srv/mediawiki/php-1.34.0-wmf.3/extensions/PageImages/includes/PageImages.php(50): wfGetDB(integer)
#3 /srv/mediawiki/php-1.34.0-wmf.3/extensions/PageImages/includes/PageImages.php(199): PageImages::getPageImage(Title)
#4 /srv/mediawiki/php-1.34.0-wmf.3/includes/Hooks.php(174): PageImages::onBeforePageDisplay(OutputPage, SkinVector)
#5 /srv/mediawiki/php-1.34.0-wmf.3/includes/Hooks.php(234): Hooks::callHook(string, array, array, NULL, string)
#6 /srv/mediawiki/php-1.34.0-wmf.3/includes/OutputPage.php(2755): Hooks::runWithoutAbort(string, array)
#7 /srv/mediawiki/php-1.34.0-wmf.3/includes/MediaWiki.php(872): OutputPage->output(boolean)
#8 /srv/mediawiki/php-1.34.0-wmf.3/includes/MediaWiki.php(884): Closure$MediaWiki::main()
#9 /srv/mediawiki/php-1.34.0-wmf.3/includes/MediaWiki.php(515): MediaWiki->main()
#10 /srv/mediawiki/php-1.34.0-wmf.3/index.php(42): MediaWiki->run()
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:07 PM
mobrovac subscribed.

@Krinkle I see you modified wfGetDB() in I556627dc6486e6f6539d1d2dd5aa6f009eff341e. Does that affect this task? Is this still relevant?

@Krinkle I see you modified wfGetDB() in I556627dc. Does that affect this task? Is this still relevant?

It's still relevant. The issue isn't with the wfGetDB function but with the Database interface underneath. It appears that, while rare, it happens on a regular basis that MediaWiki throws fatally "Cannot access the database: No working replica DB server" upon trying to connect to the database (by any means, wfGetDB, LBFactory, ConnRef etc.).

To answer myself, it has happened only once in the last fifteen days, there were 6 log entries in a two-second span, all originating from mw1251 and erroring out for db1123's IP (i.e. s3). I'd say keep the task in the backlog and act on it if this becomes a problem.

mobrovac lowered the priority of this task from Medium to Low.Sep 3 2019, 12:43 PM
mobrovac changed the subtype of this task from "Production Error" to "Task".
Recent samples
  • 2020-02-06T18:49:38 mw1247 gl.wikipedia.org GET /w/index.php?title= … XjxfvgpAMFIAAK9zcH4AAAAV
    • exception: Wikimedia\Rdbms\DBConnectionError: Cannot access the database: No working replica DB server: Unknown error (10.64.16.187)
    • trace: ExternalStoreDB->fetchBlob / SqlBlobStore->getBlob / Revision->getContent / RawAction->getRawText
    • db: glwiki (s3)
  • 2020-02-06T19:31:15 mw1270 commons.wikimedia.org GET /w/index.php?title= … XjxpgApAICgAALiHo40AAADI
    • exception:: Wikimedia\Rdbms\DBConnectionError: Cannot access the database: No working replica DB server: Unknown error (10.64.16.187)
    • trace: ExternalStoreDB->fetchBlob / SqlBlobStore->getBlob / Revision->getContent / ImagePage->view
    • db: commonswiki (s4)

While the above are from different wikis, they both are about blobs stored on 10.64.16.187 (es1014) which is part of the es3 "external store" cluster at WMF.

The es3 cluster contains two replicas: es1014 (10.64.16.187) and es1019 (10.64.48.116).

See Logstash query for all MW errors about these two hosts on Feb 6

Timeline
  • 2020-02-06 18:49 / GET for glwiki / XjxfvgpAMFIAAK9zcH4AAAAV
    • Error connecting to 10.64.16.187 as user wikiuser: :real_connect(): (HY000/2002): Connection timed out
    • Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
    • fatal DBConnectionError: Cannot access the database: No working replica DB server: Unknown error (10.64.16.187)
  • 2020-02-06 19:31 / GET for enwiki / XjxpgApAADkAAH9pjM4AAAEC
    • Error connecting to 10.64.16.187 as user wikiuser: :real_connect(): (HY000/2002): Connection timed out
    • Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
    • not fatal - presumably recovered by trying a different replica
  • 2020-02-0 19:31 / GET for dewiki / XjxpgApAAEMAABDlG4wAAACQ
    • Error connecting to 10.64.16.187 as user wikiuser: :real_connect(): (HY000/2002): Connection timed out
    • Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
    • not fatal - presumably recovered by trying a different replica
  • 2020-02-06 19:31 / GET for commonswiki / XjxpgApAICgAALiHo40AAADI
    • Error connecting to 10.64.16.187 as user wikiuser: :real_connect(): (HY000/2002): Connection timed out
    • Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
    • fatal DBConnectionError: Cannot access the database: No working replica DB server: Unknown error (10.64.16.187)
  • 2020-02-06 21:48 / GET for enwiki / XjyJpgpAAEMAAE-EYVMAAACF
    • Error connecting to 10.64.16.187 as user wikiuser: :real_connect(): (HY000/2002): Connection timed out
    • Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
    • not fatal

Closing for now because following a conversation with @Marostegui and @jcrespo, turns out es1019 was actually depooled during that hour, which means there wasn't another host for MW to fallback to or retry with. I'll create a separate task about enforcing in dbctl that there must always be two hosts pooled for MW.

There's also vaguely known issues with MW's retry logic and state tracking more generally, but at least for this task it seemed to not play a role.