
MediaWiki: "host db1062 is unreachable" (Connection refused)
Closed, Resolved · Public · PRODUCTION ERROR

Description

DBConnection ERROR
Error connecting to db1062 as user wikiuser: :real_connect(): (HY000/2002): Connection refused
DBReplication ERROR
Wikimedia\Rdbms\LoadMonitor::getServerStates: host db1062 is unreachable
(trace)
#0 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/database/DatabaseMysqlBase.php(143): Wikimedia\Rdbms\Database->newExceptionAfterConnectError(':real_connect()...')
#1 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/database/Database.php(319): Wikimedia\Rdbms\DatabaseMysqlBase->open('db1062',  …)
…
#4 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1207): Wikimedia\Rdbms\LoadBalancer->reallyOpenConnection(5, Object(Wikimedia\Rdbms\DatabaseDomain), Array)
#5 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/loadbalancer/LoadBalancer.php(928): Wikimedia\Rdbms\LoadBalancer->getForeignConnection(5, '', 3)
…
#8 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/loadbalancer/LoadBalancer.php(2211): Wikimedia\Rdbms\LoadMonitor->getLagTimes(Array, 'centralauth')
…
#29 /srv/mediawiki/php-1.35.0-wmf.5/includes/block/BlockManager.php(470): User->getBlock()
#30 /srv/mediawiki/php-1.35.0-wmf.5/includes/MediaWiki.php(751): MediaWiki\Block\BlockManager->trackBlockWithCookie(Object(User), Object(WebResponse))
#31 /srv/mediawiki/php-1.35.0-wmf.5/includes/api/ApiMain.php(565): MediaWiki::preOutputCommit(Object(DerivativeContext))
#32 /srv/mediawiki/php-1.35.0-wmf.5/includes/api/ApiMain.php(510): ApiMain->executeActionWithErrorHandling()
#33 /srv/mediawiki/php-1.35.0-wmf.5/api.php(83): ApiMain->execute()
Impact

About 4000 hits per hour, every hour, since roughly 24 hours ago (4 Dec 2019 ~03:00 UTC).

I've not been able to correlate this with fatal errors from MediaWiki, which suggests retries are working and holding us up, but likely at the cost of slower DB connections from MW for all s7 wiki databases.

Note, though, that while this is an s7 database host, it appears to be hit during web requests for all other wikis as well (e.g. enwiki, eswiki). This might be due to centralauth/metawiki being on s7.

Screenshot 2019-12-04 at 23.20.19.png (203 KB)

Event Timeline

Ping T239188: Decommission db1062.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2019-12-03T08:29:15Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove db1062 from config T239188 (duration: 01m 08s)

I guess that didn't quite work as intended.

According to https://noc.wikimedia.org/db.php, db1062 is still part of s7.

Less frequently, I also see MW errors in Logstash about two other s7 hosts:

[{exception_id}] {exception_url} Wikimedia\Rdbms\DBConnectionError from line 1415 of /srv/mediawiki/php-1.35.0-wmf.8/includes/libs/rdbms/loadbalancer/LoadBalancer.php:
Cannot access the database: Unknown error (10.64.48.109)

10.64.48.109 is the IP of host db1136 (also part of s7).

[{exception_id}] {exception_url} Wikimedia\Rdbms\DBConnectionError from line 1415 of /srv/mediawiki/php-1.35.0-wmf.8/includes/libs/rdbms/loadbalancer/LoadBalancer.php:
Cannot access the database: Unknown error (10.64.48.153)

10.64.48.153 is the IP of host db1094 (also part of s7).

That seems rather concerning: one s7 host is fully unreachable, and two others are (sometimes) unreachable?

Krinkle updated the task description.

From https://noc.wikimedia.org/db.php#tabs-8 and https://noc.wikimedia.org/dbconfig/eqiad.json:

db1086: 0
db1062: 0

etcd…/eqiad.json
"s7": [
  {
    "db1086": 0
  },
  {
    "db1062": 0,
    "db1079": 200,
    "db1090:3317": 1,
    "db1094": 500,
    "db1098:3317": 150,
    "db1101:3317": 150,
    "db1136": 400
  }
],

Two hosts both claim to be the master / have 0 read load assigned. The first one is correct, but the second one presumably isn't meant to be there. I don't actually know whether it's valid to have a slave with 0 load. Rejecting that in the dbctl schema might be a good follow-up to prevent future issues: one would be forced to either remove the host or assign it at least 1 load (which is effectively the same as 0, given that 0 doesn't mean no interaction at all, per the current issue).

Having said that, I also don't know under which circumstances (aside from replag wait) a 0-load slave would be used. Perhaps when the others are lagged or unused, which is quite likely given it is the only host with no other read queries, so it might win regularly when weighing load against least-lagged metadata. To be continued :)
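To make the above concrete, here is a minimal Python sketch of my mental model (this is not the actual Wikimedia\Rdbms code; check_lag and pick_replica are made-up names): the lag monitor contacts every pooled host regardless of its weight, which would explain why a weight-0 host being unreachable still hurts requests, even if the weighted read picker itself never selects it.

```
import random

# Illustrative simplification only, not the real rdbms library.
replicas = {
    "db1062": 0,    # pooled, but with weight 0
    "db1079": 200,
    "db1094": 500,
    "db1136": 400,
}

def check_lag(host):
    """Stand-in for opening a connection and reading replication lag;
    this is where the real 'Connection refused' would surface."""
    return 0.0

def get_lag_times(hosts):
    # Every *pooled* host is contacted, weight 0 or not -- matching the
    # LoadMonitor->getLagTimes(Array, ...) frame in the trace above.
    return {host: check_lag(host) for host in hosts}

def pick_replica(weights):
    # Read traffic is distributed proportionally to weight, so a
    # 0-weight host should never be picked for actual read queries.
    candidates = [h for h, w in weights.items() if w > 0]
    return random.choices(candidates, weights=[weights[h] for h in candidates])[0]

get_lag_times(replicas)        # touches db1062 even at weight 0
print(pick_replica(replicas))  # never returns db1062
```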

It seems clear that db1062 shouldn't be pooled anywhere. Ran the dbctl depool utility and it's gone from s7.
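As for the schema follow-up suggested above, something along these lines is what I have in mind. A rough sketch only: dbctl is Python-based as far as I know, but validate_section_loads is a hypothetical name, not an existing dbctl function.

```
def validate_section_loads(section, groups):
    """Hypothetical guard: reject pooled replicas carrying weight 0.

    `groups` mirrors the sectionLoads structure above: the first dict
    is the master (weight 0 is expected there), the rest are replicas.
    """
    _master, *replica_groups = groups
    for group in replica_groups:
        for host, weight in group.items():
            if weight == 0:
                raise ValueError(
                    f"{section}: replica {host} is pooled with weight 0; "
                    "depool it or assign a weight of at least 1"
                )

# This would have rejected the config that caused this incident:
try:
    validate_section_loads("s7", [
        {"db1086": 0},
        {"db1062": 0, "db1079": 200, "db1094": 500, "db1136": 400},
    ])
except ValueError as e:
    print(e)  # s7: replica db1062 is pooled with weight 0; ...
```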

Krinkle claimed this task.

Thanks. Continuing at T239877 about the other two, which seem unrelated at this point (they are still "normal" replica dbs afaik).

Thanks for tackling this and apologies for not depooling it when I should've.
db1062 stopped being a master a week ago:

--- eqiad/readOnlyBySection live
+++ eqiad/readOnlyBySection generated
@@ -1,3 +1 @@
-{
-  "s7": "Maintenance till 06:30AM UTC T238044"
-}
+{}
--- eqiad/sectionLoads live
+++ eqiad/sectionLoads generated
@@ -78,25 +78,25 @@
   "db1085": 300,
   "db1088": 500,
   "db1093": 400,
   "db1096:3316": 1,
   "db1098:3316": 1,
   "db1113:3316": 1
  }
 ],
 "s7": [
  {
-  "db1062": 0
+  "db1086": 0
  },
  {
+  "db1062": 0,
   "db1079": 200,
-  "db1086": 0,
   "db1090:3317": 1,
   "db1094": 500,
   "db1098:3317": 150,
   "db1101:3317": 150,
   "db1136": 400
  }
 ],
 "s8": [
  {
   "db1109": 0

We normally depool it straight away (that is, removing it from the list of hosts, not leaving it with weight 0), but it looks like I forgot to do so at that same moment.
Yesterday, as part of the decommissioning process, I stopped MySQL and removed the host from the MW config: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/554243/
However, as we normally would have depooled it from dbctl by then, I assumed it was depooled, but it wasn't.

I have amended our master switchover checklists to reflect that the old master needs to be depooled from dbctl (if it will go into maintenance/decommissioning), so this doesn't happen again.

Thank you all for solving this, and sorry again for the inconvenience.