
MediaWiki: "host db1062 is unreachable" (Connection refused)
Closed, Resolved · Public · PRODUCTION ERROR

Description

DBConnection ERROR
Error connecting to db1062 as user wikiuser: :real_connect(): (HY000/2002): Connection refused
DBReplication ERROR
Wikimedia\Rdbms\LoadMonitor::getServerStates: host db1062 is unreachable
(trace)
#0 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/database/DatabaseMysqlBase.php(143): Wikimedia\Rdbms\Database->newExceptionAfterConnectError(':real_connect()...')
#1 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/database/Database.php(319): Wikimedia\Rdbms\DatabaseMysqlBase->open('db1062',  …)
…
#4 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1207): Wikimedia\Rdbms\LoadBalancer->reallyOpenConnection(5, Object(Wikimedia\Rdbms\DatabaseDomain), Array)
#5 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/loadbalancer/LoadBalancer.php(928): Wikimedia\Rdbms\LoadBalancer->getForeignConnection(5, '', 3)
…
#8 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/loadbalancer/LoadBalancer.php(2211): Wikimedia\Rdbms\LoadMonitor->getLagTimes(Array, 'centralauth')
…
#29 /srv/mediawiki/php-1.35.0-wmf.5/includes/block/BlockManager.php(470): User->getBlock()
#30 /srv/mediawiki/php-1.35.0-wmf.5/includes/MediaWiki.php(751): MediaWiki\Block\BlockManager->trackBlockWithCookie(Object(User), Object(WebResponse))
#31 /srv/mediawiki/php-1.35.0-wmf.5/includes/api/ApiMain.php(565): MediaWiki::preOutputCommit(Object(DerivativeContext))
#32 /srv/mediawiki/php-1.35.0-wmf.5/includes/api/ApiMain.php(510): ApiMain->executeActionWithErrorHandling()
#33 /srv/mediawiki/php-1.35.0-wmf.5/api.php(83): ApiMain->execute()
Impact

About 4000 hits per hour, every hour, since roughly 24 hours ago (4 Dec 2019 ~03:00 UTC).

I've not been able to correlate this with fatal errors from MediaWiki, which suggests retries are working and holding us up, but likely at the cost of slower DB connections from MW for all s7 wiki databases.

Note, though, that while this is an s7 database host, it appears to be hit during web requests for all other wikis as well (e.g. enwiki, eswiki). This might be due to centralauth/metawiki being on s7.

Screenshot 2019-12-04 at 23.20.19.png (203 KB)

Event Timeline

Ping T239188: Decommission db1062.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2019-12-03T08:29:15Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove db1062 from config T239188 (duration: 01m 08s)

I guess that didn't quite work as intended.

According to https://noc.wikimedia.org/db.php, db1062 is still part of s7.

Less frequently, I also see MW errors in Logstash about two other s7 hosts:

[{exception_id}] {exception_url} Wikimedia\Rdbms\DBConnectionError from line 1415 of /srv/mediawiki/php-1.35.0-wmf.8/includes/libs/rdbms/loadbalancer/LoadBalancer.php:
Cannot access the database: Unknown error (10.64.48.109)

10.64.48.109 is the IP of host db1136 (also part of s7).

[{exception_id}] {exception_url} Wikimedia\Rdbms\DBConnectionError from line 1415 of /srv/mediawiki/php-1.35.0-wmf.8/includes/libs/rdbms/loadbalancer/LoadBalancer.php:
Cannot access the database: Unknown error (10.64.48.153)

10.64.48.153 is the IP of host db1094 (also part of s7).

That seems rather concerning: one s7 host is fully unreachable, and two others are (sometimes) unreachable?

Krinkle updated the task description.

From https://noc.wikimedia.org/db.php#tabs-8 and https://noc.wikimedia.org/dbconfig/eqiad.json:

db1086: 0
db1062: 0

etcd…/eqiad.json
"s7": [
  {
    "db1086": 0
  },
  {
    "db1062": 0,
    "db1079": 200,
    "db1090:3317": 1,
    "db1094": 500,
    "db1098:3317": 150,
    "db1101:3317": 150,
    "db1136": 400
  }
],

Two hosts both claim to be the master / have 0 read load assigned. The first one is correct, but the second one presumably isn't meant to be there. I don't actually know whether it's valid to have a slave with 0 load. Rejecting that in the dbctl schema might be a good follow-up to prevent future issues: one would be forced to either remove the host or assign it at least 1 load (which is effectively the same as 0, given that 0 doesn't mean no interaction at all, per the current issue).

Having said that, I also don't know under which circumstances (aside from replag wait) a 0-load slave would be used. Perhaps when the others are lagged or unused, which is quite likely given it is the only host with no other read queries, so it might win regularly when weighing load against least-lagged metadata. To be continued :)
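To make the above concrete, here is a minimal Python sketch of my mental model (this is not the actual Wikimedia\Rdbms code; check_lag and pick_replica are made-up names): the lag monitor contacts every pooled host regardless of its weight, which would explain why a weight-0 host being unreachable still hurts requests, even if the weighted read picker itself never selects it.

```
import random

# Illustrative simplification only, not the real rdbms library.
replicas = {
    "db1062": 0,    # pooled, but with weight 0
    "db1079": 200,
    "db1094": 500,
    "db1136": 400,
}

def check_lag(host):
    """Stand-in for opening a connection and reading replication lag;
    this is where the real 'Connection refused' would surface."""
    return 0.0

def get_lag_times(hosts):
    # Every *pooled* host is contacted, weight 0 or not -- matching the
    # LoadMonitor->getLagTimes(Array, ...) frame in the trace above.
    return {host: check_lag(host) for host in hosts}

def pick_replica(weights):
    # Read traffic is distributed proportionally to weight, so a
    # 0-weight host should never be picked for actual read queries.
    candidates = [h for h, w in weights.items() if w > 0]
    return random.choices(candidates, weights=[weights[h] for h in candidates])[0]

get_lag_times(replicas)        # touches db1062 even at weight 0
print(pick_replica(replicas))  # never returns db1062
```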

It seems clear that db1062 shouldn't be pooled anywhere. Ran the dbctl depool utility and it's gone from s7.
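As for the schema follow-up suggested above, something along these lines is what I have in mind. A rough sketch only: dbctl is Python-based as far as I know, but validate_section_loads is a hypothetical name, not an existing dbctl function.

```
def validate_section_loads(section, groups):
    """Hypothetical guard: reject pooled replicas carrying weight 0.

    `groups` mirrors the sectionLoads structure above: the first dict
    is the master (weight 0 is expected there), the rest are replicas.
    """
    _master, *replica_groups = groups
    for group in replica_groups:
        for host, weight in group.items():
            if weight == 0:
                raise ValueError(
                    f"{section}: replica {host} is pooled with weight 0; "
                    "depool it or assign a weight of at least 1"
                )

# This would have rejected the config that caused this incident:
try:
    validate_section_loads("s7", [
        {"db1086": 0},
        {"db1062": 0, "db1079": 200, "db1094": 500, "db1136": 400},
    ])
except ValueError as e:
    print(e)  # s7: replica db1062 is pooled with weight 0; ...
```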

Krinkle claimed this task.

Thanks. Continuing at T239877 about the other two, which seem unrelated at this point (they are still "normal" replica dbs afaik).

Thanks for tackling this and apologies for not depooling it when I should've.
db1062 stopped being a master a week ago:

--- eqiad/readOnlyBySection live
+++ eqiad/readOnlyBySection generated
@@ -1,3 +1 @@
-{
-  "s7": "Maintenance till 06:30AM UTC T238044"
-}
+{}
--- eqiad/sectionLoads live
+++ eqiad/sectionLoads generated
@@ -78,25 +78,25 @@
   "db1085": 300,
   "db1088": 500,
   "db1093": 400,
   "db1096:3316": 1,
   "db1098:3316": 1,
   "db1113:3316": 1
  }
 ],
 "s7": [
  {
-  "db1062": 0
+  "db1086": 0
  },
  {
+  "db1062": 0,
   "db1079": 200,
-  "db1086": 0,
   "db1090:3317": 1,
   "db1094": 500,
   "db1098:3317": 150,
   "db1101:3317": 150,
   "db1136": 400
  }
 ],
 "s8": [
  {
   "db1109": 0

We normally depool it straight away (that is, removing it from the list of hosts, not leaving it with weight 0), but it looks like I forgot to do so at that same moment.
Yesterday, as part of the decommissioning process, I stopped MySQL and removed the host from the MW config: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/554243/
However, as we normally would have depooled it from dbctl by then, I assumed it was depooled, but it wasn't.

I have amended our master switchover checklists to reflect that the old master needs to be depooled from dbctl (if it will go into maintenance/decommissioning), so this doesn't happen again.

Thank you all for solving this, and sorry again for the inconvenience.