db2173 crashed and didn't alert
Closed, Resolved (Public)

Description

On Saturday 12th November 2022, db2173 (sanitarium master) crashed:

19:42:35 <+icinga-wm> PROBLEM - Host db2173 is DOWN: PING CRITICAL - Packet loss = 100%
19:45:31 <+icinga-wm> PROBLEM - MariaDB Replica IO: s1 on db2094 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2173.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2173.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica

Apparently this didn't page anyone.
@fgiunchedi can you help us understand why we didn't get a page? Notifications are definitely enabled.

Event Timeline

Marostegui moved this task from Triage to In progress on the DBA board.

I have manually depooled this host now

Change 856199 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2173: Disable notifications

https://gerrit.wikimedia.org/r/856199

Change 856199 merged by Marostegui:

[operations/puppet@production] db2173: Disable notifications

https://gerrit.wikimedia.org/r/856199

-------------------------------------------------------------------------------
Record:      6
Date/Time:   11/12/2022 17:40:28
Source:      system
Severity:    Critical
Description: CPU 1 MEM345 VTT PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   11/12/2022 17:40:28
Source:      system
Severity:    Critical
Description: CPU 1 MEM345 VPP PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   11/12/2022 17:41:47
Source:      system
Severity:    Critical
Description: The system board Pfault fail-safe voltage is outside of range.
-------------------------------------------------------------------------------
Marostegui renamed this task from db2173 crashed to db2173 crashed and didn't alert. Nov 14 2022, 6:12 AM

Created T322988: db2173 HW errors to track the on-site hardware troubleshooting.

Change 856375 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2094: Disable notifications

https://gerrit.wikimedia.org/r/856375

Change 856375 merged by Marostegui:

[operations/puppet@production] db2094: Disable notifications

https://gerrit.wikimedia.org/r/856375

Change 856939 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Make codfw hosts ping when going down

https://gerrit.wikimedia.org/r/856939

So I think this host didn't alert because of this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/736415

Per that commit message, sanitarium masters only page in active DCs. That was a good approach while codfw wasn't active, and we don't have any clouddb hosts hanging in codfw, but we should change it now.

I have put this patch up, but it really needs careful review because I am not sure it will have the desired effect. I will ask @jbond and @fgiunchedi to review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/856939
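
To illustrate the behaviour described above, here is a minimal sketch of the paging decision before and after the proposed change. It is not the puppet code: the `Host` class, both function names, and the use of `eqiad` as the active DC are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical model of a database host (not a real puppet type)."""
    name: str
    datacenter: str
    is_sanitarium_master: bool


def should_page_before(host: Host, active_dcs: set[str]) -> bool:
    # Behaviour as described in the older commit message:
    # a sanitarium master pages only if it lives in an active DC.
    return host.is_sanitarium_master and host.datacenter in active_dcs


def should_page_after(host: Host) -> bool:
    # Intended behaviour of the new patch, as understood here:
    # a sanitarium master pages regardless of which DC is active.
    return host.is_sanitarium_master


db2173 = Host(name="db2173", datacenter="codfw", is_sanitarium_master=True)

# Assumption: only eqiad counts as active, so codfw is the passive DC.
active = {"eqiad"}

print(should_page_before(db2173, active))
print(should_page_after(db2173))
```

Under these assumptions the first check is false for db2173, which would match the missing page, and the second is true.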

Change 856939 merged by Marostegui:

[operations/puppet@production] mariadb: Make codfw hosts ping when going down

https://gerrit.wikimedia.org/r/856939

Marostegui claimed this task.

I believe merging this patch has solved this, so closing for now.

The host itself still needs to be fixed, but that has its own task: T322988