db1128 host (containing m1 databases) crashed
Closed, Resolved (Public)

Description

Looks like memory errors

Databases in m1:

Database
bacula9
cas
cas_staging
dbbackups
etherpadlite
heartbeat
information_schema
librenms
mysql
percona
performance_schema
pki
racktables
rddmarc
rt
sys

Event Timeline

According to racadm lclog view, the cause is a bad DIMM, specifically DIMM_A6. The same DIMM already reported errors on 2022-03-17 (without triggering a reboot) and on 2022-02-27 (although that first error was a correctable one).

--------------------------------------------------------------------------------
SeqNumber       = 165
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2022-05-26 10:36:17
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 164
Message ID      = MEM0001
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2022-05-26 10:35:47
Message         = Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.
Message Arg   1 = DIMM_A6
RawEventData    = 0x12,0x00,0x02,0x02,0x58,0x8F,0x62,0xB1,0x00,0x04,0x0C,0x02,0x6F,0x11,0xE0,0x20

FQDD            = DIMM.Socket.A6
--------------------------------------------------------------------------------
SeqNumber       = 162
Message ID      = MEM0001
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2022-03-17 16:00:11
Message         = Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.
Message Arg   1 = DIMM_A6
RawEventData    = 0x11,0x00,0x02,0x0B,0x5B,0x33,0x62,0xB1,0x00,0x04,0x0C,0x02,0x6F,0x11,0xE0,0x20

FQDD            = DIMM.Socket.A6
--------------------------------------------------------------------------------
SeqNumber       = 161
Message ID      = MEM0702
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2022-02-27 10:57:24
Message         = Correctable memory error rate exceeded for DIMM_A6.
Message Arg   1 = DIMM_A6
RawEventData    = 0x10,0x00,0x02,0x14,0x59,0x1B,0x62,0xB1,0x00,0x04,0x0C,0x1B,0x07,0x12,0xE0,0x20

FQDD            = DIMM.Socket.A6
--------------------------------------------------------------------------------
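To make the DIMM_A6 history above easier to extract from a longer log, here is a minimal sketch that parses entries in the format shown and decodes the RawEventData bytes. It assumes the 16 bytes follow the IPMI SEL system event record layout (record ID, record type, 4-byte little-endian timestamp, generator ID, EvM revision, sensor type, sensor number, event type, three event data bytes) and that the timestamp can be read as UTC. The function names parse_lclog and decode_raw_event and the shortened SAMPLE are illustrative, not part of racadm.

```python
from datetime import datetime, timezone

# Two entries copied from the log above, shortened to the fields used here.
SAMPLE = """\
--------------------------------------------------------------------------------
SeqNumber       = 162
Message ID      = MEM0001
Severity        = Critical
Timestamp       = 2022-03-17 16:00:11
Message         = Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.
Message Arg   1 = DIMM_A6
RawEventData    = 0x11,0x00,0x02,0x0B,0x5B,0x33,0x62,0xB1,0x00,0x04,0x0C,0x02,0x6F,0x11,0xE0,0x20

FQDD            = DIMM.Socket.A6
--------------------------------------------------------------------------------
SeqNumber       = 161
Message ID      = MEM0702
Severity        = Critical
Timestamp       = 2022-02-27 10:57:24
Message         = Correctable memory error rate exceeded for DIMM_A6.
Message Arg   1 = DIMM_A6
RawEventData    = 0x10,0x00,0x02,0x14,0x59,0x1B,0x62,0xB1,0x00,0x04,0x0C,0x1B,0x07,0x12,0xE0,0x20

FQDD            = DIMM.Socket.A6
--------------------------------------------------------------------------------
"""


def parse_lclog(text):
    """Split lclog-style output into one dict per entry (hypothetical helper)."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("-----"):
            # A dashed line closes the entry collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Collapse the padding inside keys such as "Message Arg   1".
        current[" ".join(key.split())] = value.strip()
    if current:
        records.append(current)
    return records


def decode_raw_event(raw):
    """Decode RawEventData, ASSUMING the IPMI SEL system event record layout."""
    data = bytes(int(part, 16) for part in raw.split(","))
    if len(data) != 16:
        raise ValueError("expected 16 bytes, got %d" % len(data))
    seconds = int.from_bytes(data[3:7], "little")
    return {
        "record_id": int.from_bytes(data[0:2], "little"),
        "record_type": data[2],
        # Assumption: the controller clock is UTC.
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "sensor_type": data[10],
        "sensor_number": data[11],
        "event_type": data[12] & 0x7F,
        "event_data": data[13:16],
    }


# Keep only the entries that point at the suspect DIMM.
dimm_events = [r for r in parse_lclog(SAMPLE) if r.get("FQDD") == "DIMM.Socket.A6"]

for rec in dimm_events:
    decoded = decode_raw_event(rec["RawEventData"])
    print(rec["SeqNumber"], rec["Message ID"], rec["Severity"], decoded["timestamp"])
```

For entries 162 and 161, the timestamp decoded from the raw bytes agrees with the Timestamp field printed in the log, which supports the assumed layout for this data; other firmware versions may differ.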

Adding ops-eqiad for the hardware part of it.

We need to build a new host and switch over from db1128 so we can replace its memory.

As an action item for later, we should check why the page didn't have the #page hashtag on the IRC alert:

icinga-wm| PROBLEM - Host db1128 is DOWN: PING CRITICAL - Packet loss = 100%

I am going to replace db1128 with an s4 host for now.

That should take a few hours, and afterwards I will do an emergency m1 switchover. I don't want to leave db1128 running like this over the weekend.

Mentioned in SAL (#wikimedia-operations) [2022-05-26T10:00:14Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1164 T309286', diff saved to https://phabricator.wikimedia.org/P28585 and previous config saved to /var/cache/conftool/dbconfig/20220526-100013-marostegui.json

Change 799876 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] instances.yaml: Remove db1164 from dbctl

https://gerrit.wikimedia.org/r/799876

Change 799876 merged by Marostegui:

[operations/puppet@production] instances.yaml: Remove db1164 from dbctl

https://gerrit.wikimedia.org/r/799876

Change 799883 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Move db1164 to m1

https://gerrit.wikimedia.org/r/799883

Mentioned in SAL (#wikimedia-operations) [2022-05-26T10:05:38Z] <marostegui> Stop mysql on db1117:3321 to clone db1164 T309286

Change 799883 merged by Marostegui:

[operations/puppet@production] mariadb: Move db1164 to m1

https://gerrit.wikimedia.org/r/799883

Change 799894 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] Update dbackups check and statistics to use db1164 instead of db1128

https://gerrit.wikimedia.org/r/799894

I created a specific task for DC-Ops: T309291

Change 799901 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariadb: Failover m1 primary from db1128 to db1164

https://gerrit.wikimedia.org/r/799901

Change 799915 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbproxy: Add db1164 as the m1 eqiad secondary

https://gerrit.wikimedia.org/r/799915

db1164 is up and running as a replacement master.

Change 799915 merged by Marostegui:

[operations/puppet@production] dbproxy: Add db1164 as the m1 eqiad secondary

https://gerrit.wikimedia.org/r/799915

Change 799901 merged by Marostegui:

[operations/puppet@production] mariadb: Failover m1 primary from db1128 to db1164

https://gerrit.wikimedia.org/r/799901

Change 799894 merged by Jcrespo:

[operations/puppet@production] Update dbackups check and statistics to use db1164 instead of db1128

https://gerrit.wikimedia.org/r/799894

I am going to close this as fixed, as the pending follow-ups have their own tasks:

Thanks everyone, especially Jaime for all the help mitigating this crash.