Looks like memory errors
Databases in m1:
Database
bacula9
cas
cas_staging
dbbackups
etherpadlite
heartbeat
information_schema
librenms
mysql
percona
performance_schema
pki
racktables
rddmarc
rt
sys
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Marostegui | T309286 db1128 host (containing m1 databases) crashed
Resolved | | Marostegui | T309296 Failover m1 primary db from db1128 to db1164
Resolved | | Marostegui | T309303 Move db1128 from m1 (misc) to s1 (mediawiki)
Resolved | | Cmjohnson | T309291 db1128 faulty memory
According to racadm lclog view, it's a bad DIMM, DIMM_A6 in particular. It had already failed on 2022-03-17 (without triggering a reboot) and on 2022-02-27 (although that first error was a correctable one).
--------------------------------------------------------------------------------
SeqNumber = 165
Message ID = SYS1003
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2022-05-26 10:36:17
Message = System CPU Resetting.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 164
Message ID = MEM0001
Category = System
AgentID = SEL
Severity = Critical
Timestamp = 2022-05-26 10:35:47
Message = Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.
Message Arg 1 = DIMM_A6
RawEventData = 0x12,0x00,0x02,0x02,0x58,0x8F,0x62,0xB1,0x00,0x04,0x0C,0x02,0x6F,0x11,0xE0,0x20
FQDD = DIMM.Socket.A6
--------------------------------------------------------------------------------
SeqNumber = 162
Message ID = MEM0001
Category = System
AgentID = SEL
Severity = Critical
Timestamp = 2022-03-17 16:00:11
Message = Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.
Message Arg 1 = DIMM_A6
RawEventData = 0x11,0x00,0x02,0x0B,0x5B,0x33,0x62,0xB1,0x00,0x04,0x0C,0x02,0x6F,0x11,0xE0,0x20
FQDD = DIMM.Socket.A6
--------------------------------------------------------------------------------
SeqNumber = 161
Message ID = MEM0702
Category = System
AgentID = SEL
Severity = Critical
Timestamp = 2022-02-27 10:57:24
Message = Correctable memory error rate exceeded for DIMM_A6.
Message Arg 1 = DIMM_A6
RawEventData = 0x10,0x00,0x02,0x14,0x59,0x1B,0x62,0xB1,0x00,0x04,0x0C,0x1B,0x07,0x12,0xE0,0x20
FQDD = DIMM.Socket.A6
--------------------------------------------------------------------------------
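For anyone wanting to spot a recurring DIMM fault like this earlier, here is a rough sketch of how the lifecycle log output could be grouped by DIMM. It assumes the text has already been captured from racadm lclog view and that each record prints one "Key = Value" field per line between dashed separators. The function names parse_lclog and memory_events_by_dimm are made up for illustration, and the sample is abbreviated from the log pasted above.

```python
import re
from collections import defaultdict

# Abbreviated sample: a subset of the fields from the records pasted in this task.
SAMPLE = """\
--------------------------------------------------------------------------------
SeqNumber = 165
Message ID = SYS1003
Severity = Information
Timestamp = 2022-05-26 10:36:17
Message = System CPU Resetting.
--------------------------------------------------------------------------------
SeqNumber = 164
Message ID = MEM0001
Severity = Critical
Timestamp = 2022-05-26 10:35:47
Message = Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.
Message Arg 1 = DIMM_A6
--------------------------------------------------------------------------------
SeqNumber = 162
Message ID = MEM0001
Severity = Critical
Timestamp = 2022-03-17 16:00:11
Message = Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.
Message Arg 1 = DIMM_A6
--------------------------------------------------------------------------------
SeqNumber = 161
Message ID = MEM0702
Severity = Critical
Timestamp = 2022-02-27 10:57:24
Message = Correctable memory error rate exceeded for DIMM_A6.
Message Arg 1 = DIMM_A6
--------------------------------------------------------------------------------
"""


def parse_lclog(text):
    """Split the log on dashed separator lines and return one dict per record."""
    records = []
    for block in re.split(r"^-{10,}[ \t]*$", text, flags=re.MULTILINE):
        record = {}
        for line in block.splitlines():
            # Split on the first " = " only, so values containing "=" survive.
            key, sep, value = line.partition(" = ")
            if sep:
                record[key.strip()] = value.strip()
        if record:
            records.append(record)
    return records


def memory_events_by_dimm(records):
    """Group MEM* events by DIMM, oldest first, as (timestamp, message id) pairs."""
    events = defaultdict(list)
    for rec in records:
        if rec.get("Message ID", "").startswith("MEM"):
            dimm = rec.get("Message Arg 1", "unknown")
            events[dimm].append((rec.get("Timestamp", ""), rec["Message ID"]))
    for dimm_events in events.values():
        dimm_events.sort()  # timestamps are YYYY-MM-DD HH:MM:SS, so string order works
    return dict(events)


if __name__ == "__main__":
    for dimm, dimm_events in memory_events_by_dimm(parse_lclog(SAMPLE)).items():
        print(dimm, len(dimm_events), "memory events")
        for timestamp, message_id in dimm_events:
            print("  ", timestamp, message_id)
```

A DIMM showing up more than once in that grouping, as DIMM_A6 does here, would be a reasonable trigger for opening a hardware task before the host crashes.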
Adding ops-eqiad for the hardware part of it.
As an action item for later, we should check why the page didn't have the #page hashtag on the IRC alert:
icinga-wm| PROBLEM - Host db1128 is DOWN: PING CRITICAL - Packet loss = 100%
Should take a few hours, and afterwards I will do an emergency m1 switchover. I don't want to leave db1128 running like this over the weekend.
Mentioned in SAL (#wikimedia-operations) [2022-05-26T10:00:14Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1164 T309286', diff saved to https://phabricator.wikimedia.org/P28585 and previous config saved to /var/cache/conftool/dbconfig/20220526-100013-marostegui.json
Change 799876 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] instances.yaml: Remove db1164 from dbctl
Change 799876 merged by Marostegui:
[operations/puppet@production] instances.yaml: Remove db1164 from dbctl
Change 799883 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] mariadb: Move db1164 to m1
Mentioned in SAL (#wikimedia-operations) [2022-05-26T10:05:38Z] <marostegui> Stop mysql on db1117:3321 to clone db1164 T309286
Change 799883 merged by Marostegui:
[operations/puppet@production] mariadb: Move db1164 to m1
Change 799894 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] Update dbackups check and statistics to use db1164 instead of db1128
Change 799901 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] mariadb: Failover m1 primary from db1128 to db1164
Change 799915 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] dbproxy: Add db1164 as the m1 eqiad secondary
Change 799915 merged by Marostegui:
[operations/puppet@production] dbproxy: Add db1164 as the m1 eqiad secondary
Change 799901 merged by Marostegui:
[operations/puppet@production] mariadb: Failover m1 primary from db1128 to db1164
Change 799894 merged by Jcrespo:
[operations/puppet@production] Update dbackups check and statistics to use db1164 instead of db1128