db1128 faulty memory
Closed, Resolved, Public

Description

db1128 (a database master) just crashed due to faulty memory.

According to racadm lclog view, the faulty DIMM is DIMM_A6. The same DIMM already logged a multi-bit error on 2022-03-17 (which did not trigger a reboot), and on 2022-02-27 it exceeded the correctable error rate (that first error was a correctable one).

--------------------------------------------------------------------------------
SeqNumber       = 165
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2022-05-26 10:36:17
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 164
Message ID      = MEM0001
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2022-05-26 10:35:47
Message         = Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.
Message Arg   1 = DIMM_A6
RawEventData    = 0x12,0x00,0x02,0x02,0x58,0x8F,0x62,0xB1,0x00,0x04,0x0C,0x02,0x6F,0x11,0xE0,0x20

FQDD            = DIMM.Socket.A6
--------------------------------------------------------------------------------
SeqNumber       = 162
Message ID      = MEM0001
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2022-03-17 16:00:11
Message         = Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.
Message Arg   1 = DIMM_A6
RawEventData    = 0x11,0x00,0x02,0x0B,0x5B,0x33,0x62,0xB1,0x00,0x04,0x0C,0x02,0x6F,0x11,0xE0,0x20

FQDD            = DIMM.Socket.A6
--------------------------------------------------------------------------------
SeqNumber       = 161
Message ID      = MEM0702
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2022-02-27 10:57:24
Message         = Correctable memory error rate exceeded for DIMM_A6.
Message Arg   1 = DIMM_A6
RawEventData    = 0x10,0x00,0x02,0x14,0x59,0x1B,0x62,0xB1,0x00,0x04,0x0C,0x1B,0x07,0x12,0xE0,0x20

FQDD            = DIMM.Socket.A6
--------------------------------------------------------------------------------

Can we get new memory?

Event Timeline

Sounds good @wiki_willy - let us know when we'd need to schedule some downtime for the host.
Thanks!

Chatted with @Marostegui and we are planning the downtime for tomorrow, 3 June.

Mentioned in SAL (#wikimedia-operations) [2022-06-03T05:19:01Z] <marostegui> Stop mysql on db1128 for on-site maintenance T309291

@Cmjohnson db1128 is now off and ready for you to change its DIMM anytime. Once done please bring it back and I will start mysql etc.

Thanks!

Replaced the DIMM and updated the BIOS.