db1128, a database master, just crashed due to faulty memory.
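According to racadm lclog view, the cause is a bad DIMM, DIMM_A6 in particular. This is not its first failure: it logged a multi-bit error on 2022-03-17 (which didn't trigger a reboot), and on 2022-02-27 it exceeded the correctable error rate (so that first error was still a correctable one). Relevant log entries, newest first: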
--------------------------------------------------------------------------------
SeqNumber = 165
Message ID = SYS1003
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2022-05-26 10:36:17
Message = System CPU Resetting.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 164
Message ID = MEM0001
Category = System
AgentID = SEL
Severity = Critical
Timestamp = 2022-05-26 10:35:47
Message = Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.
Message Arg 1 = DIMM_A6
RawEventData = 0x12,0x00,0x02,0x02,0x58,0x8F,0x62,0xB1,0x00,0x04,0x0C,0x02,0x6F,0x11,0xE0,0x20
FQDD = DIMM.Socket.A6
--------------------------------------------------------------------------------
SeqNumber = 162
Message ID = MEM0001
Category = System
AgentID = SEL
Severity = Critical
Timestamp = 2022-03-17 16:00:11
Message = Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.
Message Arg 1 = DIMM_A6
RawEventData = 0x11,0x00,0x02,0x0B,0x5B,0x33,0x62,0xB1,0x00,0x04,0x0C,0x02,0x6F,0x11,0xE0,0x20
FQDD = DIMM.Socket.A6
--------------------------------------------------------------------------------
SeqNumber = 161
Message ID = MEM0702
Category = System
AgentID = SEL
Severity = Critical
Timestamp = 2022-02-27 10:57:24
Message = Correctable memory error rate exceeded for DIMM_A6.
Message Arg 1 = DIMM_A6
RawEventData = 0x10,0x00,0x02,0x14,0x59,0x1B,0x62,0xB1,0x00,0x04,0x0C,0x1B,0x07,0x12,0xE0,0x20
FQDD = DIMM.Socket.A6
--------------------------------------------------------------------------------
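For anyone who wants to check other hosts for the same pattern, here is a rough sketch of how the memory events for one DIMM could be pulled out of saved lclog output. It assumes the output has one "Key = Value" field per line with records separated by dashed lines; the function names parse_lclog and memory_events are made up for this example and are not part of racadm or any existing tooling.

```python
import re


def parse_lclog(text):
    """Split saved lclog output into a list of dicts, one per record.

    Assumes one "Key = Value" field per line and records separated by
    lines consisting only of dashes.
    """
    records = []
    for block in re.split(r"^-{10,}\s*$", text, flags=re.MULTILINE):
        record = {}
        for line in block.splitlines():
            # Split on the first "=" only, so values may contain "=".
            key, sep, value = line.partition("=")
            if sep:
                record[key.strip()] = value.strip()
        if record:
            records.append(record)
    return records


def memory_events(records, dimm):
    """Return (timestamp, message ID, message) tuples for MEM* events
    on the given DIMM, oldest first."""
    hits = [
        (r.get("Timestamp", ""), r.get("Message ID", ""), r.get("Message", ""))
        for r in records
        if r.get("Message ID", "").startswith("MEM")
        and r.get("Message Arg 1") == dimm
    ]
    # Timestamps are "YYYY-MM-DD HH:MM:SS", so string order is time order.
    return sorted(hits)
```

Calling memory_events(parse_lclog(saved_output), "DIMM_A6") on the entries above should list the 2022-02-27, 2022-03-17 and 2022-05-26 events in that order, which makes the escalation from correctable to multi-bit errors easy to see.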
Can we get new memory?