cp5012 crashed on 2020-04-28 at 01:28:56. Prior to the crash, several memory errors were logged:
Apr 28 01:25:15 cp5012 kernel: [2378102.616593] {12}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Apr 28 01:25:15 cp5012 kernel: [2378102.616595] {12}[Hardware Error]: It has been corrected by h/w and requires no further action
Apr 28 01:25:15 cp5012 kernel: [2378102.616597] {12}[Hardware Error]: event severity: corrected
Apr 28 01:25:15 cp5012 kernel: [2378102.616599] {12}[Hardware Error]: Error 0, type: corrected
Apr 28 01:25:15 cp5012 kernel: [2378102.616599] {12}[Hardware Error]: fru_text: A3
Apr 28 01:25:15 cp5012 kernel: [2378102.616601] {12}[Hardware Error]: section_type: memory error
Apr 28 01:25:15 cp5012 kernel: [2378102.616602] {12}[Hardware Error]: error_status: 0x0000000000000400
Apr 28 01:25:15 cp5012 kernel: [2378102.616603] {12}[Hardware Error]: physical_address: 0x0000003d00607fc0
Apr 28 01:25:15 cp5012 kernel: [2378102.616606] {12}[Hardware Error]: node: 0 card: 2 module: 0 rank: 0 bank: 1 row: 59392 column: 504
Apr 28 01:25:15 cp5012 kernel: [2378102.616607] {12}[Hardware Error]: error_type: 2, single-bit ECC
The SEL shows that cp5012 has been suffering from RAM issues for a while now:
-------------------------------------------------------------------------------
Record: 19
Date/Time: 03/25/2020 09:04:32
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record: 20
Date/Time: 03/25/2020 09:05:04
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record: 21
Date/Time: 04/28/2020 01:21:00
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record: 22
Date/Time: 04/28/2020 01:23:47
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
[...]
-------------------------------------------------------------------------------
Record: 24
Date/Time: 04/28/2020 01:28:10
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record: 25
Date/Time: 04/28/2020 01:28:10
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------
Record: 26
Date/Time: 04/28/2020 01:28:10
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
Record: 27
Date/Time: 04/28/2020 01:28:10
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
-------------------------------------------------------------------------------
Record: 28
Date/Time: 04/28/2020 01:28:10
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record: 29
Date/Time: 04/28/2020 01:28:10
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
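To see at a glance which DIMMs are affected, SEL output of this shape could be tallied per DIMM. The following is only a rough sketch: the function name `tally_dimm_events`, the regular expressions, and the "correctable" / "multi-bit" labels are assumptions made for illustration, and `SAMPLE_SEL` holds just four of the records quoted above (19, 20, 24 and 28), not the full log.

```python
import re
from collections import Counter

# Four records copied from the SEL excerpt above (19, 20, 24, 28); not the full log.
SAMPLE_SEL = """\
Record: 19
Date/Time: 03/25/2020 09:04:32
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record: 20
Date/Time: 03/25/2020 09:05:04
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record: 24
Date/Time: 04/28/2020 01:28:10
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record: 28
Date/Time: 04/28/2020 01:28:10
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
"""

# Assumed record layout, inferred only from the excerpt in this report.
RECORD_RE = re.compile(
    r"Record:\s*(?P<record>\d+)\s+"
    r"Date/Time:\s*(?P<when>\S+\s+\S+)\s+"
    r"Source:\s*(?P<source>\S+)\s+"
    r"Severity:\s*(?P<severity>\S+)\s+"
    r"Description:\s*(?P<description>.+?)\.(?=\s|$)"
)
DIMM_RE = re.compile(r"DIMM_\w+")


def tally_dimm_events(sel_text):
    """Count SEL memory events per (DIMM, kind) pair."""
    counts = Counter()
    for match in RECORD_RE.finditer(sel_text):
        description = match.group("description")
        if description.startswith("Multi-bit"):
            kind = "multi-bit"
        elif description.startswith("Correctable"):
            kind = "correctable"
        else:
            kind = "other"
        # A description may name one or more DIMM locations.
        for dimm in DIMM_RE.findall(description):
            counts[(dimm, kind)] += 1
    return counts


if __name__ == "__main__":
    for (dimm, kind), n in sorted(tally_dimm_events(SAMPLE_SEL).items()):
        print(f"{dimm}: {n} {kind} event(s)")
```

Because the patterns only rely on whitespace between fields, the same function should also cope with SEL text pasted on a single line, though that has only been checked against the excerpt in this report.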