Host mw2286 (codfw row D / D4) is stuck after a reboot. The host did not recover properly from the reboot (no SSH or network connectivity on the main interface). It still responds on its management interface, mw2286.mgmt.codfw.wmnet. A racadm serveraction powercycle did not bring the server back up either.
The system event log (racadm getsel) lists several critical memory errors, all on DIMM_A1:
/admin1-> racadm getsel
Record:      1
Date/Time:   02/19/2018 17:02:49
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   03/29/2018 20:01:21
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A1.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   03/29/2018 20:01:21
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A1.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   08/17/2018 10:27:03
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      5
Date/Time:   08/17/2018 10:27:03
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   02/01/2021 20:26:06
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A1.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   02/01/2021 20:26:06
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A1.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   04/25/2022 14:30:42
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      9
Date/Time:   04/25/2022 14:30:42
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   04/25/2022 15:04:47
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      11
Date/Time:   04/25/2022 15:04:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   04/25/2022 15:42:32
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      13
Date/Time:   04/25/2022 15:42:32
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
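For triage, a log like the one above can be summarised per DIMM. The following is a minimal sketch, not an existing tool: the names parse_sel, critical_dimm_counts and SAMPLE are hypothetical, and the regular expression assumes only the field labels visible in this getsel output (Record, Date/Time, Source, Severity, Description) with a dashed separator between records. SAMPLE holds records 2, 3 and 5 copied from the log above.

```python
import re
from collections import Counter

# One SEL record; tolerant of fields being on separate lines or run together.
# Assumption: records are delimited by a run of at least five dashes.
RECORD_RE = re.compile(
    r"Record:\s*(?P<record>\d+)\s+"
    r"Date/Time:\s*(?P<when>\S+\s+\S+)\s+"
    r"Source:\s*(?P<source>\S+)\s+"
    r"Severity:\s*(?P<severity>\S+)\s+"
    r"Description:\s*(?P<description>.*?)\s*(?=-{5,}|\Z)",
    re.DOTALL,
)

# DIMM slot names as they appear in the descriptions, e.g. DIMM_A1.
DIMM_RE = re.compile(r"DIMM_[A-Z]\d+")


def parse_sel(text):
    """Return a list of dicts, one per SEL record found in text."""
    return [m.groupdict() for m in RECORD_RE.finditer(text)]


def critical_dimm_counts(records):
    """Count Critical records per DIMM slot named in the description."""
    counts = Counter()
    for rec in records:
        if rec["severity"] == "Critical":
            for dimm in DIMM_RE.findall(rec["description"]):
                counts[dimm] += 1
    return counts


# Records 2, 3 and 5 from the log above, in the flattened form it was pasted in.
SAMPLE = (
    "Record: 2 Date/Time: 03/29/2018 20:01:21 Source: system "
    "Severity: Non-Critical Description: Correctable memory error rate "
    "exceeded for DIMM_A1. "
    "---------------------------------------- "
    "Record: 3 Date/Time: 03/29/2018 20:01:21 Source: system "
    "Severity: Critical Description: Correctable memory error rate "
    "exceeded for DIMM_A1. "
    "---------------------------------------- "
    "Record: 5 Date/Time: 08/17/2018 10:27:03 Source: system "
    "Severity: Critical Description: Multi-bit memory errors detected on a "
    "memory device at location(s) DIMM_A1. "
    "----------------------------------------"
)

if __name__ == "__main__":
    for dimm, count in critical_dimm_counts(parse_sel(SAMPLE)).items():
        print(f"{dimm}: {count} critical record(s)")
```

Run against the three sample records, this reports two critical entries for DIMM_A1. Feeding it the full getsel output should work the same way, provided the output keeps the field labels shown here.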