On 2020-10-29 at 07:49 UTC the cloudcephosd1006 server stopped responding, and monitoring raised an alert.
To bring the server back online, @aborrero hard-rebooted it using the IPMI interface at 08:15 UTC.
Syslog shows nothing relevant. Entries simply stop at 07:49:01 and resume at 08:15:21 with the boot sequence:
Oct 29 07:48:01 cloudcephosd1006 CRON[49089]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Oct 29 07:49:01 cloudcephosd1006 CRON[322]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Oct 29 08:15:21 cloudcephosd1006 systemd-modules-load[743]: Inserted module 'nf_conntrack'
Oct 29 08:15:21 cloudcephosd1006 systemd-modules-load[743]: Inserted module 'ipmi_devintf'
Oct 29 08:15:21 cloudcephosd1006 lvm[746]: 2 logical volume(s) in volume group "vg0" monitored
Oct 29 08:15:21 cloudcephosd1006 systemd[1]: Started Load Kernel Modules.
The racadm interface showed some messages that might indicate hardware errors:
racadm>>racadm getsel
[..]
Record: 8
Date/Time: 10/29/2020 07:49:43
Source: system
Severity: Non-Critical
Description: Correctable Machine Check Exception detected on CPU 2.
-------------------------------------------------------------------------------
Record: 9
Date/Time: 10/29/2020 07:49:43
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 10
Date/Time: 10/29/2020 07:49:43
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 11
Date/Time: 10/29/2020 07:49:43
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 12
Date/Time: 10/29/2020 07:49:44
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 13
Date/Time: 10/29/2020 07:49:44
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 14
Date/Time: 10/29/2020 07:49:44
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 15
Date/Time: 10/29/2020 07:49:44
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 16
Date/Time: 10/29/2020 07:49:44
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 17
Date/Time: 10/29/2020 07:49:44
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 18
Date/Time: 10/29/2020 07:49:44
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 19
Date/Time: 10/29/2020 07:49:44
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 20
Date/Time: 10/29/2020 07:49:44
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 21
Date/Time: 10/29/2020 07:49:44
Source: system
Severity: Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_B1.
-------------------------------------------------------------------------------
Record: 22
Date/Time: 10/29/2020 07:49:44
Source: system
Severity: Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record: 23
Date/Time: 10/29/2020 07:49:45
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 24
Date/Time: 10/29/2020 07:49:45
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 25
Date/Time: 10/29/2020 07:49:45
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 26
Date/Time: 10/29/2020 07:49:45
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 27
Date/Time: 10/29/2020 07:49:45
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 28
Date/Time: 10/29/2020 07:49:45
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 29
Date/Time: 10/29/2020 07:49:45
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 30
Date/Time: 10/29/2020 07:49:45
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 31
Date/Time: 10/29/2020 07:49:45
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 32
Date/Time: 10/29/2020 07:49:45
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 33
Date/Time: 10/29/2020 07:49:45
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 34
Date/Time: 10/29/2020 07:49:45
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 35
Date/Time: 10/29/2020 07:51:42
Source: system
Severity: Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record: 36
Date/Time: 10/29/2020 07:51:42
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Specifically:
-------------------------------------------------------------------------------
Record: 34
Date/Time: 10/29/2020 07:49:45
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 35
Date/Time: 10/29/2020 07:51:42
Source: system
Severity: Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record: 36
Date/Time: 10/29/2020 07:51:42
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
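Most of the event log above is "OEM diagnostic event" noise, so it can help to filter a saved dump down to the entries whose severity is not `Ok`. The following is a minimal Python sketch, assuming the record layout shown in this report (`Record`, `Date/Time`, `Source`, `Severity`, `Description`, separated by dashed lines). The names `parse_sel` and `notable` are hypothetical helpers for illustration, not part of racadm, and the sample text is copied from the excerpt above.

```python
import re

# One SEL record. Fields may be separated by newlines or by single spaces,
# so the pattern works on both the raw and a flattened copy of the output.
RECORD_RE = re.compile(
    r"Record:\s*(?P<record>\d+)\s+"
    r"Date/Time:\s*(?P<when>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"Source:\s*(?P<source>\S+)\s+"
    r"Severity:\s*(?P<severity>\S+)\s+"
    r"Description:\s*(?P<description>.*?)\s*(?=-{10,}|\Z)",
    re.DOTALL,
)


def parse_sel(text):
    """Return one dict per record found in a saved `racadm getsel` dump."""
    return [m.groupdict() for m in RECORD_RE.finditer(text)]


def notable(records):
    """Keep only records whose severity is anything other than 'Ok'."""
    return [r for r in records if r["severity"] != "Ok"]


# Sample input: records 34-36 from this incident (assumed layout).
SAMPLE = """
-------------------------------------------------------------------------------
Record: 34
Date/Time: 10/29/2020 07:49:45
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 35
Date/Time: 10/29/2020 07:51:42
Source: system
Severity: Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record: 36
Date/Time: 10/29/2020 07:51:42
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
"""

if __name__ == "__main__":
    for rec in notable(parse_sel(SAMPLE)):
        print(rec["record"], rec["when"], rec["severity"], rec["description"])
```

Run against the full dump, a filter like this would leave only the Non-Critical and Critical records (8, 21, 22 and 36 in this incident), which are the ones that reference CPU 2 and DIMM_B1.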