mw1280 went down over the weekend. The serial console is stuck and "racadm getsel" shows "Critical: CPU 1 machine check error detected". I've depooled the server, so you can power it down at any time for further inspection or replacement of the CPU.
Pasting the racadm log before I clear it:
Record: 78
Date/Time: 05/27/2018 04:42:25
Source: system
Severity: Critical
Description: CPU 1 machine check error detected.
Record: 79
Date/Time: 05/27/2018 04:42:26
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
Record: 80
Date/Time: 05/27/2018 04:42:26
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
Record: 81
Date/Time: 05/27/2018 04:42:26
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
Record: 82
Date/Time: 05/27/2018 04:42:26
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
Record: 83
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Ok
Description: A problem was detected related to the previous server boot.
Record: 84
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Critical
Description: CPU 1 machine check error detected.
Record: 85
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
Record: 86
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
Record: 87
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
Record: 88
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
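For anyone who wants to pull the critical entries out of a saved copy of this log, here is a rough sketch. It assumes the text has the same "Key: value" layout as the paste above (Record, Date/Time, Source, Severity, Description); the function names `parse_sel`, `critical_records` and `critical_since` are made up for illustration and are not part of racadm. It only parses text and does not call racadm itself.

```python
from datetime import datetime

# A few records copied from the paste above, used as sample input.
SAMPLE = """
Record: 78
Date/Time: 05/27/2018 04:42:25
Source: system
Severity: Critical
Description: CPU 1 machine check error detected.
Record: 79
Date/Time: 05/27/2018 04:42:26
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
Record: 83
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Ok
Description: A problem was detected related to the previous server boot.
Record: 84
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Critical
Description: CPU 1 machine check error detected.
"""


def parse_sel(text):
    """Turn 'Key: value' lines into one dict per SEL record."""
    records = []
    current = None
    for line in text.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank or unrecognised line
        key, value = key.strip(), value.strip()
        if key == "Record":
            current = {"Record": int(value)}
            records.append(current)
        elif current is not None:
            if key == "Date/Time":
                # Assumed format: MM/DD/YYYY HH:MM:SS, as in the paste.
                value = datetime.strptime(value, "%m/%d/%Y %H:%M:%S")
            current[key] = value
    return records


def critical_records(records):
    """Return only the records whose severity is Critical."""
    return [r for r in records if r.get("Severity") == "Critical"]


def critical_since(records, cutoff):
    """Return Critical records logged strictly after the cutoff time."""
    return [r for r in critical_records(records) if r["Date/Time"] > cutoff]


records = parse_sel(SAMPLE)
print([r["Record"] for r in critical_records(records)])  # [78, 84]
```

Something like `critical_since(records, cutoff)` with the time of the hardware maintenance as the cutoff would be one way to check whether any new machine check errors were logged afterwards.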
@Cmjohnson I'm not seeing a new CPU error logged in "racadm getsel", but the server is also still depooled and thus not receiving traffic, and the error may only show up under load. Unless you wanna do additional tests, I would go ahead and repool it.
Mentioned in SAL (#wikimedia-operations) [2018-06-07T12:15:41Z] <moritzm> repooled mw1280 after hardware maintenance (T195734)
No new errors have been logged in SEL and the server appears stable, closing the task.