
mw1280: CPU error
Closed, Resolved (Public)

Description

mw1280 went down over the weekend. The serial console is stuck and "racadm getsel" shows "Critical: CPU 1 machine check error detected". I've depooled the server; you can power it down at any time for further inspection or replacement of the CPU.

Event Timeline

Pasting the racadm log before I clear it:

Record: 78
Date/Time: 05/27/2018 04:42:25
Source: system
Severity: Critical

Description: CPU 1 machine check error detected.

Record: 79
Date/Time: 05/27/2018 04:42:26
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 80
Date/Time: 05/27/2018 04:42:26
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 81
Date/Time: 05/27/2018 04:42:26
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 82
Date/Time: 05/27/2018 04:42:26
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 83
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Ok

Description: A problem was detected related to the previous server boot.

Record: 84
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Critical

Description: CPU 1 machine check error detected.

Record: 85
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 86
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 87
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 88
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.
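For anyone triaging a similar log, here is a minimal sketch of pulling the Critical entries out of a saved copy of the output. It assumes the text has the same `Key: value` layout as the records pasted above; the function names `parse_sel` and `critical_records` are made up for illustration and are not part of racadm or any Dell tooling.

```python
from typing import Dict, List


def parse_sel(text: str) -> List[Dict[str, str]]:
    """Split saved SEL text into one dict per record.

    Assumes each record starts with a "Record:" line and that every
    field is a single "Key: value" line, as in the paste above.
    """
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines inside or between records carry no data
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        key, value = key.strip(), value.strip()
        if key == "Record" and current:
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


def critical_records(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the records whose Severity field is exactly "Critical"."""
    return [r for r in records if r.get("Severity") == "Critical"]


# Two records copied from the log above, used as sample input.
SAMPLE = """\
Record: 83
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Ok

Description: A problem was detected related to the previous server boot.

Record: 84
Date/Time: 05/27/2018 04:46:09
Source: system
Severity: Critical

Description: CPU 1 machine check error detected.
"""

if __name__ == "__main__":
    for rec in critical_records(parse_sel(SAMPLE)):
        print(rec["Record"], rec["Date/Time"], rec["Description"])
```

Run against the full paste, this would single out records 78 and 84, the two "CPU 1 machine check error detected" entries.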

Swapped the CPUs today to see if the error follows.

@Cmjohnson Not seeing a new CPU error logged in "racadm getsel", but it's also still depooled and thus not receiving traffic (and the error may show up only under load). Unless you want to do additional tests, I would go ahead and repool it?

Mentioned in SAL (#wikimedia-operations) [2018-06-07T12:15:41Z] <moritzm> repooled mw1280 after hardware maintenance (T195734)

No new errors have been logged in SEL and the server appears stable, closing the task.

Vvjjkkii renamed this task from mw1280: CPU error to 95baaaaaaa. Jul 1 2018, 1:07 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Cmjohnson as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii edited subscribers, added: Cmjohnson; removed: Aklapper.
CommunityTechBot renamed this task from 95baaaaaaa to mw1280: CPU error. Jul 2 2018, 3:35 PM
CommunityTechBot closed this task as Resolved.
CommunityTechBot assigned this task to Cmjohnson.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot edited subscribers, added: Aklapper; removed: Cmjohnson.