hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet
Closed, Resolved (Public)

Description

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.)
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

parse1002.eqiad.wmnet is down and per SEL it has a broken CPU:

Record:      165
Date/Time:   01/03/2023 02:20:24
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
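
For anyone triaging similar SEL entries in bulk, a record like the one above can be parsed into a structure with a few lines of Python. This is only a minimal sketch: the function names `parse_sel_record` and `is_critical_cpu_event` are hypothetical helpers, not part of any existing tooling, and the timestamp is assumed to be in MM/DD/YYYY HH:MM:SS form.

```python
from datetime import datetime

# The SEL record quoted in this task, verbatim.
SEL_TEXT = """\
Record:      165
Date/Time:   01/03/2023 02:20:24
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
"""


def parse_sel_record(text):
    """Parse one 'Key: value' SEL record into a dict (hypothetical helper)."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, so times like 02:20:24 stay intact.
        key, value = line.split(":", 1)
        record[key.strip()] = value.strip()
    # Assumption: the timestamp format is MM/DD/YYYY HH:MM:SS.
    if "Date/Time" in record:
        record["Date/Time"] = datetime.strptime(
            record["Date/Time"], "%m/%d/%Y %H:%M:%S"
        )
    return record


def is_critical_cpu_event(record):
    """Return True for Critical-severity records whose description mentions a CPU."""
    return (
        record.get("Severity") == "Critical"
        and "CPU" in record.get("Description", "")
    )


record = parse_sel_record(SEL_TEXT)
print(record["Record"], record["Severity"], is_critical_cpu_event(record))
```

Splitting on the first colon only is the one design choice that matters here, since the Date/Time value itself contains colons.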

I've set it to inactive in conftool, it can be powered down for analysis/replacement any time.

Host is downtimed for 7 days.

Event Timeline

Icinga downtime and Alertmanager silence (ID=8a35d570-5625-4e3d-a6ff-eb737a303711) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: CPU1 machine check error

parse1002.eqiad.wmnet
Clement_Goubert renamed this task from Broken CPU on parse1002 to hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet. (Jan 3 2023, 10:38 AM)
Clement_Goubert assigned this task to Cmjohnson.
Clement_Goubert triaged this task as High priority.
Clement_Goubert lowered the priority of this task from High to Medium.
Clement_Goubert updated Other Assignee, added: Jclark-ctr.
Clement_Goubert edited projects, added DC-Ops; removed SRE.
Clement_Goubert updated the task description.

Change 875360 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] dsh: Remove parse1002 from parsoid dsh group

https://gerrit.wikimedia.org/r/875360

Change 875360 merged by Clément Goubert:

[operations/puppet@production] dsh: Remove parse1002 from parsoid dsh group

https://gerrit.wikimedia.org/r/875360

Note: don't forget to revert this once the host is fixed.

Sorry, I did not give an update. Case# 159648923 was submitted 1/4/2023.
iDRAC was not reachable remotely. Reset iDRAC with crash cart 1/6/2023.
TSR report sent to Dell 1/6/2023.

Thanks for the update. I will extend the downtime to two weeks from now and will revisit if necessary.

Icinga downtime and Alertmanager silence (ID=5c4b686a-9560-44c1-acb3-c16978d72b37) set by cgoubert@cumin1001 for 14 days, 0:00:00 on 1 host(s) and their services with reason: CPU1 machine check error

parse1002.eqiad.wmnet

Cleared SEL. Dell requested that the system profile be set to performance.

The CPU governor is already set to performance. Is Dell talking about a different kind of system profile?

@Clement_Goubert I updated it this morning. Dell has said this will resolve our issue. I am closing this ticket and hope the issue doesn't return.

Change 877207 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] Revert "dsh: Remove parse1002 from parsoid dsh group"

https://gerrit.wikimedia.org/r/877207

@Jclark-ctr Thank you :) I'll repool the machine and remove the downtimes tomorrow.

Change 877207 merged by Clément Goubert:

[operations/puppet@production] Revert "dsh: Remove parse1002 from parsoid dsh group"

https://gerrit.wikimedia.org/r/877207

Mentioned in SAL (#wikimedia-operations) [2023-01-10T10:14:35Z] <claime> repooled parse1002.eqiad.wmnet - T326119