hw troubleshooting: hard down for wikikube-worker2142
Closed, Resolved | Public | Request

Description

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc)
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

FQDN: wikikube-worker2142.codfw.wmnet
Urgency: low (part of wikikube k8s cluster)

Hard-rebooted from the management console; the server boots but does not respond to ping.

-------------------------------------------------------------------------------
Record:      11
Date/Time:   04/07/2025 16:56:45
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 75 device 0 function 0.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   04/07/2025 16:56:45
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 74 device 2 function 0.
-------------------------------------------------------------------------------
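
For triage, the bus/device/function triples in the log excerpt above can be pulled out programmatically. The following is a minimal sketch, assuming the log is available as plain text in the same `Key: value` layout with dashed separator lines; `parse_records` and `pci_location` are hypothetical helper names, not part of any Wikimedia or Dell tooling. The numbers are returned exactly as printed in the log; whether they are decimal or hexadecimal is not stated in this task.

```python
import re

# Sample text copied from the log excerpt in this task.
SEL_TEXT = """\
-------------------------------------------------------------------------------
Record:      11
Date/Time:   04/07/2025 16:56:45
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 75 device 0 function 0.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   04/07/2025 16:56:45
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 74 device 2 function 0.
-------------------------------------------------------------------------------
"""

# Matches the "bus N device N function N" phrase used in the descriptions.
BDF_RE = re.compile(r"bus (\d+) device (\d+) function (\d+)")


def parse_records(text):
    """Split the log into a list of dicts, one per record (hypothetical helper)."""
    records = []
    current = {}
    for line in text.splitlines():
        if line.startswith("---"):
            # A dashed line closes the record collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def pci_location(record):
    """Return (bus, device, function) as printed in the log, or None."""
    match = BDF_RE.search(record.get("Description", ""))
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


if __name__ == "__main__":
    for rec in parse_records(SEL_TEXT):
        print(rec["Record"], rec["Severity"], pci_location(rec))
```

Both records carry the same timestamp and point at neighbouring bus numbers, which is consistent with a single failed component.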

Event Timeline

depool host wikikube-worker2142.codfw.wmnet by cgoubert@cumin1002 with reason: Hardware failure

Host drained forcefully and depooled.

check host wikikube-worker2142.codfw.wmnet by cgoubert@cumin1002 with reason: Hardware failure

Icinga downtime and Alertmanager silence (ID=b0849406-4915-4da3-8220-1f360a73f331) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Hardware failure

wikikube-worker2142.codfw.wmnet

starting with firmware updates. hopefully we'll get a more concise error.

the NIC card has perished. I am opening a return with Dell.

Jhancock.wm claimed this task.

@Clement_Goubert the replacement NIC arrived and has been installed. Ran the provisioning cookbook and it pings now. Let us know if you need any additional help!

Mentioned in SAL (#wikimedia-operations) [2025-04-11T15:19:42Z] <claime> homer lsw1-c2-codfw* commit T391341

pool host wikikube-worker2142.codfw.wmnet by cgoubert@cumin1002 with reason: None

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool for host wikikube-worker2142.codfw.wmnet completed:

  • wikikube-worker2142.codfw.wmnet (PASS)
    • Host wikikube-worker2142.codfw.wmnet pooled in wikikube-codfw