
hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243.eqiad.wmnet
Closed, Resolved | Public

Description

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc): Medium to high as we already have 5 servers of this k8s cluster in this exact state
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

I am seeing the following in the logs after a failed reimage of wikikube-worker1243.eqiad.wmnet:

racadm>>getsel
Record:      1
Date/Time:   07/15/2024 21:30:04
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   07/15/2024 21:38:46
Source:      system
Severity:    Ok
Description: OEM software event.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   07/15/2024 21:38:46
Source:      system
Severity:    Ok
Description: OEM software event.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   07/15/2024 21:38:49
Source:      system
Severity:    Ok
Description: OEM software event.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   07/15/2024 21:38:58
Source:      system
Severity:    Ok
Description: C: boot completed.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   07/15/2024 21:38:58
Source:      system
Severity:    Ok
Description: OEM software event.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/06/2025 09:52:02
Source:      system
Severity:    Critical
Description: The System Configuration Check operation resulted in the following issue: Comm Error: Backplane 0.
-------------------------------------------------------------------------------
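For anyone triaging several hosts in this state, the getsel output above can be filtered mechanically. The following is only a minimal sketch, assuming the log has been saved as plain text in the exact `Key: value` layout shown above; `parse_sel` and `non_ok` are hypothetical helper names, not part of racadm or any WMF tooling.

```python
import re
from typing import Dict, List

# Records in the pasted output are separated by long dashed lines.
SEPARATOR = re.compile(r"^-{5,}\s*$", re.MULTILINE)
FIELDS = {"Record", "Date/Time", "Source", "Severity", "Description"}


def parse_sel(text: str) -> List[Dict[str, str]]:
    """Split pasted getsel output into one dict per record (hypothetical helper)."""
    records = []
    for chunk in SEPARATOR.split(text):
        record = {}
        for line in chunk.splitlines():
            # Split on the first colon only, so values such as
            # "C: boot completed." or timestamps stay intact.
            key, sep, value = line.partition(":")
            if sep and key.strip() in FIELDS:
                record[key.strip()] = value.strip()
        if "Record" in record:
            records.append(record)
    return records


def non_ok(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the records whose severity is anything other than Ok."""
    return [r for r in records if r.get("Severity") != "Ok"]


# Two records copied from the log above, used as sample input.
SAMPLE = """racadm>>getsel
Record:      5
Date/Time:   07/15/2024 21:38:58
Source:      system
Severity:    Ok
Description: C: boot completed.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/06/2025 09:52:02
Source:      system
Severity:    Critical
Description: The System Configuration Check operation resulted in the following issue: Comm Error: Backplane 0.
-------------------------------------------------------------------------------
"""

if __name__ == "__main__":
    for rec in non_ok(parse_sel(SAMPLE)):
        print(rec["Date/Time"], rec["Severity"], rec["Description"])
```

Run against the full log, this would surface only record 7, the backplane communication error.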

Event Timeline

The following commands have to be executed when the host is back (just noting it down so I don't forget it):

cookbook sre.hosts.reimage --puppet 7 --new -t T377876 --os bookworm wikikube-worker1243
cookbook sre.k8s.pool-depool-node --k8s-cluster wikikube-eqiad -t T377876 pool wikikube-worker1243.eqiad.wmnet
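
As a reminder aid, the two commands above could be generated from the host name so they can be reused for the other affected workers. This is a rough sketch only: `build_commands` and `run_all` are hypothetical helpers, the flags are copied verbatim from the two commands above, and the script defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
from typing import List


def build_commands(host: str, domain: str, cluster: str, task: str) -> List[List[str]]:
    """Build the reimage and pool commands as argument lists (hypothetical helper)."""
    fqdn = f"{host}.{domain}"
    reimage = ["cookbook", "sre.hosts.reimage", "--puppet", "7", "--new",
               "-t", task, "--os", "bookworm", host]
    pool = ["cookbook", "sre.k8s.pool-depool-node", "--k8s-cluster", cluster,
            "-t", task, "pool", fqdn]
    return [reimage, pool]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; when dry_run is False, run them in order.

    check=True raises on a non-zero exit, so the node is not pooled
    if the reimage command fails.
    """
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(build_commands("wikikube-worker1243", "eqiad.wmnet",
                           "wikikube-eqiad", "T377876"))
```

The dry run prints the same two command lines noted above; passing `dry_run=False` would execute them and is only meant to be done from a cumin host once the hardware is fixed.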
JMeybohm renamed this task from Comm Error: backplane 0 when reimaging wikikube-worker1243 to hw troubleshooting: "Comm Error: backplane 0" for reimaging wikikube-worker1243. (Jan 6 2025, 1:37 PM)
JMeybohm assigned this task to Jclark-ctr.
JMeybohm updated the task description.
JMeybohm renamed this task from hw troubleshooting: "Comm Error: backplane 0" for reimaging wikikube-worker1243 to hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243. (Jan 6 2025, 1:40 PM)
JMeybohm renamed this task from hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243 to hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243.eqiad.wmnet.
JMeybohm updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1243.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1243.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1243 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501091608_jclark_1271721_wikikube-worker1243.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status failed -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Reimage passed with no issues.

Thanks @Jclark-ctr for the quick help and running the reimage one more time. The host looks good to me now.

I executed our add_k8s_node.py script, ran homer, and pooled the node, so this host is also done regarding T377876.