
Hardware error on elastic2094 - Comm Error: Backplane 0.
Closed, Resolved, Public

Assigned To
Authored By
bking
Jan 24 2024, 10:13 PM

Description

Update

elastic2088 is now working
elastic2094 appears to have a hardware issue

Original description below

Hello DC Ops,

In the parent ticket, I've attempted to reimage these hosts to bullseye multiple times. I've attempted to upgrade the firmware, but that didn't seem to help. If I get on the DRAC during the installation, the installation appears to complete and the host reboots, but it looks like it's trying to PXE boot even after the installation, and eventually the console goes completely dark and I can't see anything.

Thanks for taking a look! Feel free to ping me in IRC (inflatador) if you need more info.

Event Timeline

BTullis subscribed.

I'm going to have a crack at these reimages, if that's OK.
Please let me know if I tread on anyone's toes, or if anyone else would like to take over.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host elastic2088.codfw.wmnet with OS bullseye

I tried a reimage of elastic2088 and it hung completely at the PXE prompt for at least 20 minutes before I switched it off.

elastic2088 boot.png (417×1 px, 72 KB)

I checked the firmware of the integrated 10 Gb NIC and it was at the correct version 21.85.21.92 as per: this note.

The sre.hardware.upgrade-firmware confirmed this:

btullis@cumin2002:~$ sudo cookbook sre.hardware.upgrade-firmware -c nic -n elastic2088.codfw.wmnet
<snip snip>
elastic2088.codfw.wmnet (Gen 15): starting
elastic2088.codfw.wmnet (NETWORK): update
elastic2088.codfw.wmnet (NETWORK): current version: 21.85.21.92

I tried forcibly reinstalling the same firmware version.

btullis@cumin2002:~$ sudo cookbook sre.hardware.upgrade-firmware -c nic -n -f elastic2088.codfw.wmnet
Acquired lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2024-02-01 12:57:37.846145', 'owner': 'btullis@cumin2002 [1617017]', 'ttl': 1800}
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2088.codfw.wmnet']
Acquired lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:elastic2088: {'concurrency': 1, 'created': '2024-02-01 12:57:38.236604', 'owner': 'btullis@cumin2002 [1617017]', 'ttl': 3600}
Management Password: 
elastic2088.codfw.wmnet (Gen 15): starting
elastic2088.codfw.wmnet (NETWORK): update
elastic2088.codfw.wmnet (NETWORK): current version: 21.85.21.92
poweredge-r450: picking DellDriverCategory.NETWORK update file
We have found multiple entries please pick from the list below:
0: /srv/firmware/poweredge-r450/NETWORK/Network_Firmware_4G8G9_WN64_22.61.8.EXE
1: /srv/firmware/poweredge-r450/NETWORK/Network_Firmware_TD0M9_WN64_15.20.16_A00-00.EXE
2: /srv/firmware/poweredge-r450/NETWORK/Network_Firmware_8FKR1_WN64_22.31.13.70.EXE
3: /srv/firmware/poweredge-r450/NETWORK/Network_Firmware_RXP80_WN64_21.85.21.92.EXE
4: Download new file
==> Please select the entry you want
> 3
User input is: "3"
elastic2088.codfw.wmnet (NETWORK): target_version: 21.85.21.92, current_version: 21.85.21.92
==> elastic2088.codfw.wmnet NETWORK: About to upload /srv/firmware/poweredge-r450/NETWORK/Network_Firmware_RXP80_WN64_21.85.21.92.EXE, please confirm
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
elastic2088.codfw.wmnet: skipping reboot version already correct (/redfish/v1/Chassis/System.Embedded.1/NetworkAdapters/NIC.Integrated.1)
Released lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:elastic2088: {'concurrency': 1, 'created': '2024-02-01 12:57:38.236604', 'owner': 'btullis@cumin2002 [1617017]', 'ttl': 3600}
Released lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2024-02-01 12:57:37.846145', 'owner': 'btullis@cumin2002 [1617017]', 'ttl': 1800}
END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2088.codfw.wmnet']
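For reference, the manual choice of entry 3 in the prompt above can be reproduced with a small script. This is only a sketch: the helper names `version_from_filename` and `matching_entries` are hypothetical and not part of the cookbook, and it assumes the version string always follows `_WN64_` in the file name, as it does in the four files listed.

```python
import re

# File listing copied from the cookbook prompt above.
candidates = [
    "/srv/firmware/poweredge-r450/NETWORK/Network_Firmware_4G8G9_WN64_22.61.8.EXE",
    "/srv/firmware/poweredge-r450/NETWORK/Network_Firmware_TD0M9_WN64_15.20.16_A00-00.EXE",
    "/srv/firmware/poweredge-r450/NETWORK/Network_Firmware_8FKR1_WN64_22.31.13.70.EXE",
    "/srv/firmware/poweredge-r450/NETWORK/Network_Firmware_RXP80_WN64_21.85.21.92.EXE",
]


def version_from_filename(path):
    """Hypothetical helper: return the dotted version that follows '_WN64_'.

    Assumption: the naming pattern seen in the listing holds for every file.
    """
    match = re.search(r"_WN64_(\d+(?:\.\d+)+)", path)
    return match.group(1) if match else None


def matching_entries(paths, current_version):
    """Return the indexes of files whose version equals the reported one."""
    return [
        index
        for index, path in enumerate(paths)
        if version_from_filename(path) == current_version
    ]


# The cookbook reported "current version: 21.85.21.92".
print(matching_entries(candidates, "21.85.21.92"))
```

Only one file carries the version the NIC already reports, which is why entry 3 was the one to pick for a same-version reinstall.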

I then gave it a cold reboot from the BMC and selected to boot from PXE manually, which worked. The reimaging cookbook is continuing now.

So I can't yet tell if it was the cold boot that fixed it or the reinstalled firmware.

I'll try giving elastic2094 a cold boot before the reimaging cookbook the next time, to see if we can identify whether or not the firmware reinstallation is required.

I gave elastic2094 a cold boot, then started the reimage cookbook.
It is reporting the following error on the console.

image.png (436×733 px, 49 KB)

HWC8010: The System Configuration Check operation resulted in the following
issue: Comm Error: Backplane 0.

UEFI0116: One or more boot drivers have reported issue(s).
Check the Driver Health Menu in Boot Manager for details.

Launching the boot manager, the One-shot BIOS Boot Menu entry is disabled.

image.png (459×776 px, 27 KB)

image.png (443×750 px, 42 KB)

I'll try reinstalling the iDRAC firmware.

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:

  • elastic2088 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402011325_btullis_1413930_elastic2088.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host elastic2088.codfw.wmnet with OS bullseye

It looks like elastic2094 may have some kind of hardware problem.

image.png (1×1 px, 221 KB)

I have tried both cold booting the server and cold resetting the BMC, but it hasn't made any difference.
@Papaul - could I hand this server over to you please, to have a look at?

BTullis renamed this task from "Unable to reimage elastic2088 and elastic2094 to bullseye" to "Hardware error on elastic2094 - Comm Error: Backplane 0." (Feb 1 2024, 3:01 PM)
BTullis updated the task description.

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:

  • elastic2088 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host elastic2088.codfw.wmnet with OS bullseye

I have restarted the reimage cookbook for elastic2088, as I realise that I should have selected Puppet 7 instead of Puppet 5.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host elastic2094.codfw.wmnet with OS bullseye

I updated the system BIOS on elastic2094 from version 1.11.2 to version 1.12.1 but it didn't make any difference to the error.

elastic2094 bios update.png (1×2 px, 97 KB)

image.png (1×2 px, 228 KB)

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:

  • elastic2094 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
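When comparing several of these summaries, it can help to pull out the last step that was reached before the failure. The following is a rough sketch that works on the text as pasted in this task; `parse_summary` is a hypothetical helper, not something the reimage cookbook provides, and it assumes the bullet layout shown above.

```python
import re

# Summary pasted from the failed elastic2094 run above.
summary = """\
  • elastic2094 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
"""


def parse_summary(text):
    """Hypothetical helper: return (host, status, last step reached).

    Assumption: the first non-empty line is '<host> (PASS|FAIL)' and every
    following line is one step, each prefixed with a bullet character.
    """
    lines = [
        line.strip().lstrip("•").strip()
        for line in text.splitlines()
        if line.strip()
    ]
    header = re.fullmatch(r"(\S+) \((PASS|FAIL)\)", lines[0])
    if header is None:
        raise ValueError("unexpected summary header: " + lines[0])
    host, status = header.groups()
    steps = lines[1:]
    # Drop the generic failure notice so the last real step is reported.
    if status == "FAIL" and steps and steps[-1].startswith("The reimage failed"):
        steps = steps[:-1]
    last_step = steps[-1] if steps else None
    return host, status, last_step


print(parse_summary(summary))
```

For this run the last step reached is "Host rebooted via IPMI", so the host never got as far as "Host up (Debian installer)", which fits the console error seen at boot.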

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2088.codfw.wmnet with OS bullseye completed:

  • elastic2088 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402011638_btullis_1447466_elastic2088.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@BTullis I can reseat the backplane to try and fix this. Is it safe for me to do so, or are you currently working on it?

Thanks @Jhancock.wm - you can shut down this machine and try reseating the backplane at any time.

@BTullis looks like it worked. But since that backplane error occurred twice already, if it happens again lmk and I'll put in a ticket with Dell for a replacement.

Great, thanks. I'll give it another go now and see if the installation completes this time.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host elastic2094.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2094.codfw.wmnet with OS bullseye completed:

  • elastic2094 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402061526_btullis_2472791_elastic2094.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
BTullis claimed this task.

Thanks @Jhancock.wm - The reimage cookbook hung once at PXE boot, but I gave it a power cycle and manually selected to boot from PXE. Now we're all up and running. I'll close this ticket for now, but we can come back to it if we see this backplane error during any further reboots/reimages.