
Fatal error detected on elastic2088
Closed, Resolved (Public)

Description

Hello DC Ops,

elastic2088 is failing to reimage; it doesn't seem to respond to PXE. Based on comments in T355830 I thought it was fixed, but the SEL (System Event Log) shows the following message:

A fatal error was detected on a component at bus 100 device 4 function 0. Mon Mar 04 2024 21:01:51

Are you able to take a look at the hardware? Let us know if you need more info.

Event Timeline

Jhancock.wm added subscribers: Papaul, Jhancock.wm.

We actually have two devices with errors:
component at bus 100 device 4 function 0
component at bus 101 device 0 function 0

I'm not sure which devices these are, beyond the fact that they are PCI devices, so I did the following:

Checked the firmware: BIOS and iDRAC are up to date.
Forced an update of the NIC firmware.
Updated the HBA card firmware from 24.10 to 24.15.14.
Reset the iDRAC and did a power drain.
Reseated the HBA card and the two empty risers, just in case.
Confirmed PXE is enabled, just in case.
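
For anyone trying to work out which devices the two SEL entries point at, here is a minimal sketch that converts a SEL location into the hexadecimal bus:device.function form that lspci expects. It assumes the SEL prints the bus, device, and function numbers in decimal, which this task does not confirm. The helper name `sel_to_bdf` is made up for illustration.

```python
import re

# Assumption: the SEL text reports bus/device/function as decimal numbers.
SEL_PATTERN = re.compile(r"bus (\d+) device (\d+) function (\d+)")


def sel_to_bdf(message: str) -> str:
    """Return the location in lspci's hexadecimal bus:device.function form."""
    match = SEL_PATTERN.search(message)
    if match is None:
        raise ValueError(f"no bus/device/function found in: {message!r}")
    bus, device, function = (int(part) for part in match.groups())
    return f"{bus:02x}:{device:02x}.{function:x}"


for entry in (
    "component at bus 100 device 4 function 0",
    "component at bus 101 device 0 function 0",
):
    print(entry, "->", sel_to_bdf(entry))
```

If the decimal assumption holds, running `lspci -s` with the resulting address on the booted host should show which card sits at that location.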

If this happens again we have all the documentation we need to open a ticket with Dell to replace the component.

Please try the reimage again and let me know if you have trouble.

@bking the PXE boot issue was that both the 10G and 1G NICs were set to PXE boot, which is why it was failing. I disabled PXE boot on the 1G NIC and all is good now.
You can resume the reimage.

Change #1016009 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: Add elastic2088 back to production

https://gerrit.wikimedia.org/r/1016009

Change #1016009 merged by Bking:

[operations/puppet@production] elastic: Add elastic2088 back to production

https://gerrit.wikimedia.org/r/1016009