PXE boot NIC firmware regression
Open, Needs Triage, Public

Description

Follow up from the long investigation in T303776#7781198 and T303776#7797564.

We finally found the root cause and @Papaul identified the faulty firmware version:

Downgrading the NIC firmware on cloudvirt1025 and cloudvirt1026 from 22.00.07.60 to 21.60.22.11 fixed the "Failed to load ldlinux.c32" issue.

I'm not a PXE expert, so I don't know whether the issue can be solved by updating the lpxelinux.0 binary, but at this point we should:

  • Follow up with the vendor so they can provide a fixed firmware (or guidance on how to work around the issue)
  • Check if there is any server running the faulty version and downgrade the firmware (or at least warn the service owner)
  • List all the servers using the same NIC so we make sure to not upgrade them (and see the scale of the potential issue)
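The last two checks could be sketched roughly as follows, assuming the NIC firmware version of each host has already been exported somewhere (for example from PuppetDB) into a plain host-to-version mapping. The function names and the sample hostnames are hypothetical; only the firmware versions come from this task.

```python
# Hypothetical triage helper, not an existing tool.
# It only compares version strings; it does not query any host or inventory system.

# Firmware version identified as faulty in this task.
KNOWN_BAD = {"22.00.07.60"}


def parse_version(text):
    """Turn a dotted version such as '22.00.07.60' into (22, 0, 7, 60)."""
    return tuple(int(part) for part in text.strip().split("."))


def triage(inventory):
    """Split {hostname: firmware_version} into two sorted lists.

    affected: hosts already running a known-bad version (downgrade or warn owner)
    hold:     hosts on any other version (do not upgrade them for now)
    """
    bad = {parse_version(v) for v in KNOWN_BAD}
    affected = sorted(h for h, v in inventory.items() if parse_version(v) in bad)
    hold = sorted(h for h, v in inventory.items() if parse_version(v) not in bad)
    return affected, hold


if __name__ == "__main__":
    # Sample data: hostnames are made up, versions are the ones seen in this task.
    sample = {
        "example1001": "22.00.07.60",
        "example1002": "21.60.22.11",
        "example1003": "21.85.21.92",
    }
    affected, hold = triage(sample)
    print("running the faulty firmware:", affected)
    print("same NIC, do not upgrade:", hold)
```

Matching on the exact faulty version rather than on "anything newer than 21.x" is deliberate here, since the task only establishes that 22.00.07.60 is bad and says nothing about later releases.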

Event Timeline

@Papaul, cloudvirt1024 reports using version 21.85.21.92 in PuppetDB, so is that version a good one too? Is there a firmware changelog somewhere?

The 22.0 firmware is in fact really fresh, it was only released two weeks ago: https://www.dell.com/support/home/en-uk/drivers/driversdetails?driverid=npnt5&lwp=rt
I think it's worth reporting, maybe it's a regression that others have also seen.

Fixes:
- Fix performance concerns in Multicast streaming environment.

Enhancements:
- Add firmware upgrade support for new adapters.
- Change Number of MSI-X Vectors per VF default to 8 for 5750x adapters.

@ayounsi cloudvirt1024 is an R440; with version 22.00.07.60 we get "Failed to load ldlinux.c32", and with version 21.60.22.11 we get the error below:

Network autoconfiguration failed
Your network is probably not using the DHCP protocol. Alternatively,
the DHCP server may be slow or some network hardware is not working
properly.

On the other hand, cloudvirt1025 and cloudvirt1026 are R640s that get the same "Failed to load ldlinux.c32" error with version 22.00.07.60 but work with version 21.60.22.11.
The next version available for the R640 after 21.60.22.11 is 21.85.21.92, the same one running on cloudvirt1024. I was hoping to test 21.85.21.92 on cloudvirt1027 today to make sure it works on the R640, so that we can have all the servers running 21.85.21.92, but it looks like @Andrew already re-imaged cloudvirt1027.

It looks like stat1010 might also be affected by this issue. I've had successive PXE boot failures from the sre.hosts.reimage cookbook.
DHCP is successful and the correct DHCP fragment appears to be in place on install1003.

image.png (318×714 px, 30 KB)

btullis@install1003:/etc/dhcp$ cat automation/ttyS1-115200/stat1010.conf

host stat1010 {
    host-identifier option agent.circuit-id "lsw1-e1-eqiad:xe-0/0/24.0:analytics1-e1-eqiad";
    fixed-address 10.64.138.6;
    option pxelinux.pathprefix "http://apt.wikimedia.org/tftpboot/bullseye-installer/";
}

I'm investigating firmware levels now.

stat1010 also has version 22.00.07.60 installed on the NIC.

image.png (406×733 px, 30 KB)

According to Dell this still appears to be the latest version of the firmware.
I will try downgrading to 21.85.21.92 as suggested earlier in this ticket.

Yes, it worked! I didn't make any other changes apart from downgrading from 22.00.07.60 to 21.85.21.92, and now it has obtained the ldlinux.c32 file over TFTP.
Please feel free to let me know if there's anything else that I can usefully do to help with reporting this issue to Dell or remediating it across any other new servers.

I used the virtual media option within the iDRAC, so I haven't made the firmware available by TFTP/HTTP anywhere. If it would help I can write up the method I used.

Hi folks! I ran into a problem when trying to reimage kafka-main1005 to Bullseye, and Moritz told me that there should be a new NIC firmware to test :)