
PXE boot NIC firmware regression
Open, Medium, Public

Description

Follow up from the long investigation in T303776#7781198 and T303776#7797564.

We finally found the root cause and @Papaul identified the faulty firmware version:

Downgrading the NIC firmware on cloudvirt1025 and cloudvirt1026 from 22.00.07.60 to 21.60.22.11 fixed the "Failed to load ldlinux.c32" issue.

I'm not a PXE expert, so I don't know whether the issue can be solved by updating the lpxelinux.0 binary, but at this point we should:

  • Follow up with the vendor so they can provide a fixed firmware (or guidance on how to work around the issue)
  • Check if there is any server running the faulty version and downgrade the firmware (or at least warn the service owner)
  • List all the servers using the same NIC so we make sure to not upgrade them (and see the scale of the potential issue)
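The second and third checks above could be scripted once the firmware versions are exported from the inventory. The following is a minimal sketch, assuming a hypothetical host-to-version mapping (for example built from PuppetDB facts); the host names in it are placeholders, and only 22.00.07.60 is treated as faulty because that is the only version identified as such here.

```python
from collections import Counter

# Firmware versions identified as faulty in this task.
KNOWN_BAD = {"22.00.07.60"}


def flag_hosts(inventory):
    """Return the sorted hostnames whose NIC firmware is a known-bad version."""
    return sorted(host for host, version in inventory.items() if version in KNOWN_BAD)


def versions_in_use(inventory):
    """Count how many hosts run each firmware version, to gauge the scale."""
    return Counter(inventory.values())


if __name__ == "__main__":
    # Hypothetical inventory: placeholder host names, not real data.
    inventory = {
        "host-a": "22.00.07.60",
        "host-b": "21.85.21.92",
        "host-c": "21.60.22.11",
    }
    print("Hosts to downgrade or warn about:", flag_hosts(inventory))
    print("Versions in use:", dict(versions_in_use(inventory)))
```

The inventory would also need to be filtered to hosts with the same NIC model, which this sketch does not attempt.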

Event Timeline

@Papaul, cloudvirt1024 reports using version 21.85.21.92 in PuppetDB, so is that version a good one too? Is there a firmware changelog somewhere?

The 22.0 firmware is in fact really fresh: it was only released two weeks ago: https://www.dell.com/support/home/en-uk/drivers/driversdetails?driverid=npnt5&lwp=rt
I think it's worth reporting, maybe it's a regression that others have also seen.

Fixes:
- Fix performance concerns in Multicast streaming environment.

Enhancements:
- Add firmware upgrade support for new adapters.
- Change Number of MSI-X Vectors per VF default to 8 for 5750x adapters.

@ayounsi cloudvirt1024 is an R440. With version 22.00.07.60 we get the "Failed to load ldlinux.c32" error; with version 21.60.22.11 we get the error below:

Network autoconfiguration failed                    
   │ Your network is probably not using the DHCP protocol. Alternatively,  │    
   │ the DHCP server may be slow or some network hardware is not working   │    
   │ properly.

On the other hand, cloudvirt1025 and cloudvirt1026 are R640s, which get the same error with version 22.00.07.60 but work with version 21.60.22.11.
The next version available for the R640 after 21.60.22.11 is 21.85.21.92, the same one running on cloudvirt1024. I was hoping to test 21.85.21.92 on cloudvirt1027 today to make sure it works on the R640, so that we could have all the servers running 21.85.21.92, but it looks like @Andrew already re-imaged cloudvirt1027.

It looks like stat1010 might also be affected by this issue. I've had successive PXE boot failures from the sre.hosts.reimage cookbook.
DHCP is successful and the correct DHCP fragment appears to be in place on install1003.

image.png (318×714 px, 30 KB)

btullis@install1003:/etc/dhcp$ cat automation/ttyS1-115200/stat1010.conf

host stat1010 {
    host-identifier option agent.circuit-id "lsw1-e1-eqiad:xe-0/0/24.0:analytics1-e1-eqiad";
    fixed-address 10.64.138.6;
    option pxelinux.pathprefix "http://apt.wikimedia.org/tftpboot/bullseye-installer/";
}

I'm investigating firmware levels now.

This also has version 22.00.07.60 installed to the NIC.

image.png (406×733 px, 30 KB)

According to Dell this still appears to be the latest version of the firmware.
I will try downgrading to 21.85.21.92 as suggested earlier in this ticket.

Yes, it worked! I didn't make any other changes, apart from downgrading 22.00.07.60 to 21.85.21.92 and now it has obtained the ldlinux.c32 file over TFTP.
Please feel free to let me know if there's anything else I can usefully do to help report this issue to Dell or remediate it across any other new servers.

I used the virtual media option within the iDRAC, so I haven't made the firmware available by TFTP/HTTP anywhere. If it would help I can write up the method I used.

Hi folks! I ran into a problem when trying to reimage kafka-main1005 to Bullseye, and Moritz told me that there should be a new NIC firmware to test :)

joanna_borun raised the priority of this task from Low to Medium. Jan 22 2024, 4:24 PM

I tried to upgrade the firmware on sretest1003 to investigate this issue further, but the cookbook doesn't seem to work as expected:

[...]
[IDRAC.2.8.PR19] Job completed successfully.
sretest1003 (NETWORK): now at version: 21.85.21.92
sretest1003 (NETWORK): Something went wrong, the current version (21.85.21.92) does not match the most target (22.71.3)

@Papaul @RobH could you upgrade sretest1003 (or any other host) to the most recent firmware version so I can have a closer look at this?

I'd particularly like to look at the value of DHCP option 93, to know whether it's still properly booting as BIOS. Maybe the issue has also been fixed in more recent versions.

Thanks Papaul for upgrading the firmware.

Unfortunately the same issue still happens.

Screenshot from 2024-01-23 17-46-23.png (605×866 px, 66 KB)

For the record, here is the full DHCP exchange:

16:43:26.271747 IP (tos 0x0, ttl 64, id 43351, offset 0, flags [none], proto UDP (17), length 635)
    ae1-1017.cr2-eqiad.wikimedia.org.bootps > install1004.wikimedia.org.bootps: [udp sum ok] BOOTP/DHCP, Request from 00:62:0b:c8:9c:50 (oui Unknown), length 607, hops 1, xid 0xcc89c50, secs 4, Flags [Broadcast] (0x8000)
	  Gateway-IP ae1-1017.cr2-eqiad.wikimedia.org
	  Client-Ethernet-Address 00:62:0b:c8:9c:50 (oui Unknown)
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message (53), length 1: Discover
	    Parameter-Request (55), length 24: 
	      Subnet-Mask (1), Time-Zone (2), Default-Gateway (3), IEN-Name-Server (5)
	      Domain-Name-Server (6), RL (11), Hostname (12), BS (13)
	      Domain-Name (15), SS (16), RP (17), EP (18)
	      Vendor-Option (43), Server-ID (54), Vendor-Class (60), BF (67)
	      Unknown (128), Unknown (129), Unknown (130), Unknown (131)
	      Unknown (132), Unknown (133), Unknown (134), Unknown (135)
	    MSZ (57), length 2: 1260
	    GUID (97), length 17: 0.68.69.76.76.89.0.16.75.128.78.180.192.79.77.87.51
	    ARCH (93), length 2: 0
	    NDI (94), length 3: 1.2.1
	    Vendor-Class (60), length 32: "PXEClient:Arch:00000:UNDI:002001"
	    Agent-Information (82), length 56: 
	      Circuit-ID SubOption 1, length 41: asw2-a-eqiad:xe-4/0/45.0:private1-a-eqiad
	      Remote-ID SubOption 2, length 11: xe-4/0/45.0
	    END (255), length 0
	    PAD (0), length 0, occurs 154
	    Hostname (12), length 20: "M-i^P^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@"
	    PAD (0), length 0, occurs 22
	    Subnet-Mask (1), length 192:  [|bootp]
16:43:26.272267 IP (tos 0x0, ttl 64, id 39864, offset 0, flags [DF], proto UDP (17), length 469)
    install1004.wikimedia.org.bootps > ae1-1017.cr2-eqiad.wikimedia.org.bootps: [bad udp cksum 0x76b0 -> 0x2baa!] BOOTP/DHCP, Reply, length 441, hops 1, xid 0xcc89c50, secs 4, Flags [Broadcast] (0x8000)
	  Your-IP sretest1003.eqiad.wmnet
	  Server-IP install1004.wikimedia.org
	  Gateway-IP ae1-1017.cr2-eqiad.wikimedia.org
	  Client-Ethernet-Address 00:62:0b:c8:9c:50 (oui Unknown)
	  file "lpxelinux.0"
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message (53), length 1: Offer
	    Server-ID (54), length 4: install1004.wikimedia.org
	    Lease-Time (51), length 4: 43200
	    Subnet-Mask (1), length 4: 255.255.252.0
	    Default-Gateway (3), length 4: vrrp-gw-1017.eqiad.wmnet
	    Domain-Name-Server (6), length 4: recdns.anycast.wmnet
	    Domain-Name (15), length 11: "eqiad.wmnet"
	    RP (17), length 10: "/tftpboot/"
	    Vendor-Option (43), length 82: 209.25.112.120.101.108.105.110.117.120.46.99.102.103.47.116.116.121.83.49.45.49.49.53.50.48.48.210.53.104.116.116.112.58.47.47.97.112.116.46.119.105.107.105.109.101.100.105.97.46.111.114.103.47.116.102.116.112.98.111.111.116.47.98.111.111.107.119.111.114.109.45.105.110.115.116.97.108.108.101.114.47
	    Agent-Information (82), length 56: 
	      Circuit-ID SubOption 1, length 41: asw2-a-eqiad:xe-4/0/45.0:private1-a-eqiad
	      Remote-ID SubOption 2, length 11: xe-4/0/45.0
	    END (255), length 0

So option 93 is set to 0 (the x86 BIOS client architecture), as it should be.
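The Vendor-Option (43) payload in the offer above is printed by tcpdump as dotted decimal bytes, which makes it hard to see what the server actually sends. The sketch below decodes it, assuming the payload is a plain sequence of code/length/value sub-options (which the total length of 82 bytes is consistent with). The function name `parse_suboptions` is illustrative and not part of any existing tooling.

```python
def parse_suboptions(payload: bytes) -> dict:
    """Parse a blob of (code, length, value) sub-options into {code: value}."""
    result = {}
    i = 0
    while i < len(payload):
        code = payload[i]
        if code == 255:  # END marker
            break
        if code == 0:  # PAD byte, no length field
            i += 1
            continue
        length = payload[i + 1]
        result[code] = payload[i + 2 : i + 2 + length]
        i += 2 + length
    return result


# Option 43 bytes copied from the DHCP offer captured in this task.
DOTTED = (
    "209.25.112.120.101.108.105.110.117.120.46.99.102.103.47.116.116.121.83."
    "49.45.49.49.53.50.48.48.210.53.104.116.116.112.58.47.47.97.112.116.46."
    "119.105.107.105.109.101.100.105.97.46.111.114.103.47.116.102.116.112.98."
    "111.111.116.47.98.111.111.107.119.111.114.109.45.105.110.115.116.97.108."
    "108.101.114.47"
)
payload = bytes(int(b) for b in DOTTED.split("."))

if __name__ == "__main__":
    for code, value in parse_suboptions(payload).items():
        print(code, value.decode("ascii"))
```

This prints sub-option 209 as `pxelinux.cfg/ttyS1-115200` and sub-option 210 as `http://apt.wikimedia.org/tftpboot/bookworm-installer/`, i.e. the PXELINUX configuration file and path prefix, so the offer carries the expected HTTP path prefix.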

I have no real idea where else to look. It could be worth following up with Dell/Broadcom support.

@Papaul @RobH what's the best path to follow up with Dell/Broadcom?