
PXE boot NIC firmware regression
Closed, Resolved (Public)

Description

Follow up from the long investigation in T303776#7781198 and T303776#7797564.

We finally found the root cause and @Papaul identified the faulty firmware version:

Downgrading the NIC firmware on cloudvirt1025 and cloudvirt1026 from 22.00.07.60 to 21.60.22.11 fixed the "Failed to load ldlinux.c32" issue.

I'm not a PXE expert, so I don't know whether the issue can be solved by updating the lpxelinux.0 binary, but at this point we should:

  • Follow up with the vendor so they can provide a fixed firmware (or guidance on how to work around the issue)
  • Check if there is any server running the faulty version and downgrade the firmware (or at least warn the service owner)
  • List all the servers using the same NIC so we make sure to not upgrade them (and see the scale of the potential issue)
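
The second and third action items amount to filtering a host inventory by NIC firmware version. A minimal sketch of that check follows; the inventory dict and its host names are made up for illustration (the real data would come from PuppetDB or a similar source), and the known-bad set only contains the version identified in this task.

```python
from typing import Dict, List, Tuple

# Firmware version reported as broken in this task. Extend the set if
# further versions turn out to be affected.
KNOWN_BAD_VERSIONS = {"22.00.07.60"}

# Hypothetical inventory (host -> NIC firmware version), for illustration only.
EXAMPLE_INVENTORY: Dict[str, str] = {
    "example1001": "21.60.22.11",
    "example1002": "22.00.07.60",
    "example1003": "21.85.21.92",
    "example1004": "22.00.07.60",
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '21.85.21.92' into (21, 85, 21, 92) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def hosts_on_bad_firmware(inventory: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (host, version) pairs whose NIC firmware is in the known-bad set."""
    return sorted(
        (host, version)
        for host, version in inventory.items()
        if version in KNOWN_BAD_VERSIONS
    )


def versions_in_use(inventory: Dict[str, str]) -> List[str]:
    """Return the distinct firmware versions in the fleet, oldest first."""
    return sorted(set(inventory.values()), key=parse_version)


if __name__ == "__main__":
    for host, version in hosts_on_bad_firmware(EXAMPLE_INVENTORY):
        print(f"{host}: {version} (needs downgrade or a warning to the owner)")
    print("versions in use:", ", ".join(versions_in_use(EXAMPLE_INVENTORY)))
```

Matching on an explicit set rather than "anything newer than X" is deliberate here: at this point only one version is confirmed bad.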

Event Timeline

@Papaul, cloudvirt1024 reports using version 21.85.21.92 in PuppetDB, so is that version a good one too? Is there a firmware changelog somewhere?

The 22.0 firmware is in fact really fresh, it was only released two weeks ago: https://www.dell.com/support/home/en-uk/drivers/driversdetails?driverid=npnt5&lwp=rt
I think it's worth reporting, maybe it's a regression that others have also seen.

Fixes:
- Fix performance concerns in Multicast streaming environment.

Enhancements:
- Add firmware upgrade support for new adapters.
- Change Number of MSI-X Vectors per VF default to 8 for 5750x adapters.

@ayounsi cloudvirt1024 is an R440. With version 22.00.07.60 we are getting "Failed to load ldlinux.c32"; with version 21.60.22.11 we are getting the error below:

Network autoconfiguration failed
Your network is probably not using the DHCP protocol. Alternatively,
the DHCP server may be slow or some network hardware is not working
properly.

On the other hand, cloudvirt1025 and cloudvirt1026 are R640s; they hit the same ldlinux.c32 error with version 22.00.07.60 but work with version 21.60.22.11.
The next version available for the R640 after 21.60.22.11 is 21.85.21.92, the same one running on cloudvirt1024. I was hoping to test 21.85.21.92 on cloudvirt1027 today to make sure it works on the R640, so that we can have all the servers running 21.85.21.92, but it looks like @Andrew has already re-imaged cloudvirt1027.

It looks like stat1010 might also be affected by this issue. I've had successive PXE boot failures from the sre.hosts.reimage cookbook,
even though DHCP is successful and the correct DHCP fragment appears to be in place on install1003.

image.png (318×714 px, 30 KB)

btullis@install1003:/etc/dhcp$ cat automation/ttyS1-115200/stat1010.conf

host stat1010 {
    host-identifier option agent.circuit-id "lsw1-e1-eqiad:xe-0/0/24.0:analytics1-e1-eqiad";
    fixed-address 10.64.138.6;
    option pxelinux.pathprefix "http://apt.wikimedia.org/tftpboot/bullseye-installer/";
}

I'm investigating firmware levels now.

This host also has version 22.00.07.60 installed on the NIC.

image.png (406×733 px, 30 KB)

According to Dell this still appears to be the latest version of the firmware.
I will try downgrading to 21.85.21.92 as suggested earlier in this ticket.

Yes, it worked! I didn't make any other changes, apart from downgrading 22.00.07.60 to 21.85.21.92 and now it has obtained the ldlinux.c32 file over TFTP.
Please feel free to let me know if there's anything else that I can usefully do to help report this issue to Dell or remediate it across any other new servers.

I used the virtual media option within the iDRAC, so I haven't made the firmware available by TFTP/HTTP anywhere. If it would help I can write up the method I used.

Hi folks! I hit a problem when trying to reimage kafka-main1005 to Bullseye, and Moritz told me that there should be a new NIC firmware to test :)

joanna_borun raised the priority of this task from Low to Medium. Jan 22 2024, 4:24 PM

I tried to upgrade the firmware on sretest1003 to investigate this issue further, but the cookbook doesn't seem to work as expected:

[...]
[IDRAC.2.8.PR19] Job completed successfully.
sretest1003 (NETWORK): now at version: 21.85.21.92
sretest1003 (NETWORK): Something went wrong, the current version (21.85.21.92) does not match the most target (22.71.3)

@Papaul @RobH could you upgrade sretest1003 (or any other host) to the most recent firmware version so I can have a closer look at this?

I'd particularly like to look at DHCP option 93's value, to know if it's still properly booting as BIOS. And maybe the issue has been fixed in more recent versions.
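
For reference, option 93 (Client System Architecture) carries one or more 16-bit architecture type codes, and type 0 is the legacy BIOS PC. A minimal sketch of decoding the raw option payload, assuming the code table from RFC 4578; the function name is made up for illustration and only a few codes are listed.

```python
import struct
from typing import List

# A subset of the architecture type codes listed in RFC 4578.
# The "(legacy BIOS)" note is an annotation added here, not RFC wording.
CLIENT_ARCH = {
    0: "Intel x86PC (legacy BIOS)",
    6: "EFI IA32",
    7: "EFI BC",
    9: "EFI x86-64",
}


def decode_option_93(payload: bytes) -> List[str]:
    """Decode the payload of DHCP option 93 into readable architecture names.

    The payload is a sequence of big-endian 16-bit type codes, so its length
    must be a non-zero multiple of two.
    """
    if not payload or len(payload) % 2:
        raise ValueError(f"invalid option 93 length: {len(payload)}")
    codes = struct.unpack(f"!{len(payload) // 2}H", payload)
    return [CLIENT_ARCH.get(code, f"unknown ({code})") for code in codes]


if __name__ == "__main__":
    # A two-byte payload of zeros is what a BIOS PXE client is expected to send.
    print(decode_option_93(b"\x00\x00"))
```

A client booting in UEFI mode would be expected to send a non-zero code here, which is why this value is a quick way to confirm the boot mode the NIC firmware actually advertises.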

Thanks Papaul for upgrading the firmware.

Unfortunately the same issue still happens.

Screenshot from 2024-01-23 17-46-23.png (605×866 px, 66 KB)

For the record here is the full DHCP exchange :

16:43:26.271747 IP (tos 0x0, ttl 64, id 43351, offset 0, flags [none], proto UDP (17), length 635)
    ae1-1017.cr2-eqiad.wikimedia.org.bootps > install1004.wikimedia.org.bootps: [udp sum ok] BOOTP/DHCP, Request from 00:62:0b:c8:9c:50 (oui Unknown), length 607, hops 1, xid 0xcc89c50, secs 4, Flags [Broadcast] (0x8000)
	  Gateway-IP ae1-1017.cr2-eqiad.wikimedia.org
	  Client-Ethernet-Address 00:62:0b:c8:9c:50 (oui Unknown)
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message (53), length 1: Discover
	    Parameter-Request (55), length 24: 
	      Subnet-Mask (1), Time-Zone (2), Default-Gateway (3), IEN-Name-Server (5)
	      Domain-Name-Server (6), RL (11), Hostname (12), BS (13)
	      Domain-Name (15), SS (16), RP (17), EP (18)
	      Vendor-Option (43), Server-ID (54), Vendor-Class (60), BF (67)
	      Unknown (128), Unknown (129), Unknown (130), Unknown (131)
	      Unknown (132), Unknown (133), Unknown (134), Unknown (135)
	    MSZ (57), length 2: 1260
	    GUID (97), length 17: 0.68.69.76.76.89.0.16.75.128.78.180.192.79.77.87.51
	    ARCH (93), length 2: 0
	    NDI (94), length 3: 1.2.1
	    Vendor-Class (60), length 32: "PXEClient:Arch:00000:UNDI:002001"
	    Agent-Information (82), length 56: 
	      Circuit-ID SubOption 1, length 41: asw2-a-eqiad:xe-4/0/45.0:private1-a-eqiad
	      Remote-ID SubOption 2, length 11: xe-4/0/45.0
	    END (255), length 0
	    PAD (0), length 0, occurs 154
	    Hostname (12), length 20: "M-i^P^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@"
	    PAD (0), length 0, occurs 22
	    Subnet-Mask (1), length 192:  [|bootp]
16:43:26.272267 IP (tos 0x0, ttl 64, id 39864, offset 0, flags [DF], proto UDP (17), length 469)
    install1004.wikimedia.org.bootps > ae1-1017.cr2-eqiad.wikimedia.org.bootps: [bad udp cksum 0x76b0 -> 0x2baa!] BOOTP/DHCP, Reply, length 441, hops 1, xid 0xcc89c50, secs 4, Flags [Broadcast] (0x8000)
	  Your-IP sretest1003.eqiad.wmnet
	  Server-IP install1004.wikimedia.org
	  Gateway-IP ae1-1017.cr2-eqiad.wikimedia.org
	  Client-Ethernet-Address 00:62:0b:c8:9c:50 (oui Unknown)
	  file "lpxelinux.0"
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message (53), length 1: Offer
	    Server-ID (54), length 4: install1004.wikimedia.org
	    Lease-Time (51), length 4: 43200
	    Subnet-Mask (1), length 4: 255.255.252.0
	    Default-Gateway (3), length 4: vrrp-gw-1017.eqiad.wmnet
	    Domain-Name-Server (6), length 4: recdns.anycast.wmnet
	    Domain-Name (15), length 11: "eqiad.wmnet"
	    RP (17), length 10: "/tftpboot/"
	    Vendor-Option (43), length 82: 209.25.112.120.101.108.105.110.117.120.46.99.102.103.47.116.116.121.83.49.45.49.49.53.50.48.48.210.53.104.116.116.112.58.47.47.97.112.116.46.119.105.107.105.109.101.100.105.97.46.111.114.103.47.116.102.116.112.98.111.111.116.47.98.111.111.107.119.111.114.109.45.105.110.115.116.97.108.108.101.114.47
	    Agent-Information (82), length 56: 
	      Circuit-ID SubOption 1, length 41: asw2-a-eqiad:xe-4/0/45.0:private1-a-eqiad
	      Remote-ID SubOption 2, length 11: xe-4/0/45.0
	    END (255), length 0

So option 93 is set to 0, as it should be.
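
The Vendor-Option (43) bytes in the Offer are easier to check once decoded. A minimal sketch, assuming the payload is a plain sequence of code/length/value sub-options with ASCII values (the byte string is copied from the tcpdump output above; the helper names are made up for illustration):

```python
from typing import Dict

# Option 43 payload from the DHCP Offer above, in tcpdump's dotted-decimal form.
OFFER_OPTION_43 = (
    "209.25.112.120.101.108.105.110.117.120.46.99.102.103.47.116.116.121.83."
    "49.45.49.49.53.50.48.48.210.53.104.116.116.112.58.47.47.97.112.116.46."
    "119.105.107.105.109.101.100.105.97.46.111.114.103.47.116.102.116.112.98."
    "111.111.116.47.98.111.111.107.119.111.114.109.45.105.110.115.116.97.108."
    "108.101.114.47"
)


def parse_dotted_bytes(text: str) -> bytes:
    """Convert tcpdump's '209.25.112...' notation into raw bytes."""
    return bytes(int(part) for part in text.split("."))


def decode_suboptions(payload: bytes) -> Dict[int, str]:
    """Walk code/length/value triples and return {code: value as text}."""
    result: Dict[int, str] = {}
    offset = 0
    while offset < len(payload):
        if offset + 2 > len(payload):
            raise ValueError("truncated sub-option header")
        code, length = payload[offset], payload[offset + 1]
        value = payload[offset + 2 : offset + 2 + length]
        if len(value) != length:
            raise ValueError(f"truncated value for sub-option {code}")
        result[code] = value.decode("ascii", errors="replace")
        offset += 2 + length
    return result


if __name__ == "__main__":
    for code, value in decode_suboptions(parse_dotted_bytes(OFFER_OPTION_43)).items():
        print(code, value)
```

Under that assumption the payload decodes to sub-option 209 = pxelinux.cfg/ttyS1-115200 and sub-option 210 = http://apt.wikimedia.org/tftpboot/bookworm-installer/, which are the option codes PXELINUX uses for the config file and the path prefix.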

I have no real idea where to look further. It could be worth following up with Dell/Broadcom support.

@Papaul @RobH what's the best path to follow up with Dell/Broadcom?

So today I wanted to install ml-staging2003. This is a new SMC hardware type, and it hits this problem.

I did some digging and discovered that we were using a very old version of pxelinux (v6.0.3 from 20150819), but Debian oldstable (Bullseye) is shipping 6.0.4 from 20200816.

Hoping that this might have been fixed, I temporarily replaced /srv/tftpboot/lpxelinux.0 with the newer version. Alas, this did not help. (Note: we're still planning on updating permanently, see T367970).

ayounsi (re)discovered the post at
http://marcoguerri.github.io/2016/03/20/pxeboot-failures-chelsio.html, which explores a similar problem on a different hardware/firmware combo. I hotpatched lpxelinux.0 to use the extremely cavalier "fix" at the bottom of that post, and this made the machine boot, albeit slowly.

Eventually, the install process completed, but naturally the hack I used isn't a fix: for example, we don't know if any of our other hw types would fail to install with it in place.

We need to report this to SMC/Dell and/or Broadcom, as it is clearly a firmware regression. One could argue it's a bug in PXELinux, but we have source code access to that and not to the firmware, so we can't verify either way.

For anyone who wants to build the above binary from the Debian sources:

$ mkdir tmp
$ cd tmp
$ apt source syslinux
$ cd syslinux-6.04~git20190206.bf6db5b4+dfsg1
$ dpkg-buildpackage # this may complain about missing deps for building, install them
[... build will take a while ...]
$ vi core/fs/pxe/isr.c
  • find pxe_isr_poll(), it should be around line 200
  • its last line is return isr.FuncFlag == PXENV_UNDI_ISR_OUT_OURS;
  • change that line to say return 1
  • close vi, run make (possibly with -j $(nproc))

The file ./bios/core/lpxelinux.0 is all you need. Copy it to the install server at /srv/tftpboot/ and you should be good.
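
The manual vi step above can also be scripted. A minimal sketch, assuming the return statement appears in core/fs/pxe/isr.c exactly as quoted in the steps; the helper name is made up, and the sample text is a stand-in, not the real function body. The helper refuses to patch unless it finds the line exactly once, so it fails loudly if the source differs.

```python
ORIGINAL_LINE = "return isr.FuncFlag == PXENV_UNDI_ISR_OUT_OURS;"
HACKED_LINE = "return 1;"

# Stand-in text for illustration only; this is NOT the real isr.c content.
SAMPLE_SOURCE = """\
/* stand-in for the end of pxe_isr_poll() */
    return isr.FuncFlag == PXENV_UNDI_ISR_OUT_OURS;
"""


def apply_isr_hack(source: str) -> str:
    """Replace the final return of pxe_isr_poll() with an unconditional 1.

    Raises ValueError unless the original line is present exactly once, so an
    unexpected or already-patched source file is never modified silently.
    """
    occurrences = source.count(ORIGINAL_LINE)
    if occurrences != 1:
        raise ValueError(f"expected exactly one match, found {occurrences}")
    return source.replace(ORIGINAL_LINE, HACKED_LINE)


if __name__ == "__main__":
    print(apply_isr_hack(SAMPLE_SOURCE))
    # Against a real source tree one would read and write the file instead,
    # for example with pathlib.Path("core/fs/pxe/isr.c").read_text() and
    # .write_text(), then run make as described above.
```

This only automates the edit; it is the same hack, with the same caveat that it has not been validated on other hardware types.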

Note that the file is managed by Puppet (from volatile), so you need to disable Puppet on that install server while you use the hacked file.

Hi @Papaul - can you add the Dell Support ticket that you created in this Phabricator task, and provide any updates/progress on how that's going? Thanks, Willy

Case 193419542 PXE Boot Issue | 4L70704 | R760XD2 | Debian

Hello Papaul,
I'm your case owner and primary point of contact through resolution of this issue.

Here are the best ways to contact me:

Email: shadab.akbar@dell.com (Preferred)

Direct Extension: 1-800-945-3355 x 6288028

My working hours: 12PM to 9PM CDT, Monday through Friday

I will contact you throughout the life of this case by email whenever possible.

Remember your satisfaction with our support is my responsibility; please inform me of any issues or concerns so I can address them immediately

Dell Tech Support
7:35 PM (2 hours ago)
to me

Hello Papaul,
According to the system BIOS, the Boot Mode is set to "Bios".
The Lifecycle log has the following entry that is repeated several times:
"The boot mode is set to BIOS. Dell recommends to use the UEFI boot mode with Secure Boot for better security and advanced features. The BIOS boot mode can support only a limited set of Input/Output devices and OSs and is being deprecated in the industry."

I recommend changing the boot mode to "UEFI" and enable the PXE device under network settings in Bios.
Let me know if this resolves the issue or not.

Reply to Dell this morning at 10:24 am CT:
Hello Shadab,

Thank you for pointing out the BIOS settings. We are aware of this and we are not yet ready to change to UEFI. However, we have +1k nodes in our environment that use the same BIOS settings without an issue. As I pointed out to you during our phone call, this issue is on 10G NICs and not on 1G NICs. We have to downgrade the 10G NIC firmware from version 22 to version 21.87 for PXE boot to work.

Got a call from Dell this afternoon at 18:27; the engineer working on this case said he will escalate the issue to level 3.

Papaul,
There is minimal PXE boot troubleshooting supported.
Is the card seen as a boot device with the new firmware?
If the card is seen on the firmware when trying to PXE boot then there is not much support we can offer in terms of troubleshooting.
If the card is not seen, then it would be a hardware issue, which is unlikely because the issue is occurring on multiple systems.

Sorry, I wish I could be more helpful.

Hello Shadab,
Thank you for your reply. Let us not focus on the hardware side because, as you mentioned, there is nothing wrong with the hardware. The issue is with the firmware on the hardware itself. As I mentioned before, when the card is at version 22 the server will not PXE boot, but when we downgrade the firmware on the card to 21.87, PXE boot works. If it were a hardware issue, version 21.87 would not work either.
Thanks. Regards

Hi Papaul,

My name is Eddie Vasquez and I am a Quality Coach for Dell Technologies. I apologize that PXE boot is not currently working when you update to the latest firmware version for the Broadcom NIC. By reverting the firmware to a functional version, we can confirm that there is no hardware problem. We will refer this matter to our hardware engineers who will conduct a thorough investigation. There may be delays but we will keep you posted and provide you with updates as they come along. If any information is needed other than the logs that you have provided, Shadab will engage and capture the logs. Please let me know if you have any questions or concerns.

Regards,

I got a reply from Supermicro today (I had been pestering them to pester Broadcom):

We just got a confirmation from Broadcom and they informed “ It is confirmed by our development team that Ipxelinux.0 is not supported on Broadcom devices. Customer has to use pxelinux.0 instead.”

So through that avenue, we won't get a fixed firmware, unfortunately.

Papaul claimed this task.

@klausman thank you for putting in the time and effort to contact the vendor on this issue. Since we now have a workaround in place that works for us, we can use it. I am resolving this task for now. Thanks