Supermicro: UEFI HTTP boot request hangs on cold boot
Open, Low, Public

Description

We configured our test server (a Supermicro SYS-110P-WTR), running the latest BIOS (2.1), to boot over UEFI HTTP.
The server boots and sends a DHCP request; the DHCP reply contains the "filename" option set to https://apt.wikimedia.org/efiboot/snponly.efi
This file is a vanilla iPXE image, the most recent build available from http://boot.ipxe.org/.
The issue is that the image download stalls and never completes. Attached is a packet capture of the stalled download, which seems to indicate a bug in the UEFI TCP stack.
After issuing a reboot, the download works fine; attached is a packet capture of the working exchange. Subsequent reboots will most likely work as well, but starting from a "cold boot" always gets stuck.

Supermicro ticket: #CYC-480-35337

Event Timeline

jhathaway triaged this task as Low priority.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm executed with errors:

  • sretest2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2001.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm executed with errors:

  • sretest2001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2001.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm executed with errors:

  • sretest2001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2001.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm executed with errors:

  • sretest2001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2001.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm completed:

  • sretest2001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501072319_jhathaway_1222859_sretest2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm executed with errors:

  • sretest2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2001.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm executed with errors:

  • sretest2001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2001.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm

After some testing, it appears this bug manifests itself only when DHCP serves a filename URL that requires DNS resolution.

  • failure: filename "http://apt.wikimedia.org/efiboot/snponly.efi";
  • success: filename "http://208.80.154.10/efiboot/snponly.efi";
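As a possible workaround sketch (not something the task itself prescribes), the filename URL could be rewritten to an IP literal before it is placed in the DHCP config. The helper name `pin_boot_url` and the injectable `resolve` parameter are assumptions made for illustration; only the two URLs come from the observations above. Note that an IP-literal URL changes the HTTP Host header, which worked here but may not on servers that rely on name-based virtual hosting.

```python
import socket
from urllib.parse import urlsplit, urlunsplit


def pin_boot_url(url, resolve=socket.gethostbyname):
    """Return `url` with its hostname replaced by an IPv4 literal.

    `resolve` is injectable so the rewrite can be exercised without
    touching real DNS; by default it uses the system resolver.
    """
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError(f"no host in URL: {url!r}")
    address = resolve(parts.hostname)
    # Keep an explicit port if the original URL carried one.
    netloc = address if parts.port is None else f"{address}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    # Hypothetical fixed resolver, standing in for a real DNS lookup.
    print(pin_boot_url(
        "http://apt.wikimedia.org/efiboot/snponly.efi",
        resolve=lambda host: "208.80.154.10",
    ))
```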

Steps to reproduce:

  1. Set up a DHCP config with a filename URL which requires a DNS lookup, e.g. filename "http://apt.wikimedia.org/efiboot/snponly.efi";
  2. Power off: stop /system1/pwrmgtsvc1
  3. Power on: start /system1/pwrmgtsvc1
  4. Select UEFI HTTP boot
  5. Download of snponly.efi hangs

If a DNS lookup is not required, the bug does not manifest itself. In addition, if a power reset is performed rather than a cold boot, the bug is not triggered.
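For scripting the two power paths, a rough sketch of the corresponding Redfish requests might look like the following. It only builds the request path and JSON body and sends nothing. The mapping is an assumption: the steps above use the SMASH-style stop/start commands, and treating a cold boot as `ForceOff` followed by `On`, and a power reset as `ForceRestart`, is an interpretation rather than something stated in the task. The system ID `"1"` and the function name `reset_requests` are likewise hypothetical.

```python
import json

# Standard Redfish action path; the system ID is an assumption and
# should be read from /redfish/v1/Systems on the actual BMC.
RESET_ACTION = "/redfish/v1/Systems/{system_id}/Actions/ComputerSystem.Reset"


def reset_requests(kind, system_id="1"):
    """Return a list of (path, json_body) POSTs for the given power path.

    kind == "cold": power fully off, then on (assumed to match stop/start).
    kind == "warm": restart without removing power (assumed "power reset").
    """
    if kind == "cold":
        reset_types = ["ForceOff", "On"]
    elif kind == "warm":
        reset_types = ["ForceRestart"]
    else:
        raise ValueError(f"unknown power path: {kind!r}")
    path = RESET_ACTION.format(system_id=system_id)
    return [(path, json.dumps({"ResetType": t})) for t in reset_types]


if __name__ == "__main__":
    for path, body in reset_requests("cold"):
        print("POST", path, body)
```

A real run would need a pause between the two cold-boot requests until the system reports that it is powered off.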

When the bug is triggered, no HEAD request is made for the DHCP filename URL before the GET on that URL, whereas in a successful boot a HEAD request is made first.
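The HEAD-before-GET observation lends itself to a simple check over the client's HTTP request methods in capture order. This is only a sketch: the function name is made up, and it assumes the method list has already been extracted from the pcap, for example with something like `tshark -r capture.pcap -Y http.request -T fields -e http.request.method` (the capture file name is hypothetical).

```python
def head_precedes_get(methods):
    """Return True if a HEAD request appears before the first GET.

    `methods` is the sequence of HTTP request methods sent by the
    client, in the order they appear in the capture.
    """
    for method in methods:
        if method == "HEAD":
            return True
        if method == "GET":
            # A GET arrived first: this matches the failing cold-boot case.
            return False
    # Neither method was seen at all.
    return False


if __name__ == "__main__":
    print(head_precedes_get(["HEAD", "GET"]))  # pattern seen on a working boot
    print(head_precedes_get(["GET"]))          # pattern seen when the bug triggers
```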

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm completed:

  • sretest2001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501081705_jhathaway_1407973_sretest2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2025-05-20T19:40:31Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: T383173

Mentioned in SAL (#wikimedia-operations) [2025-05-22T15:16:23Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: T383173

Mentioned in SAL (#wikimedia-operations) [2025-07-01T19:00:34Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: T383173

Mentioned in SAL (#wikimedia-operations) [2025-07-01T20:03:30Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: T383173

Mentioned in SAL (#wikimedia-operations) [2025-09-05T15:13:20Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: T383173

Mentioned in SAL (#wikimedia-operations) [2025-09-05T17:16:46Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on sretest2001.codfw.wmnet with reason: T383173

@jhathaway Moritz has kindly been able to drain ganeti2033 for us, so I think we can do the mirror port setup and use that host to take a full tcpdump.

@Papaul @Jhancock.wm would it be possible to connect the second SFP+ port on ganeti2033 with a 10G DAC to port xe-0/0/43 on lsw1-b7-codfw for us to do a quick test? It doesn't need to be documented in Netbox or anywhere else; it's just for a very quick capture. The test will only take a few hours, and we can remove the cable any time after that.

@cmooney sorry for the late reply. connection made.

Ah that's awesome Jenn thanks. We'll do our tests and advise when it can be removed, thanks!

Mentioned in SAL (#wikimedia-operations) [2025-09-22T16:10:54Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on sretest2001.codfw.wmnet with reason: T383173

@JennH we are all done with that now if you want to remove the cable. Thanks!

@jhathaway just to note down those observations from the pcaps yesterday. I don't think our conclusion will have changed.

In the capture where things don't work we observe:

  1. The host does not send the HTTP HEAD request, which it does in the other captures
    1. If the switch were somehow dropping this packet, normal TCP mechanisms should cause it to be resent
    2. We see nothing of the sort, nor the port dropping, so the only logical interpretation is that the client is not sending it
  2. The host does send a GET request, and our apt server starts sending the file in response to this
  3. The client begins to ACK the packets containing the image file; however, the TCP receive window it reports in those ACKs decreases steadily from 65k
    1. This indicates that while the TCP stack on the host is working, its buffer is likely filling up because the HTTP client is not properly reading the received bytes
    2. Once the host reports a receive window of 0 bytes, our apt server has to stop sending the file, and thus the download does not complete

It seems almost certain this is a bug in their HTTP client. Presumably they do the HEAD request to set up the file read at the application layer, and when this fails to happen the client does not properly initialise itself to read the downloaded file.
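The receive-window pattern in the failing capture can be checked mechanically. The sketch below assumes the client's ACKs have already been reduced to (time, window) pairs, for example from tshark's `frame.time_relative` and `tcp.window_size` fields. The function names are hypothetical, and the sample values are illustrative rather than taken from the attached pcaps.

```python
def first_zero_window(acks):
    """Return the time of the first ACK advertising a 0-byte window, or None.

    `acks` is an iterable of (time_seconds, window_bytes) pairs for the
    ACKs sent by the client, in capture order.
    """
    for time_seconds, window_bytes in acks:
        if window_bytes == 0:
            return time_seconds
    return None


def window_is_draining(acks):
    """True if the advertised window never grows and ends below its start.

    A window that only shrinks suggests the application is not reading
    from the socket buffer while the TCP stack keeps acknowledging data.
    """
    windows = [window for _, window in acks]
    if len(windows) < 2:
        return False
    never_grows = all(later <= earlier for earlier, later in zip(windows, windows[1:]))
    return never_grows and windows[-1] < windows[0]


if __name__ == "__main__":
    # Illustrative values only: a window shrinking from ~65k to zero.
    stalled = [(0.0, 65535), (0.1, 49152), (0.2, 16384), (0.3, 0)]
    print(window_is_draining(stalled), first_zero_window(stalled))
```

On a healthy download the window would be expected to reopen as the application consumes data, so `window_is_draining` would return False and `first_zero_window` would normally return None.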

Mentioned in SAL (#wikimedia-operations) [2025-10-20T20:13:29Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on sretest2001.codfw.wmnet with reason: T383173

Mentioned in SAL (#wikimedia-operations) [2025-10-23T16:05:05Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on sretest2001.codfw.wmnet with reason: T383173

It seems almost certain this is a bug in their HTTP client. Presumably they do the HEAD request to set up the file read at the application layer, and when this fails to happen the client does not properly initialise itself to read the downloaded file.

Yep, fully agree with that!

Mentioned in SAL (#wikimedia-operations) [2025-12-01T20:13:54Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on sretest2001.codfw.wmnet with reason: T383173