
Supermicro: unable to set boot order after using Redfish to boot once
Open, Low, Public

Description

We are having difficulty setting the UEFI boot order on our Supermicro hosts ("Supermicro X12DSC-A6 BIOS Date:07/11/2024 Rev:2.1"). On a normal boot we are able to add boot entries and set the boot order. However, when we issue a one-time boot via Redfish, changes to the boot order are lost:

Normal

  1. Boot once disabled
'BootSourceOverrideEnabled': 'Disabled'
  1. Boot into Linux
  1. Using efibootmgr add a new entry, which is added first in the boot order
  1. Reboot
  1. New boot order is preserved

Boot once

  1. Boot once set:
"BootSourceOverrideEnabled": "Once",
"BootSourceOverrideTarget": "Pxe", # or Hdd
"BootSourceOverrideMode": "UEFI",
  1. Boot into Linux
  1. Using efibootmgr add a new entry, which is added first in the boot order
  1. Reboot
  1. New boot order is lost, though new entry remains
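
To tell the two cases apart before rebooting, the override state can be read back from the system resource. A minimal sketch, assuming the `Boot` object has already been fetched (for example with a GET on `/redfish/v1/Systems/1`) and decoded from JSON; the helper name `override_state` and the sample responses are hypothetical and were not captured from these hosts:

```python
from typing import Optional, Tuple


def override_state(system: dict) -> Tuple[str, Optional[str]]:
    """Return (enabled, target) from a decoded Redfish ComputerSystem resource.

    Assumption: a missing Boot object or missing property is treated as
    "Disabled"; a real BMC may behave differently.
    """
    boot = system.get("Boot", {})
    enabled = boot.get("BootSourceOverrideEnabled", "Disabled")
    # The target only matters while an override is armed.
    target = boot.get("BootSourceOverrideTarget") if enabled != "Disabled" else None
    return enabled, target


# Hypothetical, trimmed responses for the "Boot once" and "Normal" cases.
armed = {
    "Boot": {
        "BootSourceOverrideEnabled": "Once",
        "BootSourceOverrideTarget": "Pxe",
        "BootSourceOverrideMode": "UEFI",
    }
}
cleared = {"Boot": {"BootSourceOverrideEnabled": "Disabled"}}

print(override_state(armed))
print(override_state(cleared))
```

Checking this before and after the reboot would show whether the BMC has cleared the one-time override by the time the boot order is lost.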

Ticket on Supermicro side: #FAV-941-81182

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-12-10T20:04:38Z] <jhathaway@cumin1002> START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be1088.eqiad.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2024-12-10T20:04:52Z] <jhathaway@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be1088.eqiad.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2024-12-10T20:28:08Z] <jhathaway@cumin1002> START - Cookbook sre.hosts.downtime for 4:00:00 on ms-be1088.eqiad.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2024-12-10T20:28:11Z] <jhathaway@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be1088.eqiad.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2024-12-10T22:54:18Z] <jhathaway@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on ms-be1088.eqiad.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2024-12-10T22:54:21Z] <jhathaway@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1088.eqiad.wmnet with reason: T381919

Per IRC conversation with @elukey, I just wanted to let y'all know that I successfully reimaged cloudelastic1012 just now. No Puppet 5, no CSR or any other errors.

Mentioned in SAL (#wikimedia-operations) [2025-01-30T20:27:51Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-01-30T21:57:36Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-01-31T17:52:10Z] <jhathaway@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on sretest1001.eqiad.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-01-31T17:59:46Z] <jhathaway@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on sretest1002.eqiad.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-01-31T19:59:34Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Supermicro X12DSC-A6 system:

  1. Boot into Linux
  2. Copy an EFI bootable image, e.g. memtest.efi into /EFI/memtest/memtest.efi
  3. Use efibootmgr to add memtest.efi as a boot entry and first in the boot order
  4. Reboot
  5. *system does not boot into memtest*

What appears to happen after rebooting, before step (5):

  1. The boot order is not preserved.

    In fact, it appears that the boot order cannot be altered through the operating system's UEFI interface or through the Redfish API.
  1. The boot order does not include hard drive entries directly, as it would on a typical UEFI system. Instead, adding "Hard Drive" to the fixed boot order delegates to the "UEFI Hard Disk Drive BBS Priorities" list to determine which UEFI boot entry to try.
  1. Entries must be manually added to the "UEFI Hard Disk Drive BBS Priorities" list?
  1. It is not possible to alter the "UEFI Hard Disk Drive BBS Priorities" from within Linux?
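
To check whether the order survived a reboot, the `BootOrder` line and the entry labels can be captured before and after and then compared. A minimal sketch, assuming the usual `efibootmgr` output layout (`BootOrder: XXXX,YYYY` followed by `BootXXXX* label` lines); the helper name `parse_efibootmgr` and the sample text, including the entry labels, are illustrative and not output from these hosts:

```python
import re
from typing import Dict, List, Tuple


def parse_efibootmgr(output: str) -> Tuple[List[str], Dict[str, str]]:
    """Extract the boot order and a {entry number: label} map from efibootmgr output."""
    order: List[str] = []
    labels: Dict[str, str] = {}
    for line in output.splitlines():
        if line.startswith("BootOrder:"):
            order = [n.strip() for n in line.split(":", 1)[1].split(",") if n.strip()]
            continue
        # Entry lines look like "Boot0005* memtest<TAB>HD(...)"; the "*" marks active entries.
        match = re.match(r"^Boot([0-9A-Fa-f]{4})\*?\s+(.*)$", line)
        if match:
            # With -v, the device path follows the label after a tab.
            labels[match.group(1)] = match.group(2).split("\t", 1)[0].strip()
    return order, labels


# Illustrative sample only.
SAMPLE = (
    "BootCurrent: 0003\n"
    "BootOrder: 0005,0003,0001\n"
    "Boot0001* debian\tHD(1,GPT)\n"
    "Boot0003* UEFI: ATEN Virtual CDROM\n"
    "Boot0005* memtest\tHD(1,GPT)/File(\\EFI\\memtest.efi)\n"
)

order, labels = parse_efibootmgr(SAMPLE)
print([labels[n] for n in order])
```

Running this on output saved before and after the reboot and comparing the two `order` lists would make the loss of the boot order explicit.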

Mentioned in SAL (#wikimedia-operations) [2025-02-20T20:10:54Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-02-20T22:06:21Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-02-21T17:25:44Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-02-21T20:31:06Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-02-21T22:31:46Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-02-24T21:07:59Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-02-25T17:37:29Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-02-25T20:25:20Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-02-25T22:04:44Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-02-25T23:27:18Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-02-26T17:26:44Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm executed with errors:

  • ms-be2088 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2088.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm executed with errors:

  • ms-be2088 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2088.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm executed with errors:

  • ms-be2088 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2088.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm executed with errors:

  • ms-be2088 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2088.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm executed with errors:

  • ms-be2088 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2088.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm executed with errors:

  • ms-be2088 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202502271549_jhathaway_3741735_ms-be2088.out, asking the operator what to do
    • First Puppet run failed and the operator aborted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2088.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm executed with errors:

  • ms-be2088 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2088.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bookworm executed with errors:

  • ms-be2088 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2088.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Current reproduction

Prerequisites

  1. Using virtual media support, connect a SystemRescue ISO (https://www.system-rescue.org/) hosted on an HTTP server
  1. Configure the ATEN Virtual CDROM to be first in the boot order
  1. Deploy the files needed to boot SystemRescue via HTTP boot on the HTTP server:
/srv/systemrescue/
└── x86_64
    ├── airootfs.sfs
    ├── amd_ucode.img
    ├── intel_ucode.img
    ├── sysresccd.img
    └── vmlinuz
  1. Configure DHCP:
host ms-be2088 {
    host-identifier option agent.circuit-id "lsw1-d7-codfw:xe-0/0/14.0:private1-d7-codfw";
    fixed-address 10.192.42.7;
    filename "http://208.80.154.10/efiboot-systemrescue/snponly.efi";
    option vendor-class-identifier "HTTPClient";
}
  1. Configure iPXE on the HTTP server:
/srv/efiboot-systemrescue/
├── autoexec.ipxe
└── snponly.efi

$ cat /srv/efiboot-systemrescue/autoexec.ipxe
#!ipxe

set extrabootoptions initrd=amd_ucode.img initrd=intel_ucode.img initrd=sysresccd.img ip=dhcp net.ifnames=0 BOOTIF=01-${netX/mac} console=tty0 console=ttyS1,115200n8
kernel http://apt.wikimedia.org/systemrescue/x86_64/vmlinuz archiso_http_srv=http://apt.wikimedia.org/ archisobasedir=systemrescue ${extrabootoptions} loglevel=3
initrd http://apt.wikimedia.org/systemrescue/x86_64/amd_ucode.img
initrd http://apt.wikimedia.org/systemrescue/x86_64/intel_ucode.img
initrd http://apt.wikimedia.org/systemrescue/x86_64/sysresccd.img

boot

UEFI Boot reproduction

  1. Boot off of the ATEN CDROM into SystemRescue
  1. List the current UEFI boot entries. Note: the ATEN Virtual CDROM loaded with SystemRescue is first in the boot order
$ efibootmgr -v | grep -i order
$ efibootmgr -v | grep -i aten
  1. Using Redfish, force a one-time UEFI HTTP boot into SystemRescue:
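# One-time boot override payload for the Redfish ComputerSystem resource.
http_boot = {
    "Boot": {
        "BootSourceOverrideMode": "UEFI",
        "BootSourceOverrideEnabled": "Once",
        "BootSourceOverrideTarget": "Pxe",
    }
}
# r is an already-authenticated Redfish client object; its setup is not shown here.
r.request("patch", "/redfish/v1/Systems/1", json=http_boot)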
  1. Reboot via UEFI HTTP boot into SystemRescue
  1. Manually add a UEFI boot entry for memtest and place it first in the boot order:
$ mount /dev/sda1 /mnt
$ cp /boot/memtest86+/memtest.efi /mnt/EFI/
$ umount /mnt
$ efibootmgr -c -d /dev/sda1 -L memtest -l '\EFI\memtest.efi' -v
  1. Reboot, expecting to boot into memtest; instead, the host boots back into SystemRescue
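
By default `efibootmgr -c` places the new entry first in `BootOrder`, but the order can also be set explicitly with `efibootmgr -o`. A minimal sketch of computing that order, assuming the entry numbers are already known; the helper names `promote_entry` and `bootorder_command` and the entry numbers are hypothetical, and the command is only built here, not executed:

```python
from typing import List


def promote_entry(order: List[str], entry: str) -> List[str]:
    """Return a new boot order with `entry` first and the rest in their original order."""
    return [entry] + [n for n in order if n != entry]


def bootorder_command(order: List[str]) -> List[str]:
    """Build the efibootmgr invocation that would write the given boot order."""
    return ["efibootmgr", "-o", ",".join(order)]


# Hypothetical entry numbers: 0003 = virtual CDROM, 0001 = disk, 0005 = memtest.
current = ["0003", "0001", "0005"]
wanted = promote_entry(current, "0005")
print(" ".join(bootorder_command(wanted)))
```

The resulting argument list could be passed to `subprocess.run(..., check=True)` as root from within SystemRescue; on the affected hosts the bug is that the order written this way does not survive the next reboot.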

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2088.codfw.wmnet with OS bullseye completed:

  • ms-be2088 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202502281603_jhathaway_213661_ms-be2088.out, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502281632_jhathaway_213661_ms-be2088.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2025-05-13T17:41:49Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-05-13T19:40:29Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-05-14T16:11:36Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Met with Brian Nguyen from Supermicro on a video call, where I demoed the reproducer:

  1. Booted off a virtual CDROM into SystemRescue and removed all hard drive boot entries. Then added an entry for Memtest and placed it first in the boot order. Rebooted, and Memtest booted as expected.
  2. Booted via UEFI HTTP Boot and removed all hard drive boot entries. Then added an entry for Memtest and placed it first in the boot order. Rebooted, and instead of booting into Memtest the machine booted off the virtual CDROM into SystemRescue, demonstrating the bug.

Brian said he would speak with the Redfish developers and then get back to me.

Mentioned in SAL (#wikimedia-operations) [2025-05-14T21:47:44Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-05-14T22:10:55Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-05-22T19:23:40Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be2088.codfw.wmnet with reason: T381919

Mentioned in SAL (#wikimedia-operations) [2025-05-22T21:13:04Z] <jhathaway@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be2088.codfw.wmnet with reason: T381919