
No disk boot option when moving ms-be2078 to UEFI
Closed, Resolved · Public

Description

Hi folks!

In the parent task we are investigating an issue while reimaging Supermicro Config J hosts with UEFI, and we thought to test one Dell equivalent to see if the issue was the same.

As part of this process, I tried provisioning on ms-be2078 and it failed, so I realized that there was a bug: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1194892

While testing the above patch, I found that the host, once switched to UEFI, no longer lists any HD/SATA/disk option in the boot sequence, only NICs. I checked it manually in the BIOS via the racadm console, and I can confirm the odd behavior. Is there anything you can think of to fix this problem?

Thanks in advance!

Event Timeline

elukey triaged this task as Medium priority. Mon, Oct 13, 2:18 PM

@elukey It looks like the drives will need to be wiped before you can convert to UEFI. If they were originally set up in BIOS mode, they will not be visible in UEFI.

@elukey FWIW, feel free to wipe these disks (the host isn't in the swift rings ATM).

@Jhancock.wm I am a bit confused about what to do. What do you mean by wiping in this case? Is there something to do in the BIOS menu, or something else?

IIUC this may be what is happening: the controller is configured to run with the disks in non-RAID mode when BIOS/Legacy is selected, and then (for some reason) after switching to UEFI it is not recognized/eligible to be picked up as a boot option.

Has this happened in the past? I am wondering if it may also be related to a firmware/iDRAC that needs to be upgraded.

I'm honestly not sure if it's a data-retention safeguard or a limitation of the hardware, but if you configure disks in BIOS mode, you have to clear the config and delete everything from the drives to change them from BIOS to UEFI. Everything I could find on switching between boot modes said that all the data would be lost. The first step was always "back up the data".

We really don't run into this issue that often because we don't normally change boot mode partway through a server's life. It's not a firmware limitation, but we should update the iDRAC regardless, since it is on an older version. I wouldn't expect that to fix the issue with the drives, though.

@Jhancock.wm if you have time, I'd like to ask for your help with trying both:

  • Clear the disk config: we can do it since the host is depooled. I have no idea what the procedure is, and I think you'd be way faster than me. Once we find the procedure we can document it, and anybody will be able to perform it on the other swift nodes when we migrate them to UEFI.
  • New firmware: I am not 100% sure how to get the Dell iDRAC+BIOS firmware for the R740xd2. I see that we have something in cumin2002's /srv/firmware, but I am not sure which one to select. Could you please select/download the right firmware on cumin2002 for an upgrade?

Thanks a lot :)

Can you please provide me with some context on what we are trying to do? The only thing I see in the task is that we are testing UEFI mode on the node.
1- Are we moving from Debian 11 to Debian 12?
2- What partman recipe are we using for testing?

I need the above info to troubleshoot.
Thanks

@Papaul the issue comes before Debian and partman, because when I try to provision the host there is no "hard-disk" option to set as primary boot source, only the PXE ones. In Legacy everything works as expected and I can see the SATA disks. I totally get that reimaging with UEFI will require different partman recipes etc., but what I'd expect from the boot menu is to find the PXE options plus an HDD/SATA/RAID/etc. option to select as primary/first boot option. I thought it was a problem with the cookbook, but this happens even if I manually flip the boot mode in the BIOS menu. It seems as if the controller is not usable after the flip to UEFI, for some reason that I can't explain.

(To answer the question: like all ms-* nodes, this will continue to be Debian 11 for now, although we might use it for a test install of Debian 13 before it's returned to service; the recipe is partman/custom/ms-be_simple-efi.cfg or partman/custom/ms-be_simple.cfg as appropriate for UEFI/BIOS booting.)

@Papaul @Jhancock.wm I went into System Setup (F2) -> Device -> Raid controller and used the erase function on both 480GB SSDs, cleared all controller caches etc.. Rebooted and the UEFI boot settings still listed only the NICs in the list of available boot options.

If I had to bet, we need to upgrade the iDRAC+BIOS firmware to get a decent UEFI implementation. So unless you have a better idea, I'd proceed with the upgrade and retest. I'd need the latest/correct firmware on cumin2002 first, and I am not sure how/where to retrieve it, so any help from DCops would be awesome :)

@elukey @MatthewVernon thank you, that was very helpful information. Now I can answer your question:
"In UEFI Boot Mode, fixed media (see Hard Disk items in the earlier section) may or may not be added to the boot sequence. Unlike legacy Boot Mode, in UEFI Boot Mode, the OS has the ability to add to and modify the boot sequence"

https://dl.dell.com/manuals/all-products/esuprt_solutions_int/esuprt_solutions_int_solutions_resources/dell-management-solution-resources_white-papers12_en-us.pdf

@Papaul this is true, the Debian installer is the one that eventually sets the proper boot disk, but on all other models we have a generic disk/SATA/RAID option in the System Settings, except on this one. I don't think it is just what you pointed out, but something more related to the BMC itself (this is why I am suggesting the firmware upgrade). Shall we try it and see if/what changes?

@elukey can you please point me to one of the nodes that is working as you described, so I can check what is different between that node and this one that is not working?

@Papaul this is the first Dell Config J that we flip to UEFI :)

@elukey I think the next step will be to try to install the OS without setting the boot disk and let the OS take care of it. Maybe this is one of the many cases where it is not possible to set the boot disk before the OS install.
Thanks.

While trying to use the firmware upgrade cookbook with "sudo cookbook sre.hardware.upgrade-firmware ms-be2078 --new" I get the first error below, so I have to run the cookbook passing the flag for each component.
"sudo cookbook sre.hardware.upgrade-firmware ms-be2078 -c bios --new" works only for the BIOS; when doing the same for the iDRAC I get the second error below.
Is it possible to look into the code and see why this is failing? In the meantime I was able to manually upgrade the iDRAC. Thanks

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 265, in _run
    raw_ret = runner.run()
              ^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 1083, in run
    failures += self._run_host(hostname)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 1113, in _run_host
    self.update_idrac(redfish_host, netbox_host)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 710, in update_idrac
    target_version, job_id = self._update(
                             ^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 612, in _update
    target_version, firmware_file = getattr(self, select_firmwarefile)(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 573, in _cached_select_firmwarefile
    return self._select_firmwarefile(*args, **kargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 567, in _select_firmwarefile
    return self.get_latest(product_slug, driver_type, driver_category)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 269, in get_latest
    product = self.dell_api.fetch(product_slug)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/__init__.py", line 327, in fetch
    raise DellAPIError("Unable to fetch dell drivers") from error
cookbooks.sre.hardware.DellAPIError: Unable to fetch dell drivers
Released lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2025-10-22 22:03:16.816269', 'owner': 'pt1979@cumin2002 [4095385]', 'ttl': 1800}
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2078']
Exception raised while executing cookbook sre.hardware.upgrade-firmware:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 265, in _run
    raw_ret = runner.run()
              ^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 1083, in run
    failures += self._run_host(hostname)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 1113, in _run_host
    self.update_idrac(redfish_host, netbox_host)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 710, in update_idrac
    target_version, job_id = self._update(
                             ^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 612, in _update
    target_version, firmware_file = getattr(self, select_firmwarefile)(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 573, in _cached_select_firmwarefile   
    return self._select_firmwarefile(*args, **kargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 568, in _select_firmwarefile
    return extract_version(selection), cast(Path, selection)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/__init__.py", line 53, in extract_version
    raise RuntimeError(f'unable to extract version from: {firmware_file}')
RuntimeError: unable to extract version from: /srv/firmware/poweredge-r740xd2/IDRAC/iDRAC_7.00.00.183_A00.exe
Released lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2025-10-22 22:32:43.721017', 'owner': 'pt1979@cumin2002 [4101938]', 'ttl': 1800}
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2078']
pt1979@cumin2002:/srv/firmware/poweredge-r740xd2/IDRAC$

Change #1194892 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: remove boot order config in UEFI for Dells

https://gerrit.wikimedia.org/r/1194892

Thanks a lot for the firmware upgrades! I'll check what's wrong with the cookbook; afaics it seems to be something related to the input file.

I checked in the Boot menu and there was no improvement after the firmware upgrades, so I modified the provision cookbook to avoid setting a disk option if none is found. Debian installed fine and the efibootmgr output looks perfect:

elukey@ms-be2078:~$ efibootmgr
BootCurrent: 0004
BootOrder: 0004,0001,0000
Boot0000* Embedded NIC 1 Port 1 Partition 1	VenHw(3a191845-5f86-4e78-8fce-c4cff59f9daa)
Boot0001* NIC in Slot 3 Port 1 Partition 1	VenHw(986d1755-b9d0-4f8d-a0da-d1db18672045)
Boot0002* Hard drive C:	VenHw(d6c0639f-c705-4eb9-aa4f-5802d8823de6)feff1800000000000000000000000104f00000c802000000cc0000c8a60100c800000000000000000000000000000000000000000000000000000000000000000000000000110000000000000000001c0002010c00d041030a010000000101060000020101060000007fff0400500045005200430020004800370033003000500020004d0069006e0069002800620075007300200031003800200064006500760020003000300029000000
Boot0003* BRCM MBA Slot AF00 v218.0.219.1	BBS(128,BRCM MBA Slot AF00 v218.0.219.1,0x0)feffaf00000000000000000000000200b00080cf80000000200180cf900080cf00000000000000000000000000000000000000000000000000000000000000000000000000120000020000000000001c0002010c00d041030a080000000101060000000101060000007fff04004200520043004d0020004d0042004100200053006c006f00740020004100460030003000200076003200310038002e0030002e003200310039002e0031000000
Boot0004* debian	HD(1,GPT,ab1314a1-8939-410f-83c0-e74ead64d3e5,0x800,0x79800)/File(\EFI\debian\grubx64.efi)
MirroredPercentageAbove4G: 0.00
MirrorMemoryBelow4GB: false

I generalized the change for the provision cookbook with https://gerrit.wikimedia.org/r/1194892, so we can skip setting the boot order option when provisioning with UEFI.
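As a rough way to double-check the boot order from a script, the output above can be parsed to see which entry comes first in BootOrder. This is only a sketch: the function names `parse_efibootmgr` and `first_boot_entry` are made up for illustration, the sample is a trimmed copy of the output above (the long hex tails of Boot0002 and Boot0003 are dropped), and it assumes the label and device path are separated by a tab, as they are in this output.

```python
import re

# Trimmed copy of the efibootmgr output above. The long hex tails of
# Boot0002 and Boot0003 are omitted here for readability.
SAMPLE = "\n".join([
    "BootCurrent: 0004",
    "BootOrder: 0004,0001,0000",
    "Boot0000* Embedded NIC 1 Port 1 Partition 1\tVenHw(3a191845-5f86-4e78-8fce-c4cff59f9daa)",
    "Boot0001* NIC in Slot 3 Port 1 Partition 1\tVenHw(986d1755-b9d0-4f8d-a0da-d1db18672045)",
    "Boot0002* Hard drive C:\tVenHw(d6c0639f-c705-4eb9-aa4f-5802d8823de6)",
    "Boot0003* BRCM MBA Slot AF00 v218.0.219.1\tBBS(128,BRCM MBA Slot AF00 v218.0.219.1,0x0)",
    "Boot0004* debian\tHD(1,GPT,ab1314a1-8939-410f-83c0-e74ead64d3e5,0x800,0x79800)"
    "/File(\\EFI\\debian\\grubx64.efi)",
])

# BootNNNN, optional '*' (active flag), label, then a tab and the device path.
ENTRY_RE = re.compile(r"^Boot([0-9A-Fa-f]{4})(\*?)\s+(.*?)\t(.*)$")


def parse_efibootmgr(text):
    """Return (boot_order, entries) parsed from efibootmgr-style text."""
    order = []
    entries = {}
    for line in text.splitlines():
        if line.startswith("BootOrder:"):
            raw = line.split(":", 1)[1]
            order = [num.strip() for num in raw.split(",") if num.strip()]
            continue
        match = ENTRY_RE.match(line)
        if match:
            num, active, label, path = match.groups()
            entries[num] = {"label": label, "active": active == "*", "path": path}
    return order, entries


def first_boot_entry(text):
    """Return the entry listed first in BootOrder, or None if it is unknown."""
    order, entries = parse_efibootmgr(text)
    if order and order[0] in entries:
        return entries[order[0]]
    return None


entry = first_boot_entry(SAMPLE)
print(entry["label"], entry["path"].startswith("HD("))
```

On this sample the first BootOrder entry is the `debian` one pointing at the GPT partition, followed by the two NIC entries; Boot0002 and Boot0003 exist but are not part of BootOrder.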

Thanks for the help @Papaul

Change #1194892 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: remove boot order config in UEFI for Dells

https://gerrit.wikimedia.org/r/1194892

While trying to use the firmware upgrade cookbook with "sudo cookbook sre.hardware.upgrade-firmware ms-be2078 --new" I get the first error below, so I have to run the cookbook passing the flag for each component.
"sudo cookbook sre.hardware.upgrade-firmware ms-be2078 -c bios --new" works only for the BIOS; when doing the same for the iDRAC I get the second error below.
Is it possible to look into the code and see why this is failing? In the meantime I was able to manually upgrade the iDRAC. Thanks

[..]

RuntimeError: unable to extract version from: /srv/firmware/poweredge-r740xd2/IDRAC/iDRAC_7.00.00.183_A00.exe

The cookbook expects this format for the IDRAC firmware file: 'IDRAC': r'(?P<version>(\d{1,2}\.){3}\d{1,2})_\w{3}$',

elukey@cumin2002:/srv/firmware/poweredge-r740xd2/IDRAC$ ls
iDRAC_7.00.00.183_A00.exe						iDRAC-with-Lifecycle-Controller_Firmware_T9J9H_WN64_6.10.30.20_A00.EXE
iDRAC-with-Lifecycle-Controller_Firmware_C8NT1_WN64_6.10.30.00_A00.EXE	iDRAC-with-Lifecycle-Controller_Firmware_VP556_WN64_7.00.00.183_A00.EXE

@Papaul iDRAC_7.00.00.183_A00.exe seems to have an odd name, different from all the other files. Should we just use iDRAC-with-Lifecycle-Controller_Firmware_VP556_WN64_7.00.00.183_A00.EXE? Not sure why there are two. Anyway, the file has a name that the cookbook cannot parse, so it fails.

Change #1198355 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hardware: improve Dell IDRAC's version pattern

https://gerrit.wikimedia.org/r/1198355

OK, so it turned out that the aforementioned file was just a test, but iDRAC-with-Lifecycle-Controller_Firmware_VP556_WN64_7.00.00.183_A00.EXE fails as well.

Should be fixed after https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1198355
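To illustrate why both 7.00.00.183 file names fail while the 6.10.30.x ones parse, here is a minimal sketch that applies the IDRAC pattern quoted above to the file names from the `ls` output. It assumes the pattern is searched against the file name with its extension stripped, which may not be exactly how the cookbook's `extract_version` applies it. `RELAXED_PATTERN` and `version_of` are hypothetical names for illustration, and the relaxed pattern is not taken from the merged patch.

```python
import re
from pathlib import Path

# Pattern quoted in the comment above for the 'IDRAC' firmware category.
IDRAC_PATTERN = re.compile(r'(?P<version>(\d{1,2}\.){3}\d{1,2})_\w{3}$')

# Hypothetical relaxed variant (NOT taken from the merged patch): allow a
# three-digit final component such as "183".
RELAXED_PATTERN = re.compile(r'(?P<version>(\d{1,2}\.){3}\d{1,3})_\w{3}$')

# File names from the ls output above.
FILES = [
    "iDRAC_7.00.00.183_A00.exe",
    "iDRAC-with-Lifecycle-Controller_Firmware_T9J9H_WN64_6.10.30.20_A00.EXE",
    "iDRAC-with-Lifecycle-Controller_Firmware_C8NT1_WN64_6.10.30.00_A00.EXE",
    "iDRAC-with-Lifecycle-Controller_Firmware_VP556_WN64_7.00.00.183_A00.EXE",
]


def version_of(name, pattern):
    """Return the version found in the file name (extension stripped), or None."""
    match = pattern.search(Path(name).stem)
    return match.group("version") if match else None


for name in FILES:
    strict = version_of(name, IDRAC_PATTERN)
    relaxed = version_of(name, RELAXED_PATTERN)
    print(f"{name}: strict={strict} relaxed={relaxed}")
```

Under these assumptions the strict pattern returns no version for either 7.00.00.183 file, because `\d{1,2}` cannot match the three-digit final component "183", regardless of the file name prefix.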

Change #1198355 merged by Elukey:

[operations/cookbooks@master] sre.hardware: improve Dell IDRAC's version pattern

https://gerrit.wikimedia.org/r/1198355

elukey claimed this task.