
UEFI installer not installing grub correctly (at least on systems where / is RAID)
Closed, ResolvedPublic

Description

I've been trying to install trixie onto sretest2010 (which was set up in T394357), and one of the problems I'm finding is that the installer isn't installing grub correctly, leading to a system that can't boot (or boots back into the installer). I also found this with another ms-be node (ms-be1083). These hosts use LVM RAID1 for /.

The failure mode is that on reimage the node reboots after the installer has completed and fails into the grub rescue mode with an error like:

error: disk `mduuid/3207fa1071e844ffdc954a0ec74fddbd' not found.

The problem is that the mduuid is from a previous install. Alternatively, if you've wiped enough disks correctly (the key thing being to make sure the first partition of both SSDs gets blanked), then after the first install the system will attempt to boot from disk, fail, and boot back into the installer - and succeed after that.

As best I can tell, the installation is not correctly ensuring that the first (/boot/efi) partition on both SSDs is written to (not surprising, I guess, given only one of them gets mounted), so if the installer writes onto the "wrong" SSD and the system boots off the other one, then it has the wrong mduuid embedded.

When watching the installer, it does say that it's doing "grub-install sdm sdn" or similar, so it _ought_ to be attempting to write to both disks. Likewise, if you manage to get one of these systems to boot from the rescue prompt and then run grub-install from the booted system, it then seems to work reliably.

It's not a problem on BIOS-booted systems (I think because there isn't a mounted /boot/efi involved?), but it is going to be a real problem if/when we start trying to reimage a bunch of these swift backends that boot via UEFI.

Event Timeline

The host doesn't PXE/HTTP boot for some reason, I reopened the provision task in T394357#11184292.

Does /boot even need to be on a separate partition for UEFI booting?

> Does /boot even need to be on a separate partition for UEFI booting?

No; however, the EFI System Partition (ESP) does need to be a separate partition with a FAT32 filesystem. The EFI firmware searches each drive for such a partition to discover EFI boot files. Debian only installs Grub on the ESP, so Grub in turn needs to be able to read the Linux kernel out of /boot. Grub does not care whether /boot is a separate partition or co-mingled with /; the main requirement is that the partition's filesystem is supported by Grub.
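As a concrete illustration, the ESPs are the vfat partitions typed/labelled "EFI System Partition" (the firmware actually matches on the GPT partition type GUID; grepping the `blkid` label here is only a sketch, using sample lines copied from ms-be1086 later in this task):

```shell
# Sample `blkid` output for the two system disks (copied verbatim from this
# task); the firmware scans for partitions like these to find EFI boot files.
blkid_out='/dev/sda1: UUID="B2CC-32D0" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="2b78ca7b-2ff2-443e-bb43-3bcc6db6dfbd"
/dev/sdb1: UUID="B2CB-38A7" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="96277097-e3e2-4a08-be8a-93ae18d50c4c"'

# List the device nodes that look like ESPs:
printf '%s\n' "$blkid_out" | grep 'PARTLABEL="EFI System Partition"' | cut -d: -f1
```

Note both drives carry an ESP-labelled partition; whether both are populated is a separate question, which is the crux of this task.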

> The host doesn't PXE/HTTP boot for some reason, I reopened the provision task in T394357#11184292.

I spent some time trying to debug the woes with this host, but the behavior is very strange.

Things I tried:

  1. Reset Bios to optimized defaults
  2. Re-installed the same version of the Bios, while discarding all settings except SMBIOS
  3. Issued a cold reset to the BMC

But none of my actions changed the behavior of the BMC; notably, issuing a reset /system1/pwrmgtsvc1 or a stop /system1/pwrmgtsvc1 command does not seem to have any effect.

To recap, it seems that we have two problems:

  1. For some mysterious reason, sretest2010 seems to have stopped working correctly at the BMC level (resets not happening, etc.). This is not great, since we cannot easily test reimages, so we need to fix this problem first. Let's keep all BMC-related investigations in T394357.
  2. I had no problems installing the OS in T394357; the host was set up with the standard-efi + raid1-2dev-efi configs before https://gerrit.wikimedia.org/r/c/operations/puppet/+/1185973. And I don't recall this issue happening in any of the previous UEFI installs, so I am wondering if it is, for some reason, related to the partman early command?
elukey triaged this task as Medium priority. Sep 29 2025, 2:51 PM

@elukey re the triage priority - if there's a problem with our standard UEFI setup for re-imaging ms* nodes, it's going to be a real pain for any OS upgrade (which is getting urgent given we're still on bullseye...). Which is not to say the problem doesn't lie with something I wrote in the partman setup!

> @elukey re the triage priority - if there's a problem with our standard UEFI setup for re-imaging ms* nodes, it's going to be a real pain for any OS upgrade (which is getting urgent given we're still on bullseye...). Which is not to say the problem doesn't lie with something I wrote in the partman setup!

@MatthewVernon I totally agree; what I meant is to find a way to narrow down the possible source of the problem, not to dismiss your request :) Basically I'd like to test the host with its previous "standard" recipe again (once the hardware works) to figure out whether I missed the problem just by luck, or whether it can be reproduced with standard recipes too. We'll find a solution; reimaging in this condition is not great and painful.

Change #1190674 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: re-add 2 nodes, drain the final 2, leave 1 for testing

https://gerrit.wikimedia.org/r/1190674

Change #1190674 merged by MVernon:

[operations/puppet@production] swift: re-add 2 nodes, drain the final 2, leave 1 for testing

https://gerrit.wikimedia.org/r/1190674

A couple of notes, so I have a record of what I've done, and in case they're of any help!

I've just re-imaged ms-be1086 and ms-be1087 (both UEFI), and blanked the partition mounted as /boot/efi before reimage (which subsequently proceeded without problems). In both cases, after reimage there is a partition labelled as EFI System Partition on both system disks, e.g.:

mvernon@ms-be1086:~$ sudo blkid /dev/sda1 /dev/sdb1
/dev/sda1: UUID="B2CC-32D0" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="2b78ca7b-2ff2-443e-bb43-3bcc6db6dfbd"
/dev/sdb1: UUID="B2CB-38A7" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="96277097-e3e2-4a08-be8a-93ae18d50c4c"

But the not-mounted partition has an empty filesystem (i.e. you can mount it, but it has nothing in it). When watching the reimage, the installer does say something to the effect of "Running grub-install /dev/sda /dev/sdb" towards the end of the install process, but the result is seemingly an empty filesystem.
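The "empty twin" check described above can be scripted: an ESP that mounts cleanly but contains no files was never written by the installer. A sketch, with temp directories standing in for the real ESP mount points (an assumption of this illustration):

```shell
# Report whether an (assumed-mounted) ESP directory contains any files.
check_esp() {
  if [ -z "$(find "$1" -mindepth 1 -type f 2>/dev/null)" ]; then
    echo "$1: EMPTY (never written by the installer)"
  else
    echo "$1: populated"
  fi
}

# Simulate the observed state: one populated ESP, one blank twin.
mkdir -p /tmp/esp_demo_a/EFI/debian /tmp/esp_demo_b
: > /tmp/esp_demo_a/EFI/debian/grubx64.efi
check_esp /tmp/esp_demo_a
check_esp /tmp/esp_demo_b
```

On a real host you would mount the second disk's first partition somewhere like /mnt and run the same check against /boot/efi and /mnt.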

I looked at a system with standard-efi.cfg and raid1-2dev-efi.cfg - an-test-coord1002. As expected, it has /dev/sdb2 mounted as /boot/efi and also an EFI System Partition on /dev/sda2. I mounted it and compared the contents:

mvernon@an-test-coord1002:~$ sudo ls -l /mnt/EFI/debian/grubx64.efi
-rwxr-xr-x 1 root root 167936 Aug 21 22:20 /mnt/EFI/debian/grubx64.efi
mvernon@an-test-coord1002:~$ sudo ls -l /boot/efi/EFI/debian/grubx64.efi
-rwx------ 1 root root 167936 Aug 22 17:25 /boot/efi/EFI/debian/grubx64.efi
mvernon@an-test-coord1002:~$ sudo md5sum /mnt/EFI/debian/grubx64.efi
a2119e99fceafce1de3488c5ddbde073  /mnt/EFI/debian/grubx64.efi
mvernon@an-test-coord1002:~$ sudo md5sum /boot/efi/EFI/debian/grubx64.efi
92f592110127ebea4829165012cff37e  /boot/efi/EFI/debian/grubx64.efi

The MOTD on this system says "Debian GNU/Linux 12 auto-installed on Fri Aug 22 17:25:47 UTC 2025.", which I think tells us that one ESP was set up during the install, and the other was done later. This would tend to support the theory that the installer is not currently writing the new EFI System Partition to both system disks.
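The manual md5 comparison above can also be scripted: checksum every file in each ESP tree and diff the sorted manifests. A sketch, with temp directories standing in for /boot/efi and the second ESP mounted at /mnt:

```shell
# Produce a sorted checksum manifest for an ESP tree.
esp_manifest() { (cd "$1" && find . -type f -exec md5sum {} + | sort); }

# Simulate two ESPs whose grubx64.efi differs (as seen on an-test-coord1002).
mkdir -p /tmp/espdemo1/EFI/debian /tmp/espdemo2/EFI/debian
echo old > /tmp/espdemo1/EFI/debian/grubx64.efi
echo new > /tmp/espdemo2/EFI/debian/grubx64.efi

esp_manifest /tmp/espdemo1 > /tmp/esp_m1
esp_manifest /tmp/espdemo2 > /tmp/esp_m2
diff /tmp/esp_m1 /tmp/esp_m2 >/dev/null || echo "ESPs differ"
```

A clean diff means the two ESPs are in sync; any difference flags the stale or divergent copy described above.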

I had a look for obvious differences between the ms-be preseeding and standard-efi+raid1-2dev-efi setups. raid1-2dev-efi sets d-i grub-installer/only_debian boolean false with a comment referring to Debian #666974 which is long-closed. And the partitioning makes a small biosgrub partition (which I don't think is necessary any more).

I also found a couple of notes on archwiki and Debian wiki on the issues with EFI and systems doing software RAID1 for their system disks.

Finally, given the current problems with sretest2010 (T394357), I've delayed returning ms-be1088 to service so @elukey can do some more investigations with it.

@MatthewVernon thanks for the write-up! FYI Jesse is working on T376949, which should address your concerns about the EFI partition not being replicated. The thing that I don't get is why you see the "error: disk `mduuid/3207fa1071e844ffdc954a0ec74fddbd' not found." error, because we never really got anything like that before.

My best theory on that is that one install run writes the EFI files to one disk (embedding the UUID), then a subsequent install run writes to the other disk (embedding the new UUID), leaving you with two EFI partitions for the firmware to "pick" between at boot, differing in the UUID they are looking for.

Trying to summarize the problem:

  • We know that the Debian installer doesn't copy the EFI partition to all the disks in a sw RAID setup. We have opened T376949, since so far the only issue we had arose from disk failures (the disk with the populated EFI partition breaks, and the system can't boot from the other one).
  • I checked dse-k8s-worker1014, which runs with raid1-2dev-efi.cfg, and the non-mounted EFI partition on the other disk is not populated. So the second ESP on an-test-coord1002 (mentioned above) was probably populated manually by someone.
  • I checked with Matthew and this issue is not always reproducible, sometimes it happens, sometimes things go fine.

I was also interested by:

> The problem being that the mduuid is from a previous install. Alternatively, if you've wiped enough disks correctly (the key thing being to make sure that first partition of the two SSDs gets blanked) then after the first install, the system will attempt to boot from disk, fail, and boot back into the installer - and then succeed after that.

This seems to me a special case of the main one reported, since the wipe leads to a cleaner boot failure and triggers another PXE install. Shouldn't we have seen this issue more broadly across our fleet? It doesn't seem to be specific to some hosts, unless the disk controller model of the swift hosts plays a role during boot.

Mentioned in SAL (#wikimedia-operations) [2025-10-08T15:16:57Z] <elukey> reboot ms-be1088 as a test for T404356

I checked ms-be1088's boot properties and the disk boot option is debian(SATA,Port:0), that IIUC is being set by the Debian installer. It would be interesting to inspect this value when the issue occurs, to understand if it changed or not.

Matthew told me that ms-be2078 can be used for testing the reimage with UEFI, it is a Dell node with Legacy settings (so it needs to be reprovisioned, and its partman recipe needs to be updated).

Change #1194880 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] preseed: set ms-be2078 for UEFI

https://gerrit.wikimedia.org/r/1194880

Change #1194880 merged by Elukey:

[operations/puppet@production] preseed: set ms-be2078 for UEFI

https://gerrit.wikimedia.org/r/1194880

Tests on ms-be2078 are blocked by T406964 :(

While checking the BIOS/etc. settings for ms-be2078 (Dell), I noticed that the RAID controller's config utility has a specific mention of which disk is marked as boot device (serial port combination), whereas I didn't find the same thing on ms-be1088 (Supermicro). I tried to follow T371400#10279452: I found a SAS 3816 config utility but didn't manage to get to the same level of detail, so there may be something that I am missing.

The next step is to test multiple reimages on ms-be2078 and see if we can repro, I have the feeling that what Matthew reported is a Supermicro-specific problem.

> The next step is to test multiple reimages on ms-be2078 and see if we can repro, I have the feeling that what Matthew reported is a Supermicro-specific problem.

Finally, ms-be2078 is running trixie on UEFI. I didn't manage to reproduce the issue; @MatthewVernon could you try to trigger it, so we can see whether it reproduces or not? I am not sure how often the issue pops up, but so far I haven't seen it happen. If this holds, the problem can be confined to Supermicros :)

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS bookworm completed:

  • ms-be2078 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510241053_mvernon_346972_ms-be2078.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS bookworm completed:

  • ms-be2078 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510241149_mvernon_358839_ms-be2078.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS bookworm completed:

  • ms-be2078 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510241304_mvernon_376462_ms-be2078.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS trixie executed with errors:

  • ms-be2078 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510241429_mvernon_396040_ms-be2078.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2078.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS trixie executed with errors:

  • ms-be2078 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510241517_mvernon_407154_ms-be2078.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2078.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS bookworm completed:

  • ms-be2078 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510241609_mvernon_421983_ms-be2078.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

I've not been able to reproduce the boot failure (except by cheating), but the underlying issue remains: the installer writes the EFI System Partition onto only one of the two OS disks, and doesn't touch the equivalent partition on the other. So as long as drive ordering (as seen by the installer) is consistent, everything is fine. We've learned in the past that this isn't something to rely upon.

I reimaged ms-be2078 a bunch of times; every time the result is that, of the two system disks, partition 1 of one of them gets the EFI System Partition and the other remains blank. If, though, I copy the contents across, then you end up with two EFI System Partitions, both of which are bootable:

mvernon@ms-be2078:~$ efibootmgr
BootCurrent: 0004
BootOrder: 0004,0001,0000,0008
Boot0000* Embedded NIC 1 Port 1 Partition 1     VenHw(3a191845-5f86-4e78-8fce-c4cff59f9daa)
Boot0001* NIC in Slot 3 Port 1 Partition 1      VenHw(986d1755-b9d0-4f8d-a0da-d1db18672045)
Boot0002* Hard drive C: VenHw(d6c0639f-c705-4eb9-aa4f-5802d8823de6)feff1800000000000000000000000104f00000c802000000cc0000c8a60100c800000000000000000000000000000000000000000000000000000000000000000000000000110000000000000000001c0002010c00d041030a010000000101060000020101060000007fff0400500045005200430020004800370033003000500020004d0069006e0069002800620075007300200031003800200064006500760020003000300029000000
Boot0003* BRCM MBA Slot AF00 v218.0.219.1       BBS(128,BRCM MBA Slot AF00 v218.0.219.1,0x0)feffaf00000000000000000000000200b00080cf80000000200180cf900080cf00000000000000000000000000000000000000000000000000000000000000000000000000120000020000000000001c0002010c00d041030a080000000101060000000101060000007fff04004200520043004d0020004d0042004100200053006c006f00740020004100460030003000200076003200310038002e0030002e003200310039002e0031000000
Boot0004* debian        HD(1,GPT,0db125c3-48cd-46ef-8e21-adaaa2f5a933,0x800,0x79800)/File(\EFI\debian\grubx64.efi)
Boot0008* debian        HD(1,GPT,7a025aa1-92cf-4f77-9c69-35ff163f80ed,0x800,0x79800)/File(\EFI\debian\grubx64.efi)
MirroredPercentageAbove4G: 0.00
MirrorMemoryBelow4GB: false

Here Boot0004 is the correct drive, and Boot0008 is the other system drive. If you reimage, the contents of Boot0008 remain the same.

If you then wipe the filesystem of the normally-comes-first drive (0004) and reboot, that reproduces the failure we see on some SM nodes - the system boots off the remaining EFI partition, but GRUB is looking for the mduuid of an old install, so can't proceed.
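The HD() node in each debian boot entry carries the ESP's PARTUUID, which is what lets us match boot entries to drives. A sed sketch over the two entries quoted above:

```shell
# The two `debian` entries from the efibootmgr output above (copied verbatim);
# extract the PARTUUID from each HD() node, to be matched against `blkid`.
entries='Boot0004* debian        HD(1,GPT,0db125c3-48cd-46ef-8e21-adaaa2f5a933,0x800,0x79800)/File(\EFI\debian\grubx64.efi)
Boot0008* debian        HD(1,GPT,7a025aa1-92cf-4f77-9c69-35ff163f80ed,0x800,0x79800)/File(\EFI\debian\grubx64.efi)'

printf '%s\n' "$entries" | sed -n 's/.*HD([0-9]*,GPT,\([0-9a-f-]*\),.*/\1/p'
```

Comparing these PARTUUIDs against `blkid` output for sda1/sdb1 (or sdz1 below) tells you which physical drive each firmware boot entry points at.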

To illustrate further, from a couple of subsequent installs:

Boot0004* debian        HD(1,GPT,f12aadb3-0439-43bb-9d6b-3a0293e78e40,0x800,0x79800)/File(\EFI\debian\grubx64.efi)
Boot0008* debian        HD(1,GPT,a8691369-4128-4672-a285-74eca9f86751,0x800,0x79800)/File(\EFI\debian\grubx64.efi)
#mount 0008 as /mnt, md5:
72341ba53b9ba2fc285dbeaf122f74a3  /mnt/EFI/debian/grubx64.efi

And then on next install

Boot0004* debian        HD(1,GPT,0db125c3-48cd-46ef-8e21-adaaa2f5a933,0x800,0x79800)/File(\EFI\debian\grubx64.efi)
Boot0008* debian        HD(1,GPT,7a025aa1-92cf-4f77-9c69-35ff163f80ed,0x800,0x79800)/File(\EFI\debian\grubx64.efi)

# sdz1 still not the active one, mount it to /mnt and still:
72341ba53b9ba2fc285dbeaf122f74a3  /mnt/EFI/debian/grubx64.efi
# PARTUUID is different from install above
/dev/sdz1: UUID="9C09-5C1F" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="7a025aa1-92cf-4f77-9c69-35ff163f80ed"

So we see that whilst the installer has written a new partition table to /dev/sdz (hence the new PARTUUID), it never actually touches the filesystem on the not-picked OS drive.

This is the same problem as I observed on the SM systems, but it seems not to cause problems with booting/reimaging in practice on the Dell system, as the "correct" drive is consistently appearing first in the EFI BootOrder.

[I've left ms-be2078 imaged to bookworm, as trixie is uninstallable due to puppet failures due to missing software]

@MatthewVernon thanks a lot for the tests! So I see two issues in this task:

  1. The Debian installer doesn't duplicate the ESP onto both disks when using RAID for the OS/root. This is currently tracked in T376949; Jesse filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082288 but we didn't push for it until now. I think we could find consensus and apply that solution to improve the current state, and then think about alternatives (if we need/feel so in the future).
  2. Our dear Supermicro Config J seems to show a different behavior from its Dell counterpart, namely the controller (??? still unclear) seems not to respect what d-i sets as the first boot device in the EFI boot ordering.
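Fix 1) essentially amounts to copying the populated ESP onto its empty mirror so that either disk can boot. A hedged sketch of that idea (the actual Gerrit change above may take a different approach, e.g. a grub/dpkg hook; temp directories stand in for the real mounts):

```shell
# Copy a populated ESP onto an empty twin; refuse to clobber a target that
# already has content, since diverging ESPs are the failure mode in this task.
sync_esp() {
  src=$1 dst=$2
  if [ -n "$(find "$dst" -mindepth 1 -type f 2>/dev/null)" ]; then
    echo "target $dst not empty, refusing to sync" >&2
    return 1
  fi
  cp -a "$src"/. "$dst"/
}

# Simulated populated source and blank destination:
mkdir -p /tmp/espsync_src/EFI/debian /tmp/espsync_dst
echo grub > /tmp/espsync_src/EFI/debian/grubx64.efi
sync_esp /tmp/espsync_src /tmp/espsync_dst
ls /tmp/espsync_dst/EFI/debian
```

The refuse-if-not-empty guard is a deliberate design choice for the sketch: blindly overwriting the second ESP would hide exactly the stale-mduuid situation being debugged here.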

If you agree with the analysis, I'd keep this task for 2) and track 1) in T376949. What do you think?

After a chat with Jesse, it's possible that this bug is a variant of what we have been chasing in T381919. The TL;DR is that during a reimage we use a Redfish feature to "boot once" via UEFI HTTP to trigger the Debian install, but a Supermicro host can enter a state where the EFI boot settings applied by the Debian installer are not preserved after the reboot. The consequence we have seen so far is that the host boots into UEFI HTTP again after the Debian install and then reboots again, ending up in another install and eventually in the reimage cookbook getting stuck when trying to run puppet (since the certs no longer match, etc.). Another reimage usually fixes the issue. We have been working with Supermicro to fix this problem at the firmware level, but so far we haven't got anything concrete to test.

It may be possible that this same issue triggers the problem that Matthew outlined in this task on Config J hosts.

There are still some provisioning issues for sretest2010 (see T394357) but I was able to PXE-boot two times and install/reboot Trixie without seeing the issue mentioned in the task's description. I'll keep doing more reimages, but so far it seems that the issue is confined to Supermicro Config J hosts running the swift role (and affected by the partman early command).

> There are still some provisioning issues for sretest2010 (see T394357) but I was able to PXE-boot two times and install/reboot Trixie without seeing the issue mentioned in the task's description. I'll keep doing more reimages, but so far it seems that the issue is confined to Supermicro Config J hosts running the swift role (and affected by the partman early command).

Tried to reimage again, there are some HTTP boot issues that we are trying to solve in T394357 but once d-i is triggered I don't see any issue after the reboot. At this point I'd like to get a Supermicro Config J host among the ms-be ones to try some reimages and also a reimage with a specific config for the host to skip configure_swift_disks() in d-i's early_partman settings. @MatthewVernon is there an ms-be host that I can use for tests?

>> There are still some provisioning issues for sretest2010 (see T394357) but I was able to PXE-boot two times and install/reboot Trixie without seeing the issue mentioned in the task's description. I'll keep doing more reimages, but so far it seems that the issue is confined to Supermicro Config J hosts running the swift role (and affected by the partman early command).
>
> Tried to reimage again, there are some HTTP boot issues that we are trying to solve in T394357 but once d-i is triggered I don't see any issue after the reboot. At this point I'd like to get a Supermicro Config J host among the ms-be ones to try some reimages and also a reimage with a specific config for the host to skip configure_swift_disks() in d-i's early_partman settings. @MatthewVernon is there an ms-be host that I can use for tests?

If there are no hosts that can be used for testing, I'll try to set something up with sretest2010; a little more painful at the moment, but OK.

> Tried to reimage again, there are some HTTP boot issues that we are trying to solve in T394357 but once d-i is triggered I don't see any issue after the reboot. At this point I'd like to get a Supermicro Config J host among the ms-be ones to try some reimages and also a reimage with a specific config for the host to skip configure_swift_disks() in d-i's early_partman settings. @MatthewVernon is there an ms-be host that I can use for tests?

Hi @elukey ms-be1088 is already reserved for you to use for this purpose.

>> Tried to reimage again, there are some HTTP boot issues that we are trying to solve in T394357 but once d-i is triggered I don't see any issue after the reboot. At this point I'd like to get a Supermicro Config J host among the ms-be ones to try some reimages and also a reimage with a specific config for the host to skip configure_swift_disks() in d-i's early_partman settings. @MatthewVernon is there an ms-be host that I can use for tests?
>
> Hi @elukey ms-be1088 is already reserved for you to use for this purpose.

Perfect thanks, I wanted to be sure to avoid messing up with production!

>> Tried to reimage again, there are some HTTP boot issues that we are trying to solve in T394357 but once d-i is triggered I don't see any issue after the reboot. At this point I'd like to get a Supermicro Config J host among the ms-be ones to try some reimages and also a reimage with a specific config for the host to skip configure_swift_disks() in d-i's early_partman settings. @MatthewVernon is there an ms-be host that I can use for tests?
>
> Hi @elukey ms-be1088 is already reserved for you to use for this purpose.

@elukey while I'm at it, you also have a Dell Config-J system for testing (ms-be2078, T406964); are you finished with that host now? It's fine if you still want it, I just don't want to forget about it :)

> @elukey while I'm at it, you also have a Dell Config-J system for testing (ms-be2078, T406964); are you finished with that host now? It's fine if you still want it, I just don't want to forget about it :)

Yeah I am done with it!

I tried to reimage ms-be1088 three times and everything worked as expected, without issue. I had a chat with Matthew: reproducing the issue seemed to vary between hosts, so it is not something that happens every time.

@MatthewVernon I am not able to reproduce on ms-be1088, so at this point you can probably finish your maintenance on it and put it back into production. I spoke with Jesse yesterday and we are planning to roll out https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082288 soon, so the missing sync between the EFI partitions in a sw RAID setup should hopefully improve after it.

For the other issue, the grub mduuid problem, I'd restart work on it the next time you face it, so we'll hopefully have an easier repro host to work with. Let me know what you think.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1088.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1088.eqiad.wmnet with OS bullseye completed:

  • ms-be1088 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202512021529_mvernon_393035_ms-be1088.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1214100 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: restore ms-be1088 to the rings

https://gerrit.wikimedia.org/r/1214100

Change #1214100 merged by MVernon:

[operations/puppet@production] swift: restore ms-be1088 to the rings

https://gerrit.wikimedia.org/r/1214100

elukey claimed this task.

The issue should be fixed now thanks to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1225021. Closing the task, please re-open if it re-occurs!