
4 failed reimages on wdqs1029, 1030, 1031, 1032
Open, MediumPublic

Description

As part of T412235, I'm reimaging a few WDQS servers. Reimaging succeeded for wdqs1028, but failed for wdqs1029, wdqs1030, wdqs1031, and wdqs1032. The install console does not seem to be available. I have not tried to investigate. Could we have infrastructure foundation take over this reimaging and use it as an opportunity to improve the reliability of the reimaging cookbook?

Logs are available on cumin1003, and output was logged on T412235.

It would be nice to have one of those servers reimaged soon. The others are not urgent, so there is time to experiment if needed.

Event Timeline

Gehel renamed this task from 2 failed reimages on wdqs1029 and wdqs1030 to 2 failed reimages on wdqs1029, 1030, 1031.Dec 12 2025, 8:20 AM
Gehel updated the task description. (Show Details)
Gehel renamed this task from 2 failed reimages on wdqs1029, 1030, 1031 to 2 failed reimages on wdqs1029, 1030, 1031, 1032.Dec 12 2025, 1:21 PM
Gehel updated the task description. (Show Details)
Gehel renamed this task from 2 failed reimages on wdqs1029, 1030, 1031, 1032 to 4 failed reimages on wdqs1029, 1030, 1031, 1032.Dec 12 2025, 2:53 PM
Gehel updated the task description. (Show Details)

Logs on cumin1003:/var/log/spicerack/sre/hosts/reimage/202512120914_gehel_3706738_wdqs1032.out show a successful puppet run.

Extended log (/var/log/spicerack/sre/hosts/reimage-extended.log) indicates that the server did not complete a reboot afterward:

2025-12-12 10:47:29,210 gehel 3706738 [WARNING wmflib.decorators:234 in wrapper] [239/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot 
for wdqs1032.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
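The retried check above boils down to comparing the host's reported uptime against the time elapsed since the reboot was requested; a stripped-down sketch of that logic, with illustrative values (not the cookbook's real code):

```shell
# Sketch of a wait-for-reboot check in the spirit of wait_reboot_since
# (illustrative, NOT the spicerack implementation): the host only counts
# as rebooted once its uptime drops below the seconds elapsed since the
# reboot was issued.
elapsed_since_reboot_request=300   # hypothetical: reboot issued 5 min ago
uptime_seconds=86400               # hypothetical: host reports 1 day uptime
if [ "$uptime_seconds" -lt "$elapsed_since_reboot_request" ]; then
  status="rebooted"
else
  status="not rebooted yet, keep polling"
fi
echo "$status"
```

With an uptime that never drops, the check keeps polling until the 240-attempt budget is exhausted, which matches the `[239/240, retrying in 10.00s]` log line above.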

Serial connection via the management console indicates an issue with grub, which is stuck in a rescue shell:

grub rescue>
elukey triaged this task as Medium priority.Mon, Dec 15, 3:21 PM

Tried to reimage wdqs1029, got this while booting (checked the racadm's console):

(+) BIOS Config...  | Task 1 of 1 - BIOS Configuration (JID_658387574202)
                    |
                    | Progress: 100%
                    | Elapsed Time: 00:19
                    | Worst Case Time: 19:00
                    | Task Status: Completed
                    | Last Status Message: Task Completed successfully
                    |----------------------------------------------------
                    | Total Elapsed Time: 00:00:19
                    | Failed Task Count: 0
                    | Warning Task Count: 0
                    | Success Task Count: 1
--------------------+----------------------------------------------------
Legend:             | Console Log:
                    | Collecting the list of tasks to be executed
(+) : Success       | Task in Progress
(!) : Warning       | Task Completed successfully
(X) : Failed        |
(.) : Pending       |
->  : In Progress   |

Then the host rebooted and the Debian install started. I guess that some BMC-related task was still in progress (were the hosts' BIOS/BMC firmware upgraded recently?). Then the reimage completed successfully.

@Gehel I'll help with these reimages, but the following statements are not great imho:

I have not tried to investigate.
Could we have infrastructure foundation take over this reimaging and use it as an opportunity to improve the reliability of the reimaging cookbook?

Without any investigation and the fact that the grub shell was reached, anything could be at fault. We don't know what kind of things the host went through (for example, from the above post it seems that an upgrade of the BIOS was attempted), and the reliability of reimage should be considered in the context of a specific problem.

It worked nicely with Bookworm, but now I noticed that Trixie was targeted first. The hosts are Dell PowerEdge 440, and this issue smells like T407586. I'll try to do more tests tomorrow :)

I can repro with Trixie (but yesterday before leaving it also failed at the end for Bookworm, so I suspect it is not OS-dependent) and I see this after the first puppet run+reboot:

Booting from Embedded SATA Port Disk B: debian
Welcome to GRUB!

error: disk `mduuid/8f5e64577972dc674bb1aae21438f8f4' not found.
grub rescue>

The fact that it doesn't happen after the Debian-Install + reboot, but only after the first puppet run + reboot, makes me wonder if this is a special use case for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1205197.

elukey@cumin1003:~$ grep dup-uefi /var/log/spicerack/sre/hosts/reimage/202512160943_elukey_1067710_wdqs1029.out
Notice: /Stage[main]/Raid::Md/File[/usr/local/bin/dup-uefi]/ensure: defined content as '{sha256}b19b79c518a22af8f44c79cc6c4c7983924bf09ee9a77523d400e091b0d17835'
Notice: /Stage[main]/Raid::Md/Systemd::Timer::Job[dup-uefi]/Systemd::Unit[dup-uefi.service]/File[/lib/systemd/system/dup-uefi.service]/ensure: defined content as '{sha256}5a34d4dcf5306903bf804fe0ce8bea77205e3452b27924c09fa4cdbcd3fabb7a'
Info: /Stage[main]/Raid::Md/Systemd::Timer::Job[dup-uefi]/Systemd::Unit[dup-uefi.service]/File[/lib/systemd/system/dup-uefi.service]: Scheduling refresh of Exec[systemd daemon-reload for dup-uefi.service (dup-uefi.service)]
Notice: /Stage[main]/Raid::Md/Systemd::Timer::Job[dup-uefi]/Systemd::Unit[dup-uefi.service]/Exec[systemd daemon-reload for dup-uefi.service (dup-uefi.service)]: Triggered 'refresh' from 1 event
Notice: /Stage[main]/Raid::Md/Systemd::Timer::Job[dup-uefi]/Systemd::Syslog[dup-uefi]/File[/var/log/dup-uefi]/ensure: created
Notice: /Stage[main]/Raid::Md/Systemd::Timer::Job[dup-uefi]/Systemd::Syslog[dup-uefi]/Rsyslog::Conf[dup-uefi]/File[/etc/rsyslog.d/40-dup-uefi.conf]/ensure: defined content as '{sha256}3f89314df0591c028f0614f6f614a2be95388575f2d450473b05f61050902405'
Info: /Stage[main]/Raid::Md/Systemd::Timer::Job[dup-uefi]/Systemd::Syslog[dup-uefi]/Rsyslog::Conf[dup-uefi]/File[/etc/rsyslog.d/40-dup-uefi.conf]: Scheduling refresh of Service[rsyslog]
Notice: /Stage[main]/Raid::Md/Systemd::Timer::Job[dup-uefi]/Systemd::Syslog[dup-uefi]/Logrotate::Conf[dup-uefi]/File[/etc/logrotate.d/dup-uefi]/ensure: defined content as '{sha256}fb5e1da40b9d354b1d4291cba74fcaaaa69c9ba311ded31c83d7c0b2933702c1'
Notice: /Stage[main]/Raid::Md/Systemd::Timer::Job[dup-uefi]/Systemd::Timer[dup-uefi]/Systemd::Service[dup-uefi]/Systemd::Unit[dup-uefi.timer]/File[/lib/systemd/system/dup-uefi.timer]/ensure: defined content as '{sha256}f1904b35897bd8c0145268b90b1628a85ff5fd29c099e2407c19fb6e00cb45ff'
Info: /Stage[main]/Raid::Md/Systemd::Timer::Job[dup-uefi]/Systemd::Timer[dup-uefi]/Systemd::Service[dup-uefi]/Systemd::Unit[dup-uefi.timer]/File[/lib/systemd/system/dup-uefi.timer]: Scheduling refresh of Exec[systemd daemon-reload for dup-uefi.timer (dup-uefi.timer)]
Notice: /Stage[main]/Raid::Md/Systemd::Timer::Job[dup-uefi]/Systemd::Timer[dup-uefi]/Systemd::Service[dup-uefi]/Systemd::Unit[dup-uefi.timer]/Exec[systemd daemon-reload for dup-uefi.timer (dup-uefi.timer)]: Triggered 'refresh' from 1 event
Notice: /Stage[main]/Raid::Md/Systemd::Timer::Job[dup-uefi]/Systemd::Timer[dup-uefi]/Systemd::Service[dup-uefi]/Service[dup-uefi.timer]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Raid::Md/Systemd::Timer::Job[dup-uefi]/Systemd::Timer[dup-uefi]/Systemd::Service[dup-uefi]/Service[dup-uefi.timer]: Unscheduling refresh on Service[dup-uefi.timer]

I tweaked the reimage script to stop after the first puppet run, so I was able to ssh to the wdqs1029 host:

elukey@wdqs1029:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            126G     0  126G   0% /dev
tmpfs            26G  1.6M   26G   1% /run
efivarfs        304K  238K   62K  80% /sys/firmware/efi/efivars
/dev/md0         73G  2.3G   67G   4% /
tmpfs           126G     0  126G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
tmpfs           126G     0  126G   0% /tmp
/dev/md2        3.3T  2.1M  3.2T   1% /srv
tmpfs           1.0M     0  1.0M   0% /run/credentials/getty@tty1.service
tmpfs           1.0M     0  1.0M   0% /run/credentials/serial-getty@ttyS1.service
/dev/sda2       241M  318K  240M   1% /boot/efi
tmpfs            26G  4.0K   26G   1% /run/user/13926

elukey@wdqs1029:~$ sudo fdisk -l
Disk /dev/sdb: 1.75 TiB, 1920383410176 bytes, 3750748848 sectors
Disk model: MTFDDAK1T9TDN   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 49A778FA-5509-4EA7-B2BB-6CD6C8B06CC4

Device         Start        End    Sectors  Size Type
/dev/sdb1       2048       4095       2048    1M BIOS boot
/dev/sdb2       4096     503807     499712  244M EFI System
/dev/sdb3     503808  156753919  156250112 74.5G Linux RAID
/dev/sdb4  156753920  158754815    2000896  977M Linux RAID
/dev/sdb5  158754816 3750748159 3591993344  1.7T Linux RAID


Disk /dev/sda: 1.75 TiB, 1920383410176 bytes, 3750748848 sectors
Disk model: MTFDDAK1T9TDN   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: F39A2421-4A8C-409D-8CCF-886505296F68

Device         Start        End    Sectors  Size Type
/dev/sda1       2048       4095       2048    1M BIOS boot
/dev/sda2       4096     503807     499712  244M EFI System
/dev/sda3     503808  156753919  156250112 74.5G Linux RAID
/dev/sda4  156753920  158754815    2000896  977M Linux RAID
/dev/sda5  158754816 3750748159 3591993344  1.7T Linux RAID


Disk /dev/md2: 3.35 TiB, 3677930651648 bytes, 7183458304 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 524288 bytes / 1048576 bytes


Disk /dev/md0: 74.44 GiB, 79931899904 bytes, 156116992 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/md1: 976 MiB, 1023410176 bytes, 1998848 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

So dup-uefi seems to have run correctly:

elukey@wdqs1029:~$ sudo journalctl -u dup-uefi
Dec 16 11:28:28 wdqs1029 systemd[1]: Starting dup-uefi.service - Dup the UEFI part, so we survive a disk failure...
Dec 16 11:28:28 wdqs1029 dup-uefi[32060]: Info: Unmounting /dev/sda2

Dec 16 11:28:30 wdqs1029 dup-uefi[32060]: Info: /dev/sda2 and /dev/sdb2 are now identical  <=======================================================

Dec 16 11:28:31 wdqs1029 dup-uefi[32060]: Info: Remounting /dev/sda2
Dec 16 11:28:31 wdqs1029 systemd[1]: dup-uefi.service: Deactivated successfully.
Dec 16 11:28:31 wdqs1029 systemd[1]: Finished dup-uefi.service - Dup the UEFI part, so we survive a disk failure.
Dec 16 11:28:31 wdqs1029 systemd[1]: dup-uefi.service: Consumed 1.578s CPU time, 497.6M memory peak.
Dec 16 11:30:28 wdqs1029 systemd[1]: Starting dup-uefi.service - Dup the UEFI part, so we survive a disk failure...
Dec 16 11:30:28 wdqs1029 dup-uefi[32772]: Info: Unmounting /dev/sda2
Dec 16 11:30:29 wdqs1029 dup-uefi[32772]: Info: Skipping, /dev/sda2 and /dev/sdb2 are already identical
Dec 16 11:30:30 wdqs1029 dup-uefi[32772]: Info: Remounting /dev/sda2
Dec 16 11:30:30 wdqs1029 systemd[1]: dup-uefi.service: Deactivated successfully.
Dec 16 11:30:30 wdqs1029 systemd[1]: Finished dup-uefi.service - Dup the UEFI part, so we survive a disk failure.
Dec 16 11:30:30 wdqs1029 systemd[1]: dup-uefi.service: Consumed 1.117s CPU time, 246.9M memory peak.
Dec 16 11:30:46 wdqs1029 systemd[1]: Starting dup-uefi.service - Dup the UEFI part, so we survive a disk failure...
Dec 16 11:30:46 wdqs1029 dup-uefi[33281]: Info: Unmounting /dev/sda2
Dec 16 11:30:48 wdqs1029 dup-uefi[33281]: Info: Skipping, /dev/sda2 and /dev/sdb2 are already identical
Dec 16 11:30:48 wdqs1029 dup-uefi[33281]: Info: Remounting /dev/sda2
Dec 16 11:30:48 wdqs1029 systemd[1]: dup-uefi.service: Deactivated successfully.
Dec 16 11:30:48 wdqs1029 systemd[1]: Finished dup-uefi.service - Dup the UEFI part, so we survive a disk failure.
Dec 16 11:30:48 wdqs1029 systemd[1]: dup-uefi.service: Consumed 1.093s CPU time, 246.4M memory peak.
elukey@wdqs1029:~$ efibootmgr 
BootCurrent: 0005
BootOrder: 0005,0006,0000,0003,0001,0009,000C,000B,000A,0013,000D,0017,000E,0019,001B
Boot0000* NIC in Slot 2 Port 1 Partition 1	VenHw(986d1755-b9d0-4f8d-a0da-d1db18672045)
Boot0001* Embedded NIC 1 Port 1 Partition 1	VenHw(3a191845-5f86-4e78-8fce-c4cff59f9daa)
Boot0002* Hard drive C:	VenHw(d6c0639f-c705-4eb9-aa4f-5802d8823de6)feff0000000011000000050000000106740200c802000000330700c8890200c80000000000000000000000000000000000000000000000000000000000000000000000000011000000000000000000200002010c00d041030a0000000001010600051103120a000200ffff00007fff04004d00540046004400440041004b00310054003900540044004e000000feff0000000011000000050000000106b20200c8020000003d0700c8c70200c80000000000000000000000000000000000000000000000000000000000010000000000000012000100000000000000200002010c00d041030a0000000001010600051103120a000300ffff00007fff04004d00540046004400440041004b00310054003900540044004e000000
Boot0003* Windows Boot Manager	HD(2,GPT,fb4b47bf-94a6-4abe-8896-d5a0d820a715,0x96800,0x32000)/File(\EFI\Microsoft\Boot\bootmgfw.efi)57494e444f5753000100000088000000780000004200430044004f0042004a004500430054003d007b00390064006500610038003600320063002d0035006300640064002d0034006500370030002d0061006300630031002d006600330032006200330034003400640034003700390035007d00000069000100000010000000040000007fff0400
Boot0004* BRCM MBA Slot 0400 v20.14.0	BBS(128,BRCM MBA Slot 0400 v20.14.0,0x0)feff0400000000000000000000000200910100cc80000000800000cc750100cc00000000000000000000000000000000000000000000000000000000000000000000000000130000020000000000001c0002010c00d041030a0000000001010600051c0101060000007fff04004200520043004d0020004d0042004100200053006c006f0074002000300034003000300020007600320030002e00310034002e0030000000
Boot0005* debian	HD(2,GPT,69e62d28-a3e7-4b8f-8eb6-be096eb39abb,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0006* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,4257cfb4-7733-45f1-868a-a56273441de8,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0009* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,546b8527-13e9-470c-9e06-9a389ca71e8a,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot000A* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,c5bc729d-ec98-450f-807f-c363df7eebc6,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot000B* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,f1f1eef2-7ba1-47dd-ab1a-94df5c9b65ec,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot000C* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,f3eed661-3130-4dda-bba7-ec1f84595594,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot000D* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,2db5552d-c404-49df-a802-83cf37c8cddb,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot000E* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,69e62d28-a3e7-4b8f-8eb6-be096eb39abb,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0013* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,67579693-84cc-4e9a-8afc-19fb11d8ac64,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0017* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,5f5a243c-18e9-4d19-807e-88a0d2886b1d,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0019* debian	HD(2,GPT,3e13ac0d-22e2-40f0-8947-659396d07837,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot001B* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,3e13ac0d-22e2-40f0-8947-659396d07837,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
MirrorStatus: Platform does not support address range mirror
DesiredMirroredPercentageAbove4G: 0.00
DesiredMirrorMemoryBelow4GB: false

Checked the content of both UEFI partitions:

elukey@wdqs1029:~$ sudo find /boot/efi/EFI
/boot/efi/EFI
/boot/efi/EFI/debian
/boot/efi/EFI/debian/grubx64.efi
/boot/efi/EFI/Dell
/boot/efi/EFI/Dell/BootOptionCache
/boot/efi/EFI/Dell/BootOptionCache/BootOptionCache.dat
/boot/efi/EFI/BOOT
/boot/efi/EFI/BOOT/BOOTX64.EFI

elukey@wdqs1029:~$ mkdir test-uefi
elukey@wdqs1029:~$ sudo mount /dev/sdb2 test-uefi/
elukey@wdqs1029:~$ find test-uefi/
test-uefi/
test-uefi/EFI
test-uefi/EFI/debian
test-uefi/EFI/debian/grubx64.efi
test-uefi/EFI/Dell
test-uefi/EFI/Dell/BootOptionCache
test-uefi/EFI/Dell/BootOptionCache/BootOptionCache.dat
test-uefi/EFI/BOOT
test-uefi/EFI/BOOT/BOOTX64.EFI

Tried to manually reboot the host, and it worked fine, no grub issue.

A similar issue happened in T404356, but it was not 100% predictable how to reproduce.

I tried to see if the issue was related to the dup-uefi service not completing in time after the first puppet run, but that doesn't seem to be the case. The issue doesn't happen after the Debian install's first reboot, only after the first puppet run and reboot, so it must be something related to that. Maybe dup-uefi sets a new boot device that is somehow wrong?
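If dup-uefi does something along the lines of "compare the two ESPs, copy when they differ" (a hedged sketch based on its journal output above; the real script lives in the puppet repo), the sync step itself looks sound:

```shell
# Hedged sketch of an ESP-duplication step in the spirit of dup-uefi
# (NOT the real script): copy the primary ESP over the secondary only
# when the two differ. Temp files stand in for the /dev/sda2 and
# /dev/sdb2 partitions here.
primary=$(mktemp) && secondary=$(mktemp)
echo "current bootloader" > "$primary"
echo "stale bootloader"   > "$secondary"
if cmp -s "$primary" "$secondary"; then
  result="Skipping, already identical"
else
  cp "$primary" "$secondary"
  result="now identical"
fi
echo "$result"
rm -f "$primary" "$secondary"
```

A second run on identical inputs would take the "Skipping" branch, matching the journal lines above.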

I am able to boot with

Booting from Embedded SATA Port Disk A: debian
error: disk `mduuid/07e87b0bbe0a35ee06e5d22da7aefd5a' not found.

grub rescue> set
fw_path='(hd0,gpt2)/EFI/debian'
prefix='(mduuid/07e87b0bbe0a35ee06e5d22da7aefd5a)/boot/grub'
root='mduuid/07e87b0bbe0a35ee06e5d22da7aefd5a'

grub rescue> ls (hd0,gpt2)/
error: unknown filesystem.

grub rescue> set root=(md/0)
grub rescue> set prefix=(md/0)/boot/grub
grub rescue> insmod normal
grub rescue> normal

I checked the md uuids (I hope this is the right way) and I don't find 07e8.. anywhere:

elukey@wdqs1029:~$ ls -lha /dev/disk/by-uuid
total 0
drwxr-xr-x 2 root root 120 Dec 17 11:31 .
drwxr-xr-x 8 root root 160 Dec 17 11:31 ..
lrwxrwxrwx 1 root root   9 Dec 17 11:31 0ae24514-e565-44d2-a157-3f878a37c99b -> ../../md2
lrwxrwxrwx 1 root root   9 Dec 17 11:31 667a0861-3db4-4149-8e86-8603dc955cc1 -> ../../md1
lrwxrwxrwx 1 root root  10 Dec 17 11:31 78A8-9DE2 -> ../../sdb2
lrwxrwxrwx 1 root root   9 Dec 17 11:31 b4702613-2d6e-44d6-a537-67c5e62eecc1 -> ../../md0

Where did grub get it from?

Found a reference of the wrong mduuid in /boot/efi/EFI/debian/grubx64.efi, on both partitions.
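grub-install embeds its search prefix (including the array's mduuid) into the core image, which would explain why the stale UUID shows up there; it can be located with a binary-safe grep. Demo on a synthetic file standing in for grubx64.efi:

```shell
# Locate an embedded mduuid in a grub EFI image. A synthetic text file
# stands in for /boot/efi/EFI/debian/grubx64.efi here; -a makes grep
# treat a real (binary) image as text.
img=$(mktemp)
printf 'GARBAGE(mduuid/07e87b0bbe0a35ee06e5d22da7aefd5a)/boot/grubGARBAGE' > "$img"
found=$(grep -a -o 'mduuid/[0-9a-f]\{32\}' "$img")
echo "$found"
rm -f "$img"
```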

@Gehel I'll help with these reimages, but the following statements are not great imho:

I have not tried to investigate.
Could we have infrastructure foundation take over this reimaging and use it as an opportunity to improve the reliability of the reimaging cookbook?

Without any investigation and the fact that the grub shell was reached, anything could be at fault. We don't know what kind of things the host went through (for example, from the above post it seems that an upgrade of the BIOS was attempted), and the reliability of reimage should be considered in the context of a specific problem.

Thanks for the help in finding how to make this work!

I very much agree this isn't great :) The question I'm trying to raise here is whose responsibility it is to investigate. I would argue that as DPE SRE we should be responsible for higher-level issues, but we shouldn't really be expected to have the skills and knowledge for investigation at this level. I have not tried anything else besides running the reimage cookbook, so I'm not sure where this attempted BIOS upgrade is coming from. The question of ownership and responsibilities is probably better suited for a more direct conversation than comments on a phab task. I've reached out to @LSobanski to get this conversation started. This will have to wait until January...

The question I'm trying to raise here is whose responsibility it is to investigate. I would argue that as DPE SRE, we should be responsible for higher-level issues, but we shouldn't really be expected to have the skills and knowledge for investigation at this level.

I agree that we need to discuss this, but my personal opinion is that this is not a standard partman configuration and the host has probably just been moved to UEFI, so a little investigation is expected from any SRE. We surely have the expertise to dive a little deeper, but I don't personally like tasks in which the description is "I tried to reimage, it didn't work, please do it so you can also improve the cookbook's reliability". I have no idea about the history of this host, why it has this configuration, or what was attempted/configured before, so I am happy to help, but it is not strictly I/F's responsibility :)

Really interesting: I tried with Bookworm and the issue doesn't happen, so this seems to be a UEFI/Trixie specific thing. Could be a variant of T407586.

Change #1220315 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move wqds10[28-1032] to insetup role

https://gerrit.wikimedia.org/r/1220315

Change #1220315 merged by Elukey:

[operations/puppet@production] Move wqds10[29-32] to insetup role

https://gerrit.wikimedia.org/r/1220315

I used install-console from cumin1003 to inspect details before the first puppet run (and after the debian install and first reboot):

root@wdqs1029:~# efibootmgr 
BootCurrent: 0005
BootOrder: 0005,0006,0000,0003,0001,0009,000C,000B,000A,0013,000D,0017,000E,001B,000F,001F,0010,0025,0011,0029,0007,0012,002D
Boot0000* NIC in Slot 2 Port 1 Partition 1	VenHw(986d1755-b9d0-4f8d-a0da-d1db18672045)
Boot0001* Embedded NIC 1 Port 1 Partition 1	VenHw(3a191845-5f86-4e78-8fce-c4cff59f9daa)
Boot0002* Hard drive C:	VenHw(d6c0639f-c705-4eb9-aa4f-5802d8823de6)feff0000000011000000050000000106740200c802000000330700c8890200c80000000000000000000000000000000000000000000000000000000000000000000000000011000000000000000000200002010c00d041030a0000000001010600051103120a000200ffff00007fff04004d00540046004400440041004b00310054003900540044004e000000feff0000000011000000050000000106b20200c8020000003d0700c8c70200c80000000000000000000000000000000000000000000000000000000000010000000000000012000100000000000000200002010c00d041030a0000000001010600051103120a000300ffff00007fff04004d00540046004400440041004b00310054003900540044004e000000
Boot0003* Windows Boot Manager	HD(2,GPT,fb4b47bf-94a6-4abe-8896-d5a0d820a715,0x96800,0x32000)/File(\EFI\Microsoft\Boot\bootmgfw.efi)57494e444f5753000100000088000000780000004200430044004f0042004a004500430054003d007b00390064006500610038003600320063002d0035006300640064002d0034006500370030002d0061006300630031002d006600330032006200330034003400640034003700390035007d00000069000100000010000000040000007fff0400
Boot0004* BRCM MBA Slot 0400 v20.14.0	BBS(128,BRCM MBA Slot 0400 v20.14.0,0x0)feff0400000000000000000000000200910100cc80000000800000cc750100cc00000000000000000000000000000000000000000000000000000000000000000000000000130000020000000000001c0002010c00d041030a0000000001010600051c0101060000007fff04004200520043004d0020004d0042004100200053006c006f0074002000300034003000300020007600320030002e00310034002e0030000000
Boot0005* debian	HD(2,GPT,943f5d04-374c-47f1-b2f8-4cd128aaf879,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0006* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,4257cfb4-7733-45f1-868a-a56273441de8,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0007* debian	HD(2,GPT,fc25c65d-96ff-4732-a5e7-89db93019fb1,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0009* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,546b8527-13e9-470c-9e06-9a389ca71e8a,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot000A* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,c5bc729d-ec98-450f-807f-c363df7eebc6,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot000B* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,f1f1eef2-7ba1-47dd-ab1a-94df5c9b65ec,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot000C* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,f3eed661-3130-4dda-bba7-ec1f84595594,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot000D* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,2db5552d-c404-49df-a802-83cf37c8cddb,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot000E* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,69e62d28-a3e7-4b8f-8eb6-be096eb39abb,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot000F* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,716d6c01-0ed6-4981-80ee-7df251a07e56,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0010* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,d60785d1-fa31-464c-927a-3faeb28868db,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0011* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,f19dda06-81db-4f60-b987-c80d14a3b423,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0012* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,fc25c65d-96ff-4732-a5e7-89db93019fb1,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0013* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,67579693-84cc-4e9a-8afc-19fb11d8ac64,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0017* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,5f5a243c-18e9-4d19-807e-88a0d2886b1d,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot001B* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,3e13ac0d-22e2-40f0-8947-659396d07837,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot001F* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,9d2b817a-cae5-4cdf-ab43-1ba70c8796a7,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0025* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,3f161457-52fd-41d2-a760-3829503474b9,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot0029* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,51e834fc-620d-4a14-9a9a-f7942867a064,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
Boot002D* debian 4257cfb4-7733-45f1-868a-a56273441de8	HD(2,GPT,943f5d04-374c-47f1-b2f8-4cd128aaf879,0x1000,0x7a000)/File(\EFI\debian\grubx64.efi)
MirrorStatus: Platform does not support address range mirror
DesiredMirroredPercentageAbove4G: 0.00
DesiredMirrorMemoryBelow4GB: false
NAME    FSTYPE            FSVER LABEL      UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
sda                                                                                            
├─sda1  ceph_bluestore                                                                         
├─sda2  vfat              FAT32            78A8-9DE2                                           
├─sda3  linux_raid_member 1.2   wdqs1029:0 9e2403aa-f2cb-e7e8-9d05-f0059897df0f                
│ └─md0 ext4              1.0              54ac1454-88f2-4955-a788-441ff705adb4   67.9G     1% /
├─sda4  linux_raid_member 1.2   wdqs1029:1 828c3a0e-4ac7-eedd-9aa4-e340ce914e9d                
│ └─md1 swap              1                1e618326-2324-4098-b231-162a46e24e3c                [SWAP]
└─sda5  linux_raid_member 1.2   wdqs1029:2 774521ac-2936-cfd2-cfb3-6b9919b1a356                
  └─md2 ext4              1.0              afdf39cf-d6b4-496e-9a32-411ed7dd3f1a    3.1T     0% /srv
sdb                                                                                            
├─sdb1  ceph_bluestore                                                                         
├─sdb2  vfat              FAT32            78A8-9DE2                             239.9M     0% /boot/efi
├─sdb3  linux_raid_member 1.2   wdqs1029:0 9e2403aa-f2cb-e7e8-9d05-f0059897df0f                
│ └─md0 ext4              1.0              54ac1454-88f2-4955-a788-441ff705adb4   67.9G     1% /
├─sdb4  linux_raid_member 1.2   wdqs1029:1 828c3a0e-4ac7-eedd-9aa4-e340ce914e9d                
│ └─md1 swap              1                1e618326-2324-4098-b231-162a46e24e3c                [SWAP]
└─sdb5  linux_raid_member 1.2   wdqs1029:2 774521ac-2936-cfd2-cfb3-6b9919b1a356                
  └─md2 ext4              1.0              afdf39cf-d6b4-496e-9a32-411ed7dd3f1a    3.1T     0% /srv
root@wdqs1029:~# mdadm --detail /dev/md0  | grep UUID
              UUID : 9e2403aa:f2cbe7e8:9d05f005:9897df0f
root@wdqs1029:~# mdadm --detail /dev/md1  | grep UUID
              UUID : 828c3a0e:4ac7eedd:9aa4e340:ce914e9d
root@wdqs1029:~# mdadm --detail /dev/md2  | grep UUID
              UUID : 774521ac:2936cfd2:cfb36b99:19b1a356

root@wdqs1029:~# grep mduuid /boot/grub/grub.cfg 
	set root='mduuid/9e2403aaf2cbe7e89d05f0059897df0f'
	  search --no-floppy --fs-uuid --set=root --hint='mduuid/9e2403aaf2cbe7e89d05f0059897df0f'  54ac1454-88f2-4955-a788-441ff705adb4
		set root='mduuid/9e2403aaf2cbe7e89d05f0059897df0f'
		  search --no-floppy --fs-uuid --set=root --hint='mduuid/9e2403aaf2cbe7e89d05f0059897df0f'  54ac1454-88f2-4955-a788-441ff705adb4
		set root='mduuid/9e2403aaf2cbe7e89d05f0059897df0f'
		  search --no-floppy --fs-uuid --set=root --hint='mduuid/9e2403aaf2cbe7e89d05f0059897df0f'  54ac1454-88f2-4955-a788-441ff705adb4
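For reference, the `mduuid/...` strings in grub.cfg are the mdadm array UUIDs with the colons stripped, which makes comparing the two outputs mechanical:

```shell
# grub's mduuid is the mdadm array UUID without the colon separators.
# Using md0's UUID from the mdadm --detail output above:
mdadm_uuid="9e2403aa:f2cbe7e8:9d05f005:9897df0f"
grub_mduuid=$(printf '%s' "$mdadm_uuid" | tr -d ':')
echo "$grub_mduuid"   # matches the 'mduuid/...' value in grub.cfg
```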

It turns out the above is not really needed: simply setting the insetup role seems to have worked, and wdqs1029 is now running Trixie. I'll try the other nodes to be sure.

Change #1220324 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] installserver: remove pause/debug from wdqs10[29-32]

https://gerrit.wikimedia.org/r/1220324

Change #1220324 merged by Elukey:

[operations/puppet@production] installserver: remove pause/debug from wdqs10[29-32]

https://gerrit.wikimedia.org/r/1220324

I can confirm that with the insetup role for data platform I don't see the issue anymore. I don't recall the exact error; IIRC it was something related to XML dumps, but with the wdqs::alternatives role the puppet run step in reimage failed (probably due to some missing settings for profile::statistics::dataset_mount). My only theory is that the puppet run after debian-install (and reboot) caused some inconsistency in grub's config, ending up in the rescue shell. I am wondering if it caused dup-uefi.service to work on the wrong data, or only partially, ending up with the wrong mduuid set in grub. I don't have solid proof though.

Next steps:

  • Reimage all servers to Trixie to confirm my theory.
  • Reapply the wdqs::alternatives role to wdqs1029, reboot and see if the issue re-appears (leaving puppet in a broken state).

Very weird, it happened again for wdqs1031; I kicked off another reimage and this time it worked nicely.

Same for wdqs1032, another reimage needs to be kicked off.

@elukey Thanks for the investigation! Is there one of those servers that you're already happy with and I could take over? It seems that wdqs1029 has reimaged cleanly into trixie with role(insetup::data_platform_ferm). I could take it over and move it to role(wdqs::alternatives) if you're done with it.

IRC discussion with @elukey : we need some more time for investigation. Hopefully we can get one more server running (T412235) by mid next week (January 14).

I tried to reimage wdqs1029 today trying to test https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1214488, but the second try led me to:

Booting from Embedded SATA Port Disk B: debian
error: disk `mduuid/2884f27b943ef5a81ae46a251fc7960c' not found.
grub rescue>

@jhathaway this is really weird, it seems like T404356 but I can't pinpoint the issue. Any ideas?

That does sound similar to T404356. Perhaps the re-image chose, say sda1 as the EFI boot device, but the previous install had sdb1 as the boot device, and it remained first in the boot order? I'm not sure how that could happen, because debian should place whatever partition it uses as first in the boot order, but perhaps that failed? Happy to take a closer look, if you can reproduce, or if the box is still in the same state.
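One quick sanity check for that theory is whether the entry the firmware actually booted (BootCurrent) is also first in BootOrder; a stale first entry pointing at the other disk's ESP could explain landing in grub rescue. Parsing a captured efibootmgr header (sample values from the outputs above):

```shell
# Check that BootCurrent matches the first BootOrder entry, using
# captured efibootmgr output rather than the live firmware.
out='BootCurrent: 0005
BootOrder: 0005,0006,0000,0003'
current=$(printf '%s\n' "$out" | awk '/^BootCurrent/{print $2}')
first=$(printf '%s\n' "$out" | awk '/^BootOrder/{split($2,a,","); print a[1]}')
if [ "$current" = "$first" ]; then check="in sync"; else check="mismatch"; fi
echo "$check"
```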

@jhathaway yes please! You can use wdqs1029 or 1030 :)

Mentioned in SAL (#wikimedia-operations) [2026-01-09T20:23:41Z] <jhathaway@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on wdqs1029.eqiad.wmnet with reason: T412451

Change #1225021 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] debian installer: format EFI partions

https://gerrit.wikimedia.org/r/1225021

@elukey I tested debian installer: format EFI partions on wdqs1029 and I am pretty confident that it resolves the issue. That is the good news. The bad news is that this discovery means I will need to audit every re-image that has occurred since UEFI: dup partition on MD RAID boxes was merged on December 3rd, to ensure their grub device is correct.
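One way such an audit could work (a hypothetical helper, not necessarily the actual procedure): compare the mduuid embedded in the installed grub image against the UUID of the array it should boot from. Both values are hardcoded here from the outputs earlier in this task; on a real host they would come from the commands shown in the comments:

```shell
# Hypothetical audit check: does the mduuid embedded in grubx64.efi
# match the live md0 array? On a real host the inputs would come from:
#   grep -a -o 'mduuid/[0-9a-f]*' /boot/efi/EFI/debian/grubx64.efi
#   mdadm --detail /dev/md0 | awk '/UUID :/{print $3}'
embedded="mduuid/9e2403aaf2cbe7e89d05f0059897df0f"
live_md_uuid="9e2403aa:f2cbe7e8:9d05f005:9897df0f"
if [ "${embedded#mduuid/}" = "$(printf '%s' "$live_md_uuid" | tr -d ':')" ]; then
  verdict="grub image matches live array"
else
  verdict="STALE grub image, needs grub-install"
fi
echo "$verdict"
```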

Change #1225021 merged by Elukey:

[operations/puppet@production] debian installer: format EFI partions

https://gerrit.wikimedia.org/r/1225021

@Gehel all fixed! wdqs1029, 1030, and 1031 are ready with Trixie. Meanwhile I had some issues with 1032; it seems as if it isn't able to HTTP boot from the network. If your team has time to test it and possibly follow up with DCops, let me know; otherwise I can take a look in a few days.