Determine how to survive a disk failure when using UEFI. One option may be to configure Debian to install all GRUB updates to both EFI system partitions; see https://unix.stackexchange.com/a/623076
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T373519 Allow UEFI DHCP configs |
| Open | | None | T376949 UEFI and software RAID |
Event Timeline
Unfortunately, it does not appear that Ubuntu's solution was ever upstreamed into Debian; see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=765740
Jesse, not sure how much things have changed since T215183: Redundant bootloaders for software RAID, but there's at least a partman recipe in there that used to work.
Change #1082288 had a related patch set uploaded (by JHathaway; author: JHathaway):
[operations/puppet@production] efi: add script install grub on all efi sys parts
Thanks @CDanis, things change a bit with the addition of EFI, as we need to sync the EFI partitions on all our drives, since they are not part of the RAID set.
Lennart Poettering has a good overview of EFI booting and mentions the software RAID challenges as well, https://0pointer.net/blog/linux-boot-partitions.html
I put together a script which syncs the partitions after every kernel install (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082288). I think it is at least a good starting point for a discussion on how we want to ensure redundancy. In my testing the script works, but we would need to ensure our /boot/efi partition has the nofail option in /etc/fstab, so that the system still boots when the disk holding that partition is missing.
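As a rough illustration of the fstab requirement, here is a minimal Python sketch that checks whether the entry for a mount point carries the nofail option. The function name `has_nofail` is hypothetical and is not part of the Gerrit change; the sketch only assumes the standard whitespace-separated fstab field order (device, mount point, type, options, dump, pass).

```python
def has_nofail(fstab_text: str, mount_point: str = "/boot/efi") -> bool:
    """Return True if the fstab entry for mount_point lists the nofail option.

    Hypothetical helper for illustration only; it is not taken from the
    puppet change discussed in this task.
    """
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # fstab fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            return "nofail" in fields[3].split(",")
    # No entry for this mount point at all.
    return False
```

On a host this could be fed the contents of /etc/fstab, for example `has_nofail(open("/etc/fstab").read())`.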
I haven't found the time to look into this deeper myself, but the topic also came up in #debian-boot and the following pointers were also provided there:
The Debian wiki page on this: https://wiki.debian.org/UEFI#RAID_for_the_EFI_System_Partition
Another tool that was mentioned is that helper used by Proxmox:
https://git.proxmox.com/?p=proxmox-kernel-helper.git;a=tree;f=src/proxmox-boot;h=2f7f08b40585cd40cbec18478ad717a6bb20765c;hb=HEAD
Thanks, I looked at this script. I chose to use grub-install rather than relying on rsync being installed, but the result is similar.
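To make the grub-install approach concrete, here is a small Python sketch that builds one grub-install invocation per mounted EFI system partition. It is not the script from the Gerrit change: the function name, the example mount points, the bootloader id, and the choice to pass `--no-nvram` are all assumptions made for illustration. The flags themselves (`--target`, `--efi-directory`, `--bootloader-id`, `--no-nvram`) are standard grub-install options.

```python
def grub_install_commands(esp_mounts, target="x86_64-efi", bootloader_id="debian"):
    """Build a grub-install argument list for every ESP mount point.

    Hypothetical sketch: mount points and defaults are assumptions,
    not values taken from the actual puppet change.
    """
    commands = []
    for mount in esp_mounts:
        commands.append([
            "grub-install",
            f"--target={target}",
            f"--efi-directory={mount}",
            f"--bootloader-id={bootloader_id}",
            # Assumption: leave the firmware boot entries untouched.
            "--no-nvram",
        ])
    return commands
```

Each returned list could be handed to `subprocess.run(cmd, check=True)`, for example for the hypothetical mount points `["/boot/efi", "/boot/efi2"]`. Every ESP has to be mounted before the corresponding command runs.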
Another tool that was mentioned is that helper used by Proxmox:
https://git.proxmox.com/?p=proxmox-kernel-helper.git;a=tree;f=src/proxmox-boot;h=2f7f08b40585cd40cbec18478ad717a6bb20765c;hb=HEAD
I hadn't seen this one before. It is interesting, but seems to be quite specific to Proxmox's architecture.
@Volans mentioned that it would be nice to sync the EFI partition following the replacement of a failed disk. One possibility would be to modify the script to support being called from mdadm's monitoring events. For instance, when we receive a RebuildStarted event, we could sync the EFI partitions; see man 8 mdadm.
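A minimal sketch of what such an event hook could look like in Python follows. Per mdadm(8), the monitor program is given two or three arguments: the event name, the md device, and, where relevant, a related component device. The names `SYNC_EVENTS` and `should_sync` are hypothetical, and the actual sync step is deliberately left out.

```python
# Events that should trigger an EFI partition sync (assumption: only
# RebuildStarted, as suggested in the comment above).
SYNC_EVENTS = {"RebuildStarted"}


def should_sync(argv):
    """Decide whether an mdadm monitor event warrants an EFI sync.

    argv is the argument list mdadm passes to its monitor program:
    [event, md_device] or [event, md_device, component_device].
    Hypothetical helper for illustration only.
    """
    if len(argv) < 2:
        # Not a well-formed mdadm event invocation.
        return False
    event = argv[0]
    return event in SYNC_EVENTS
```

A wrapper script could call `should_sync(sys.argv[1:])` and then run the sync, with the script registered through the `PROGRAM` line in mdadm.conf or the `--program` option of `mdadm --monitor`.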
I stumbled across this post, which has a few interesting takes, like using mdadm metadata v1.0 to keep the ESP partition bootable but still active in mdadm, and forcing a manual resync every boot in case UEFI writes to its active ESP pre-boot (since it doesn't know anything about mdadm, of course). I'm not 100% convinced, but it doesn't sound too unreasonable?
Thanks @CDanis, I happened upon that post as well, and I don't think their approach is unreasonable. I think there are different trade-offs between complexity and adherence to spec. My preference is to try the sync script, but if that fails I'm happy to look at their approach.
@jhathaway should we go ahead with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082288?
Change #1205197 had a related patch set uploaded (by JHathaway; author: JHathaway):
[operations/puppet@production] UEFI: dup partition on MD RAID boxes
Change #1082288 abandoned by JHathaway:
[operations/puppet@production] EFI: install grub on all EFI partitions
Reason:
I think 1205197 is a better approach.
Change #1205197 merged by JHathaway:
[operations/puppet@production] UEFI: dup partition on MD RAID boxes
Change #1214563 had a related patch set uploaded (by JHathaway; author: JHathaway):
[operations/puppet@production] UEFI: remove dup timer on bullseye
Change #1214563 merged by JHathaway:
[operations/puppet@production] UEFI: remove dup timer on bullseye