Background
72% of the fleet uses software RAID. However, only about 7.6% of the fleet (10.6% of systems with software RAID) has a bootloader installed on more than one disk in the array.† So when sda fails and needs to be replaced, a disk swap alone is insufficient to return the system to service. This adds unnecessary toil for DCops and SRE (a recent example: T214813).
Goals
- Fix all partman configs that set up software RAID to also install bootloaders on multiple disks. This means machines are ‘correct’ when freshly imaged.
- DCops should have a method to easily re-install a bootloader after swapping a disk on a software RAID machine. This means disk swaps don’t create time bombs.
- (stretch goal) As a one-time fleetwide operation, install bootloaders on most RAID members where they are not already present.
Non-goals
- A from-scratch/from-first-principles rewrite or refactor of our (admittedly incredible number of) partman configs
- New monitoring infrastructure beyond simple one-off scripts
- Performing deep modifications to debian-installer
- Reimaging/reinstalling the fleet en masse
- Fixing 100% of existing systems (there will be ones that aren't trivial to fix)
- In general, anything that involves adding new moving parts to production
Plan: goal #1: correctness when freshly imaged
Done: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/490404/
Two of our existing partman configs, ms-be and ms-be-legacy, already set additional debian-installer preseed options that persuade it to install GRUB on both disks of the RAID1 pair:
    d-i grub-installer/bootdev string /dev/sda /dev/sdb
    # this workarounds LP #1012629 / Debian #666974
    # it makes grub-installer to jump to step 2, where it uses bootdev
    d-i grub-installer/only_debian boolean false
Although Debian #666974 has long been marked as closed, the workaround of setting only_debian to false is still necessary on stretch.
The plan here is straightforward: all partman configs that specify a partman-auto-raid/recipe will be updated to include the above grub-installer stanzas as well, with bootdev set to whatever physical disks are part of the RAID group for /boot or /.
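As an illustration of what each updated config would gain, the sketch below prints the two grub-installer lines for a given list of disks. The function name grub_preseed_stanza is hypothetical and not part of the puppet repository; the disk list is an assumption that still has to be matched by hand to the RAID members behind /boot or / in each recipe.

```shell
# Hypothetical helper: print the grub-installer preseed lines for the
# disks passed as arguments (one argument per physical disk).
grub_preseed_stanza() {
    if [ "$#" -eq 0 ]; then
        echo "usage: grub_preseed_stanza DISK [DISK...]" >&2
        return 2
    fi
    # "$*" joins the arguments with spaces, matching the bootdev format
    # used in the ms-be configs.
    printf 'd-i grub-installer/bootdev string %s\n' "$*"
    # Still required on stretch so that grub-installer honours bootdev.
    printf 'd-i grub-installer/only_debian boolean false\n'
}
```

For a four-disk array, `grub_preseed_stanza /dev/sda /dev/sdb /dev/sdc /dev/sdd` would emit a bootdev line naming all four disks followed by the only_debian line.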
Plan: goal #2: correctness after a drive replacement
The minimum requirement is that DCops has a documented procedure for restoring a bootloader after replacing a disk. I believe that this should be as simple as running grub-install /dev/sdX. The plan: add these instructions as a step in the new DCops runbook.
A possibility for further work: write a script that performs necessary mdadm invocations to begin repairing arrays in addition to invoking grub-install.
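One possible shape for such a script, as a sketch only: it prints the commands it would run rather than executing them. The function name plan_disk_swap and its three arguments are hypothetical. It assumes the replacement disk has already been partitioned to match its peer, and that the operator supplies the array, partition, and disk names.

```shell
# Hypothetical dry-run helper: print the commands needed after a disk swap.
# Usage: plan_disk_swap MD_DEVICE NEW_PARTITION NEW_DISK
# Assumption: NEW_DISK already carries a partition table matching its peer.
plan_disk_swap() {
    if [ "$#" -ne 3 ]; then
        echo "usage: plan_disk_swap MD_DEVICE NEW_PARTITION NEW_DISK" >&2
        return 2
    fi
    md=$1
    part=$2
    disk=$3
    # Add the new partition to the array so that it starts resyncing.
    echo "mdadm --manage $md --add $part"
    # Put GRUB on the new disk so the host can boot from it as well.
    echo "grub-install $disk"
}
```

For example, `plan_disk_swap /dev/md0 /dev/sdb1 /dev/sdb` prints one mdadm line and one grub-install line. Printing instead of executing keeps a human in the loop, which seems appropriate for a procedure that writes to disks.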
Stretch goal: Provide a fast & easy mechanism to boot a rescue GRUB via the PXE menu, for cases where a host's only bootable disk has failed and the host can no longer boot.
Plan: goal #3: slowly fix up the fleet
In theory, all that is necessary is to invoke grub-install many times.
In practice, it seems inevitable that there will be complications in doing so. It also feels risky, since it rewrites boot sectors on production hosts.
Tentative plan: eventually, execute this across the entire fleet:
    If the block device backing /boot is an md device:
        For each of its member partitions sdXN, run grub-install /dev/sdX
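A minimal shell sketch of that loop follows. All function names are hypothetical. It assumes the filesystem containing /boot sits directly on /dev/mdN (with LVM on top of md, findmnt reports a device-mapper path and this sketch does nothing), and it only prints the commands unless APPLY=1 is set.

```shell
# Map an md member name (e.g. sda1, nvme0n1p2) to its parent disk path.
disk_of_partition() {
    part=${1#/dev/}
    case "$part" in
        nvme*p[0-9]*|mmcblk*p[0-9]*)
            # Partition suffix is "p<N>" on these device types.
            printf '/dev/%s\n' "${part%p[0-9]*}" ;;
        nvme*|mmcblk*)
            # Whole-disk member: already a disk name.
            printf '/dev/%s\n' "$part" ;;
        [a-z]*[0-9])
            # sdXN style: strip the trailing partition number.
            printf '/dev/%s\n' "$(printf '%s' "$part" | sed 's/[0-9]*$//')" ;;
        *)
            printf '/dev/%s\n' "$part" ;;
    esac
}

# List the disks behind the md array backing /boot (nothing if not md).
boot_member_disks() {
    src=$(findmnt -n -o SOURCE --target /boot) || return 1
    case "$src" in
        /dev/md[0-9]*) ;;
        *) return 0 ;;
    esac
    for member in /sys/block/"${src#/dev/}"/slaves/*; do
        [ -e "$member" ] || continue
        disk_of_partition "${member##*/}"
    done | sort -u
}

# Dry run unless APPLY=1 is set in the environment.
install_on_boot_members() {
    for disk in $(boot_member_disks); do
        if [ "${APPLY:-0}" = 1 ]; then
            grub-install "$disk"
        else
            echo "would run: grub-install $disk"
        fi
    done
}
```

Running `install_on_boot_members` as root on a canary host would first show what it intends to do; `APPLY=1 install_on_boot_members` would then perform the installs.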
We won't start with the entire fleet; instead we'll pick a few canary hosts covering several different partitioning schemes (RAID1, RAID1 with LVM, RAID10 across many disks; cross-product all of that with MBR vs GPT, etc.) and verify that it runs successfully and that the machines can boot off of their other drives.
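As a quick first check on canary hosts, the boot-sector heuristic from the footnote can be wrapped in a small function. This is a sketch with a hypothetical function name; finding the string only shows that GRUB's boot code appears to be present, so an actual boot test from each disk is still needed.

```shell
# Return 0 if the first 512 bytes of the given device or file contain
# the string "GRUB". Same heuristic as the fleet survey in the footnote;
# it does not prove that the disk is bootable.
has_grub_bootsector() {
    head -c 512 "$1" | grep -aq GRUB
}

# Example use (as root), with an assumed pair of disks:
#   for d in /dev/sda /dev/sdb; do
#       if has_grub_bootsector "$d"; then echo "$d: GRUB found"
#       else echo "$d: no GRUB"; fi
#   done
```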
I do not think we need to fix 100% of machines; upwards of 90% would be great. We should have a recovery process for when the bootloader has gone MIA anyway (see stretch work in goal #2).
Since the overall purpose is to save DCops and ourselves work, we should be willing to abandon this goal if it becomes too time-consuming, or to abandon fixing some subsets of machines.
Known weird stuff
- Several wdqs hosts have partition tables that do not match their partman files: their sda1 is a type-0x0b 'Win95 FAT32' partition. Two others have sdb1 as an NTFS partition. This isn't just a case of partition types that don't match the actual contents; there are in fact FAT32/NTFS filesystems present. https://phabricator.wikimedia.org/P8077
Footnotes
†: Generated with: cumin -p99 'F:virtual = physical' 'test -b /dev/md0 && (echo md0; head -c512 /dev/sdb|grep -q GRUB && echo sdb || echo nope) || echo no-md0'