= Background =
72% of the fleet uses software RAID, but only approximately 7.6% of the fleet (or 10.6% of systems with software RAID) have a bootloader installed on multiple disks in the array.† So when sda fails and needs to be replaced, merely swapping the disk is insufficient to return the system to service. This adds unnecessary toil for DCops and SRE (a recent example: T214813).
= Goals =
1. Fix all partman configs that set up software RAID to also install bootloaders on multiple disks. This means machines are ‘correct’ when freshly imaged.
1. DCops should have a method to easily re-install a bootloader after swapping a disk on a software RAID machine. This means disk swaps don’t create time bombs.
1. (stretch goal) As a one-time fleetwide operation, install bootloaders on most RAID members where they are not already present.
= Non-goals =
- A from-scratch/from-first-principles rewrite or refactor of our (admittedly incredible number of) partman configs
- New monitoring infrastructure beyond simple one-off scripts
- Performing deep modifications to debian-installer
- Reimaging/reinstalling the fleet en masse
- Fixing 100% of existing systems (there will be ones that aren't trivial to fix)
- In general, anything that involves adding new moving parts to production
= Plan: goal #1: correctness when freshly imaged =
Two of our existing partman configs, `ms-be` and `ms-be-legacy`, already set additional debian-installer preseed options that persuade it to install GRUB on both disks of the RAID1 pair:
```
d-i grub-installer/bootdev string /dev/sda /dev/sdb
# this works around LP #1012629 / Debian #666974:
# it makes grub-installer jump to step 2, where it uses bootdev
d-i grub-installer/only_debian boolean false
```
Although [[https://bugs.debian.org/666974 | Debian #666974]] has long been marked as closed, the workaround of setting `only_debian` to false is still necessary on stretch.
The plan here is straightforward: all partman configs that specify a `partman-auto-raid/recipe` will be updated to include the above grub-installer stanzas as well, with `bootdev` set to whatever physical disks are part of the RAID group for `/boot` or `/`.
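For example, a recipe whose boot array spans four disks would gain a stanza like this (the disk list here is illustrative; the real list comes from each recipe's RAID members):
```
# hypothetical: boot RAID spans sda..sdd, so GRUB goes on all four
d-i grub-installer/bootdev string /dev/sda /dev/sdb /dev/sdc /dev/sdd
d-i grub-installer/only_debian boolean false
```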
= Plan: goal #2: correctness after a drive replacement =
The minimum requirement is that DCops has a documented procedure for restoring a bootloader after replacing a disk. I believe this should be as simple as running `grub-install /dev/sdX`. The plan: add these instructions as a step in the new [[https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook | DCops runbook]].
A possibility for further work: write a script that performs the necessary `mdadm` invocations to begin repairing arrays, in addition to invoking `grub-install`.
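A sketch of what such a script might do, assuming the replacement disk is `/dev/sdb`, a surviving member is `/dev/sda`, and the arrays are `/dev/md0` and `/dev/md1` (all illustrative; an MBR partition table is assumed, a GPT machine would use `sgdisk` instead of `sfdisk`):
```
#!/bin/bash
# Sketch only: repair a software-RAID machine after swapping in a blank disk.
# Device and array names below are examples, not fleet-wide constants.
set -e
NEW=/dev/sdb     # the freshly swapped-in disk
GOOD=/dev/sda    # a surviving array member

# Clone the partition table from the surviving disk (MBR case; GPT would
# replicate with sgdisk and then randomize the disk GUIDs).
sfdisk -d "$GOOD" | sfdisk "$NEW"

# Re-add the new partitions to their arrays so md starts resyncing.
mdadm /dev/md0 --add "${NEW}1"
mdadm /dev/md1 --add "${NEW}2"

# Install the bootloader so this disk is bootable too.
grub-install "$NEW"
```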
Stretch goal: provide a fast & easy mechanism to boot a rescue GRUB via the PXE menu, for cases where the only bootable disk in a host has failed and the host cannot boot on its own.
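One possible shape for this (a sketch, not a decided design): build a GRUB netboot image with `grub-mknetdir` on the TFTP server and expose it as a pxelinux menu entry. Paths and labels below are made up:
```
# one-time, on the TFTP server (path is an example):
grub-mknetdir --net-directory=/srv/tftpboot --subdir=grub

# then in the pxelinux menu config, chainload the generated NBP:
#   LABEL rescue-grub
#     MENU LABEL Rescue GRUB (boot a local disk by hand)
#     KERNEL grub/i386-pc/core.0
```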
= Plan: goal #3: slowly fix up the fleet =
In theory, all that is necessary is to invoke `grub-install` many times.
In practice, it seems inevitable that there will be complications in doing so. It is also an inherently risky operation: a botched `grub-install` can leave a host unbootable.
Tentative plan: eventually, execute the following across the entire fleet:
```
if the block device backing /boot is an md device:
    for each member partition sdXN: run grub-install /dev/sdX
```
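A minimal sketch of that per-host operation, assuming `/boot` (or `/`, when there is no separate `/boot`) sits directly on an md device; LVM-on-md layouts would first need to resolve the md device underneath the volume group:
```
#!/bin/bash
# Sketch only: install GRUB on every disk backing the /boot array.
set -e

BOOTDEV=$(findmnt -n -o SOURCE /boot 2>/dev/null || findmnt -n -o SOURCE /)

# Only act if /boot is backed directly by an md array.
case "$BOOTDEV" in
    /dev/md*) ;;
    *) echo "$BOOTDEV is not an md device; nothing to do"; exit 0 ;;
esac

# List active member partitions (e.g. /dev/sda1), then install GRUB on
# each member's parent disk.
for part in $(mdadm --detail "$BOOTDEV" | awk '/active sync/ {print $NF}'); do
    disk=$(lsblk -n -o PKNAME "$part")
    grub-install "/dev/$disk"
done
```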
We won't start with the entire fleet; instead we'll pick a few canary hosts covering several different partitioning schemes (RAID1, RAID1 with LVM, RAID10 across many disks; all of those crossed with MBR vs. GPT, etc.) and verify that the operation runs successfully and that the machines can boot off of their other drives.
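Verifying a canary without rebooting it can reuse the footnoted trick of grepping each disk's boot sector for the GRUB signature (disk names are examples):
```
# after running grub-install on a canary, confirm every array member
# now carries GRUB in its MBR
for d in /dev/sda /dev/sdb; do
    head -c512 "$d" | grep -q GRUB && echo "$d: GRUB present" || echo "$d: missing"
done
```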
I do not think it is necessary that we fix 100% of all machines -- upwards of 90% would be great. We should have a recovery process for when the bootloader has gone MIA anyway (see stretch work in goal #2).
Since the overall purpose is to save DCops and ourselves work, we should be willing to abandon this goal if it becomes too time-consuming, or to abandon fixing some subsets of machines.
== Known weird stuff ==
* Several wdqs hosts have partition tables that do not match their partman files -- their sda1 is a type-0x0b 'Win95 FAT32' partition. Two others have sdb1 as an NTFS partition. This isn't just a case of partition types not matching the actual contents; there are in fact FAT32/NTFS filesystems present. https://phabricator.wikimedia.org/P8077
=== Footnotes ===
†: Generated with: `cumin -p99 'F:virtual = physical' 'test -b /dev/md0 && (echo md0; head -c512 /dev/sdb|grep -q GRUB && echo sdb || echo nope) || echo no-md0'`