= Background =
72% of the fleet uses software RAID, but only approximately 7.6% of the fleet (or 10.6% of systems with software RAID) have a bootloader installed on multiple disks in the array.† So when sda fails and needs to be replaced, merely swapping the disk is insufficient to return the system to service. This adds unnecessary toil for DCops and SRE (a recent example: T214813).
= Goals =
1. Fix all partman configs that set up software RAID to also install bootloaders on multiple disks. This means machines are ‘correct’ when freshly imaged.
1. DCops should have a method to easily re-install a bootloader after swapping a disk on a software RAID machine. This means disk swaps don’t create time bombs.
1. (stretch goal) As a one-time fleetwide operation, install bootloaders on most RAID members where they are not already present.
= Non-goals =
- A from-scratch/from-first-principles rewrite or refactor of our (admittedly incredible number of) partman configs
- New monitoring infrastructure beyond simple one-off scripts
- Performing deep modifications to debian-installer
- Reimaging/reinstalling the fleet en masse
- Fixing 100% of existing systems (there will be ones that aren't trivial to fix)
- In general, anything that involves adding new moving parts to production
= Plan: goal #1: correctness when freshly imaged =
Two of our existing partman configs, `ms-be` and `ms-be-legacy`, already set additional debian-installer preseed options that persuade it to install GRUB on both disks of the RAID1 pair:
```
d-i grub-installer/bootdev string /dev/sda /dev/sdb
# this works around LP #1012629 / Debian #666974:
# it makes grub-installer jump to step 2, where it uses bootdev
d-i grub-installer/only_debian boolean false
```
Although [[https://bugs.debian.org/666974 | Debian #666974]] has long been marked as closed, the workaround of setting `only_debian` to false is still necessary on stretch.
The plan here is straightforward: all partman configs that specify a `partman-auto-raid/recipe` will be updated to include the above grub-installer stanzas as well, with `bootdev` set to whatever physical disks are part of the RAID group for `/boot` or `/`.
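For example, a recipe whose boot array spans four disks would gain a stanza like this (the disk list here is illustrative; the real list comes from each recipe's RAID members):
```
# hypothetical: boot RAID spans sda..sdd, so GRUB goes on all four
d-i grub-installer/bootdev string /dev/sda /dev/sdb /dev/sdc /dev/sdd
d-i grub-installer/only_debian boolean false
```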
= Plan: goal #2: correctness after a drive replacement =
The minimum requirement is that DCops has a documented procedure for restoring a bootloader after replacing a disk. I believe this should be as simple as running `grub-install /dev/sdX`. The plan: add these instructions as a step in the new [[https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook | DCops runbook]].
A possibility for further work: write a script that performs the necessary `mdadm` invocations to begin repairing arrays, in addition to invoking `grub-install`.
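A sketch of what such a script might do, assuming the replacement disk is `/dev/sdb`, a surviving member is `/dev/sda`, and the arrays are `/dev/md0` and `/dev/md1` (all illustrative; an MBR partition table is assumed, a GPT machine would use `sgdisk` instead of `sfdisk`):
```
#!/bin/bash
# Sketch only: repair a software-RAID machine after swapping in a blank disk.
# Device and array names below are examples, not fleet-wide constants.
set -e
NEW=/dev/sdb     # the freshly swapped-in disk
GOOD=/dev/sda    # a surviving array member

# Clone the partition table from the surviving disk (MBR case; GPT would
# replicate with sgdisk and then randomize the disk GUIDs).
sfdisk -d "$GOOD" | sfdisk "$NEW"

# Re-add the new partitions to their arrays so md starts resyncing.
mdadm /dev/md0 --add "${NEW}1"
mdadm /dev/md1 --add "${NEW}2"

# Install the bootloader so this disk is bootable too.
grub-install "$NEW"
```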
Stretch goal: provide a fast & easy mechanism to boot a rescue GRUB via the PXE menu, for cases where the only bootable disk in a host has failed and the host cannot boot on its own.
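One possible shape for this (a sketch, not a decided design): build a GRUB netboot image with `grub-mknetdir` on the TFTP server and expose it as a pxelinux menu entry. Paths and labels below are made up:
```
# one-time, on the TFTP server (path is an example):
grub-mknetdir --net-directory=/srv/tftpboot --subdir=grub

# then in the pxelinux menu config, chainload the generated NBP:
#   LABEL rescue-grub
#     MENU LABEL Rescue GRUB (boot a local disk by hand)
#     KERNEL grub/i386-pc/core.0
```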
= Plan: goal #3: slowly fix up the fleet =
In theory, all that is necessary is to invoke `grub-install` many times.
In practice, it seems inevitable that there will be complications in doing so. It is also an inherently risky operation: a botched `grub-install` can leave a host unbootable.
Tentative plan: eventually, execute the following across the entire fleet:
```
if the block device backing /boot is an md device:
    for each member partition sdXN: run grub-install /dev/sdX
```
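A minimal sketch of that per-host operation, assuming `/boot` (or `/`, when there is no separate `/boot`) sits directly on an md device; LVM-on-md layouts would first need to resolve the md device underneath the volume group:
```
#!/bin/bash
# Sketch only: install GRUB on every disk backing the /boot array.
set -e

BOOTDEV=$(findmnt -n -o SOURCE /boot 2>/dev/null || findmnt -n -o SOURCE /)

# Only act if /boot is backed directly by an md array.
case "$BOOTDEV" in
    /dev/md*) ;;
    *) echo "$BOOTDEV is not an md device; nothing to do"; exit 0 ;;
esac

# List active member partitions (e.g. /dev/sda1), then install GRUB on
# each member's parent disk.
for part in $(mdadm --detail "$BOOTDEV" | awk '/active sync/ {print $NF}'); do
    disk=$(lsblk -n -o PKNAME "$part")
    grub-install "/dev/$disk"
done
```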
We won't start with the entire fleet; instead we'll pick a few canary hosts covering several different partitioning schemes (RAID1, RAID1 with LVM, RAID10 across many disks; all of those crossed with MBR vs. GPT, etc.) and verify that the operation runs successfully and that the machines can boot off of their other drives.
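Verifying a canary without rebooting it can reuse the footnoted trick of grepping each disk's boot sector for the GRUB signature (disk names are examples):
```
# after running grub-install on a canary, confirm every array member
# now carries GRUB in its MBR
for d in /dev/sda /dev/sdb; do
    head -c512 "$d" | grep -q GRUB && echo "$d: GRUB present" || echo "$d: missing"
done
```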
I do not think it is necessary that we fix 100% of all machines -- upwards of 90% would be great. We should have a recovery process for when the bootloader has gone MIA anyway (see stretch work in goal #2).
Since the overall purpose is to save DCops and ourselves work, we should be willing to abandon this goal if it becomes too time-consuming, or to abandon fixing some subsets of machines.
== Known weird stuff ==
* Several wdqs hosts have partition tables that do not match their partman files -- their sda1 is a type-0x0b 'Win95 FAT32' partition. Two others have sdb1 as an NTFS partition. This isn't just a case of partition types not matching the actual contents; there are in fact FAT32/NTFS filesystems present. https://phabricator.wikimedia.org/P8077
=== Footnotes ===
†: Generated with: `cumin -p99 'F:virtual = physical' 'test -b /dev/md0 && (echo md0; head -c512 /dev/sdb|grep -q GRUB && echo sdb || echo nope) || echo no-md0'`