
Redundant bootloaders for software RAID
Open, Normal, Public

Description

Background

72% of the fleet uses software RAID, but only about 7.6% of the fleet (or 10.6% of systems with software RAID) have a bootloader installed on multiple disks in the array.† So when sda fails and needs to be replaced, merely swapping the disk is not enough to return the system to service. This adds unnecessary toil for DCops and SRE (a recent example: T214813).

Goals

  1. Fix all partman configs that set up software RAID to also install bootloaders on multiple disks. This means machines are ‘correct’ when freshly imaged.
  2. DCops should have a method to easily re-install a bootloader after swapping a disk on a software RAID machine. This means disk swaps don’t create time bombs.
  3. (stretch goal) As a one-time fleetwide operation, install bootloaders on most RAID members where they are not already present.

Non-goals

  • A from-scratch/from-first-principles rewrite or refactor of our (admittedly incredible number of) partman configs
  • New monitoring infrastructure beyond simple one-off scripts
  • Performing deep modifications to debian-installer
  • Reimaging/reinstalling the fleet en masse
  • Fixing 100% of existing systems (there will be ones that aren't trivial to fix)
  • In general, anything that involves adding new moving parts to production

Plan: goal #1: correctness when freshly imaged

Done: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/490404/

Two of our existing partman configs, ms-be and ms-be-legacy, already set additional debian-installer preseed options that persuade it to install GRUB on both disks of the RAID1 pair:

d-i	grub-installer/bootdev		string	/dev/sda /dev/sdb
# this works around LP #1012629 / Debian #666974
# it makes grub-installer jump to step 2, where it uses bootdev
d-i	grub-installer/only_debian		boolean false

Although Debian #666974 has long been marked as closed, the workaround of setting only_debian to false is still necessary on stretch.

The plan here is straightforward: all partman configs that specify a partman-auto-raid/recipe will be updated to include the above grub-installer stanzas as well, with bootdev set to whatever physical disks are part of the RAID group for /boot or /.

Plan: goal #2: correctness after a drive replacement

The minimum requirement is that DCops has a documented procedure for restoring a bootloader after replacing a disk. I believe this should be as simple as running grub-install /dev/sdX. The plan: add such instructions as a step in the new DCops runbook.
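
As a sketch, assuming a BIOS/MBR system and that the replacement disk came back as /dev/sdb (the device name is purely illustrative), that runbook step might look like:

# Hypothetical example: restore GRUB on a freshly swapped disk, assumed here to be /dev/sdb.
grub-install /dev/sdb
# Sanity check, using the same trick as the footnote: the first 512 bytes should now mention GRUB.
head -c512 /dev/sdb | grep -q GRUB && echo 'GRUB present on sdb'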

A possibility for further work: write a script that performs the necessary mdadm invocations to begin repairing arrays, in addition to invoking grub-install.
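
A very rough sketch of what such a script could do, assuming a simple two-disk RAID1 where /dev/sda is the surviving disk, /dev/sdb is the replacement, partition tables are MBR, and the arrays are md0/md1 (all names here are illustrative, not taken from any real host):

#!/bin/bash
# Illustrative sketch only, not a finished tool.
set -euo pipefail
GOOD=/dev/sda    # assumed surviving disk
NEW=/dev/sdb     # assumed replacement disk
# Copy the partition table from the surviving disk to the replacement (GPT would need sgdisk instead).
sfdisk -d "$GOOD" | sfdisk "$NEW"
# Re-add the new partitions to their arrays; the sdXN -> mdN mapping is assumed here,
# a real script would derive it from /proc/mdstat or mdadm --detail.
mdadm --manage /dev/md0 --add "${NEW}1"
mdadm --manage /dev/md1 --add "${NEW}2"
# Finally, reinstall the bootloader on the replacement disk.
grub-install "$NEW"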

Stretch goal: provide a fast & easy mechanism to boot a rescue GRUB via the PXE menu, for cases where the only bootable disk in a host has failed and the host can no longer boot on its own.

Plan: goal #3: slowly fix up the fleet

In theory, all that is necessary is to invoke grub-install many times.

In practice, it seems inevitable that there will be complications in doing so. It is also an operation that somehow feels risky.

Tentative plan: eventually, execute this across the entire fleet:

If the block device backing /boot is an md device:
  For each of its member partitions sdXN, run grub-install /dev/sdX
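
A minimal shell sketch of that loop (the member-name parsing is an assumption: it handles classic sdXN names only and glosses over NVMe-style names, GPT/BIOS-boot partitions, and so on):

#!/bin/bash
# Sketch only: emit a grub-install for every member disk of the md array backing /boot.
set -euo pipefail
bootdev=$(df /boot | awk 'NR==2 {print $1}')    # e.g. /dev/md0
case "$bootdev" in
  /dev/md*) ;;                                  # only proceed when /boot is md-backed
  *) echo "/boot is not backed by an md device, nothing to do"; exit 0 ;;
esac
md=${bootdev#/dev/}
# Member partitions (e.g. sda1, sdb1) are listed under /sys/block/mdN/slaves/.
for part in /sys/block/"$md"/slaves/*; do
  disk=$(basename "$part" | sed 's/[0-9]*$//')  # sda1 -> sda
  echo grub-install "/dev/$disk"                # drop the echo to actually run it
done

Leaving the echo in place on the first few canary hosts makes it easy to review exactly which devices would be touched before running it for real.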

We won't start with the entire fleet; instead we'll pick a few canary hosts from several different flavors of schemes (RAID1, RAID1 with LVM, RAID10 across many disks; cross-product all of that with MBR vs GPT, etc) and verify that it runs successfully and that the machines can boot off of their other drives.

I do not think it is necessary that we fix 100% of all machines -- upwards of 90% would be great. We should have a recovery process for when the bootloader has gone MIA anyway (see stretch work in goal #2).

Since the overall purpose is to save DCops and ourselves work, we should be willing to abandon this goal if it becomes too time-consuming, or to abandon fixing some subsets of machines.

Known weird stuff

  • Several wdqs hosts have partition tables that do not match their partman files -- their sda1 is a type-0x0b 'Win95 FAT32' partition, and two others have sdb1 as an NTFS partition. This isn't just a case of partition types not matching the actual contents; there are in fact FAT32/NTFS filesystems present. https://phabricator.wikimedia.org/P8077

Footnotes

†: Generated with: cumin -p99 'F:virtual = physical' 'test -b /dev/md0 && (echo md0; head -c512 /dev/sdb|grep -q GRUB && echo sdb || echo nope) || echo no-md0'

Event Timeline

RobH triaged this task as Normal priority. · Feb 4 2019, 6:06 PM
RobH created this task.
Restricted Application added a subscriber: Aklapper. · Feb 4 2019, 6:06 PM
CDanis added a subscriber: CDanis. · Feb 4 2019, 6:51 PM

I know very little about debian-installer, but here's a guess based on what I found in the puppet repo:

% git grep grub-installer/bootdev
modules/install_server/files/autoinstall/common.cfg:d-i grub-installer/bootdev  string  /dev/sda
modules/install_server/files/autoinstall/partman/ms-be-legacy.cfg:d-i   grub-installer/bootdev  string  /dev/sdm /dev/sdn
modules/install_server/files/autoinstall/partman/ms-be.cfg:d-i  grub-installer/bootdev  string  /dev/sda /dev/sdb
modules/install_server/files/autoinstall/virtual.cfg:d-i    grub-installer/bootdev  string default

The particular config used on thumbor2002 was raid1-lvm-ext4-srv.cfg, which -- although it sets up RAID1 between sda and sdb -- does not override the grub-installer/bootdev param from common.cfg.

CDanis added a comment. · Feb 4 2019, 9:15 PM

Assumption 1: the partman-auto-raid directive exactly correlates with our use of Linux software RAID in production.
Assumption 2: in order to have a working grub install on each mirror, software RAID1/10 configs must override grub-installer/bootdev to list all the relevant disks.

If both of those are true, we have a lot of configs to update (35!). Only ms-be and ms-be-legacy seem to set grub-installer/bootdev.

cdanis@cdanis ~/gits/puppet/modules/install_server/files/autoinstall/partman % comm -2 -3 <(grep -l partman-auto-raid *) <(grep -l grub-installer/bootdev *)
aqs-cassandra-8ssd-2srv.cfg
cassandrahosts-3ssd-jbod.cfg
cassandrahosts-4ssd.cfg
cassandrahosts-4ssd-jbod.cfg
cassandrahosts-5ssd.cfg
cassandrahosts-5ssd-jbod.cfg
conf-lvm.cfg
cp2018.cfg
druid-4ssd-raid10.cfg
elasticsearch-raid0.cfg
ganeti-raid1.cfg
graphite.cfg
kubernetes-node.cfg
logstash.cfg
mc.cfg
mw-raid1.cfg
mw-raid1-lvm.cfg
raid0-lvm-srv.cfg
raid10-gpt.cfg
raid10-gpt-srv-ext4.cfg
raid10-gpt-srv-lvm-ext4-6disks.cfg
raid10-gpt-srv-lvm-ext4.cfg
raid10-gpt-srv-lvm-xfs.cfg
raid1-1partition.cfg
raid1-30G.cfg
raid1.cfg
raid1-gpt.cfg
raid1-lvm.cfg
raid1-lvm-conf.cfg
raid1-lvm-ext4-srv.cfg
raid1-lvm-ext4-srv-noswap.cfg
raid1-lvm-xfs-nova.cfg
raid5-gpt-lvm.cfg
varnish.cfg
varnish-oldssd.cfg

RobH added a comment. · Feb 4 2019, 9:58 PM

Please note this is related to T156955.

jijiki added a subscriber: jijiki. · Feb 5 2019, 1:32 PM
CDanis claimed this task. · Feb 12 2019, 1:30 PM
CDanis renamed this task from "sw raid1 doesnt install grub on sdb" to "Redundant bootloaders for software RAID". · Feb 13 2019, 6:26 PM
CDanis updated the task description.

Change 490404 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] partman: grub-install on all RAID{1,10} drives

https://gerrit.wikimedia.org/r/490404

Volans added a subscriber: Volans. · Feb 14 2019, 11:31 AM
CDanis added a subscriber: Joe. (Edited) · Feb 20 2019, 2:22 PM

@Joe made me aware of the existence of partman configs present on install1002 that are not in Puppet.

The good news is that almost all such files are either editor backup files (ending in ~ or .bak), or files once in Puppet but since deleted from git:

cdanis@cdanis ~/gits/puppet/modules/install_server/files/autoinstall/partman % comm -1 -3 \
  <(cat <(ls *.cfg) \
        <(git log --no-renames --diff-filter=D --summary -- . :/modules/install-server/files/autoinstall/partman :files/autoinstall/partman \
          | grep ' *delete mode ' | cut -d/ -f6) \
    | sort | uniq) \
  <(ssh install1002.wikimedia.org ls '/srv/autoinstall/partman/*.cfg' | cut -d/ -f5)

labvirt-ssd-sdb.cfg

That file dates from Oct 2015. The only meaningful difference between it and labvirt-ssd.cfg is that it uses sdb instead of sda for /. There are no labvirt/cloudvirt machines in the fleet for which this looks necessary:

cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -p95 'cloudvirt*,labvirt*' "df /boot | tail -n1 | cut -f1 -d' '"
36 hosts will be targeted:
cloudvirt[2001-2003]-dev.codfw.wmnet,cloudvirt[1009,1012-1030].eqiad.wmnet,cloudvirtan[1001-1005].eqiad.wmnet,labvirt[1001-1008].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                 
(32) cloudvirt[1009,1012-1019,1021-1030].eqiad.wmnet,cloudvirtan[1001-1005].eqiad.wmnet,labvirt[1001-1008].eqiad.wmnet
----- OUTPUT of 'df /boot | tail -n1 | cut -f1 -d' '' -----
/dev/sda1
===== NODE GROUP =====                                                                                                                                                 
(3) cloudvirt[2001-2003]-dev.codfw.wmnet                                                                                                                               
----- OUTPUT of 'df /boot | tail -n1 | cut -f1 -d' '' -----
/dev/md0

(cloudvirt1020 skipped as it currently needs a reimage but should also use sda)

So, all of these files seem to be now-unnecessary cruft.

I am pretty sure we should be setting purge => true on the /srv/autoinstall File object installed by preseed_server.pp.

Change 491756 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] install_server: purge old files from /srv/autoinstall

https://gerrit.wikimedia.org/r/491756

Change 491756 merged by CDanis:
[operations/puppet@production] install_server: purge old files from /srv/autoinstall

https://gerrit.wikimedia.org/r/491756

Change 490404 merged by CDanis:
[operations/puppet@production] partman: grub-install on all RAID{1,10} drives

https://gerrit.wikimedia.org/r/490404

CDanis updated the task description. · Mar 5 2019, 4:45 PM

Buster failed to install on my md install for 2552d12fe15ec1 with "grub install sda sdb failed, cannot install on sda". I cannot be sure it is caused by that change, because this is buster, because there is a weird hybrid software RAID + hardware RAID setup involved, and because it could be a question of using the wrong recipe due to the extra disk, but FYI.

Jaime, if you have a spare host in this hardware configuration, I could try reimaging and gathering logs.

An update: as of now, 20.8% of the fleet (or 30% of hosts with software RAID enabled) have redundant bootloaders. This is just from fixing the partman configs and waiting for reimages to happen 'naturally'. That's about all the work I'm going to do on this for the time being; I figure improvements will continue as services need to move to Buster.

> Jaime, if you have a spare host in this hardware configuration, I could try reimaging and gathering logs.

@CDanis I was able to install it in the end; it was a conflict with another drive (hw RAID, in addition to the sw one) that caused the issues, not the recipe. Sorry for the misreporting.