
Redundant bootloaders for software RAID
Open, Low, Public

Description

Background

72% of the fleet uses software RAID. But only approx. 7.6% of the fleet (or 10.6% of systems with software RAID) have a bootloader installed on multiple disks in the array.† So when sda fails and needs to be replaced, merely performing a disk swap is insufficient to return the system to service. This adds unnecessary toil for DCops and SRE (a recent example: T214813).

Goals

  1. Fix all partman configs that set up software RAID to also install bootloaders on multiple disks. This means machines are ‘correct’ when freshly imaged.
  2. DCops should have a method to easily re-install a bootloader after swapping a disk on a software RAID machine. This means disk swaps don’t create time bombs.
  3. (stretch goal) As a one-time fleetwide operation, install bootloaders on most RAID members where they are not already present.

Non-goals

  • A from-scratch/from-first-principles rewrite or refactor of our (admittedly incredible number of) partman configs
  • New monitoring infrastructure beyond simple one-off scripts
  • Performing deep modifications to debian-installer
  • Reimaging/reinstalling the fleet en masse
  • Fixing 100% of existing systems (there will be ones that aren't trivial to fix)
  • In general, anything that involves adding new moving parts to production

Plan: goal #1: correctness when freshly imaged

Done: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/490404/

Two of our existing partman configs, ms-be and ms-be-legacy, already set additional debian-installer preseed options that persuade it to install GRUB on both disks of the RAID1 pair:

d-i	grub-installer/bootdev		string	/dev/sda /dev/sdb
# this workarounds LP #1012629 / Debian #666974
# it makes grub-installer to jump to step 2, where it uses bootdev
d-i	grub-installer/only_debian		boolean false

Although Debian #666974 has long been marked as closed, the workaround of setting only_debian to false is still necessary on stretch.

The plan here is straightforward: all partman configs that specify a partman-auto-raid/recipe will be updated to include the above grub-installer stanzas as well, with bootdev set to whatever physical disks are part of the RAID group for /boot or /.
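
For example, a plain RAID1 recipe like raid1.cfg would gain a stanza along these lines (the sda/sdb pair here is illustrative and assumes the usual two-disk mirror; recipes whose /boot or / array spans other disks would list those instead):

```
d-i	grub-installer/bootdev	string	/dev/sda /dev/sdb
# works around LP #1012629 / Debian #666974 by making grub-installer
# jump straight to the step where it honors bootdev
d-i	grub-installer/only_debian	boolean	false
```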

Plan: goal #2: correctness after a drive replacement

The minimum requirement is that DCops has a documented procedure for restoring a bootloader after replacing a disk. I believe this should be as simple as running grub-install /dev/sdX. The plan: add these instructions as a step in the new DCops runbook.

A possibility for further work: write a script that performs necessary mdadm invocations to begin repairing arrays in addition to invoking grub-install.
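
A minimal sketch of what such a script's core could look like (a hypothetical helper, not a tested tool: it assumes a single /dev/md0 mirror, classic sdXN partition naming, and that the partition table has already been cloned onto the replacement disk):

```python
#!/usr/bin/env python3
"""Sketch of a disk-swap repair helper for DCops (assumptions above)."""
import re
import subprocess


def disk_of(partition: str) -> str:
    """Map a member partition name like 'sda1' to its whole disk, 'sda'."""
    m = re.fullmatch(r"(sd[a-z]+)\d+", partition)
    if not m:
        raise ValueError(f"unexpected partition name: {partition}")
    return m.group(1)


def repair(new_partition: str, array: str = "/dev/md0") -> None:
    """e.g. repair("sdb1"): re-add /dev/sdb1 to md0, then grub-install /dev/sdb."""
    # Begin resyncing the array onto the replacement partition
    subprocess.run(
        ["mdadm", "--manage", array, "--add", f"/dev/{new_partition}"],
        check=True)
    # Reinstall the bootloader on the whole disk so the host can also
    # boot from the replacement
    subprocess.run(
        ["grub-install", f"/dev/{disk_of(new_partition)}"],
        check=True)
```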

Stretch goal: Provide a fast & easy mechanism to boot a rescue GRUB via PXE menu for cases where the only existing bootable disk for a host has failed and the host is not bootable.

Plan: goal #3: slowly fix up the fleet

In theory, all that is necessary is to invoke grub-install many times.

In practice, complications seem inevitable, and it is an operation that inherently feels risky.

Tentative plan: eventually, execute this across the entire fleet:

If the block device backing /boot is a md device:
  For each of its member partitions sdXN, run grub-install /dev/sdX
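
The loop above could be sketched as follows (a sketch only: it assumes classic /proc/mdstat formatting and sdXN member names, and it produces the grub-install commands rather than executing them; the real script would also need to first confirm that /boot's backing device is in fact an md array):

```python
#!/usr/bin/env python3
"""Sketch of the fleet-wide fix loop (assumptions above)."""
import re


def member_disks(mdstat_line: str) -> list:
    """Whole-disk names from one /proc/mdstat array line, e.g.
    'md0 : active raid1 sdb1[1] sda1[0]' -> ['sda', 'sdb']."""
    return sorted(set(re.findall(r"\b(sd[a-z]+)\d+\[\d+\]", mdstat_line)))


def grub_install_commands(mdstat_line: str) -> list:
    """The commands we would run for one array."""
    return ["grub-install /dev/" + d for d in member_disks(mdstat_line)]
```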

We won't start with the entire fleet; instead we'll pick a few canary hosts from several different flavors of schemes (RAID1, RAID1 with LVM, RAID10 across many disks; cross-product all of that with MBR vs GPT, etc) and verify that it runs successfully and that the machines can boot off of their other drives.

I do not think it is necessary that we fix 100% of all machines -- upwards of 90% would be great. We should have a recovery process for when the bootloader has gone MIA anyway (see stretch work in goal #2).

Since the overall purpose is to save DCops and ourselves work, we should be willing to abandon this goal if it becomes too time-consuming, or to abandon fixing some subsets of machines.

Known weird stuff

  • Several wdqs hosts have partition tables that do not match their partman files -- their sda1 is a type-0x0b 'Win95 FAT32' partition. Two others have sdb1 as an NTFS partition. This isn't just partition types that don't match the actual contents; there are in fact FAT32/NTFS filesystems present. https://phabricator.wikimedia.org/P8077

Footnotes

†: Generated with: cumin -p99 'F:virtual = physical' 'test -b /dev/md0 && (echo md0; head -c512 /dev/sdb|grep -q GRUB && echo sdb || echo nope) || echo no-md0'
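
(The one-liner only checks sdb; a more complete audit could apply the same heuristic to every array member. A sketch of that check -- function names are mine, and it inherits the same assumption as the cumin command, namely that the string 'GRUB' appearing in a disk's first 512 bytes indicates an installed BIOS bootloader:)

```python
#!/usr/bin/env python3
"""Sketch of a per-member bootloader presence check (assumptions above)."""


def has_grub_mbr(first_sector: bytes) -> bool:
    """True if the boot sector carries the GRUB signature string."""
    return b"GRUB" in first_sector[:512]


def check_disk(device: str) -> bool:
    """Read a block device's first sector and look for GRUB."""
    with open(device, "rb") as f:
        return has_grub_mbr(f.read(512))
```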

Event Timeline

RobH triaged this task as Medium priority. Feb 4 2019, 6:06 PM
RobH created this task.

I know very little about debian-installer, but here's a guess based on what I found in the puppet repo:

% git grep grub-installer/bootdev
modules/install_server/files/autoinstall/common.cfg:d-i grub-installer/bootdev  string  /dev/sda
modules/install_server/files/autoinstall/partman/ms-be-legacy.cfg:d-i   grub-installer/bootdev  string  /dev/sdm /dev/sdn
modules/install_server/files/autoinstall/partman/ms-be.cfg:d-i  grub-installer/bootdev  string  /dev/sda /dev/sdb
modules/install_server/files/autoinstall/virtual.cfg:d-i    grub-installer/bootdev  string default

The particular config used on thumbor2002 was raid1-lvm-ext4-srv.cfg, which -- although it sets up RAID1 between sda and sdb -- does not override the grub-installer/bootdev param from common.cfg.

Assumption 1: the partman-auto-raid directive exactly correlates with our use of Linux software RAID in production.
Assumption 2: in order to have a working grub install on each mirror, software RAID1/10 configs must override grub-installer/bootdev to list all the relevant disks.

If both of those are true, we have a lot of configs to update (35!). Only ms-be and ms-be-legacy seem to set grub-installer/bootdev.

cdanis@cdanis ~/gits/puppet/modules/install_server/files/autoinstall/partman % comm -2 -3 <(grep -l partman-auto-raid *) <(grep -l grub-installer/bootdev *)
aqs-cassandra-8ssd-2srv.cfg
cassandrahosts-3ssd-jbod.cfg
cassandrahosts-4ssd.cfg
cassandrahosts-4ssd-jbod.cfg
cassandrahosts-5ssd.cfg
cassandrahosts-5ssd-jbod.cfg
conf-lvm.cfg
cp2018.cfg
druid-4ssd-raid10.cfg
elasticsearch-raid0.cfg
ganeti-raid1.cfg
graphite.cfg
kubernetes-node.cfg
logstash.cfg
mc.cfg
mw-raid1.cfg
mw-raid1-lvm.cfg
raid0-lvm-srv.cfg
raid10-gpt.cfg
raid10-gpt-srv-ext4.cfg
raid10-gpt-srv-lvm-ext4-6disks.cfg
raid10-gpt-srv-lvm-ext4.cfg
raid10-gpt-srv-lvm-xfs.cfg
raid1-1partition.cfg
raid1-30G.cfg
raid1.cfg
raid1-gpt.cfg
raid1-lvm.cfg
raid1-lvm-conf.cfg
raid1-lvm-ext4-srv.cfg
raid1-lvm-ext4-srv-noswap.cfg
raid1-lvm-xfs-nova.cfg
raid5-gpt-lvm.cfg
varnish.cfg
varnish-oldssd.cfg

Please note this is related to T156955.

CDanis renamed this task from sw raid1 doesnt install grub on sdb to Redundant bootloaders for software RAID. Feb 13 2019, 6:26 PM
CDanis updated the task description. (Show Details)

Change 490404 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] partman: grub-install on all RAID{1,10} drives

https://gerrit.wikimedia.org/r/490404

@Joe made me aware of the existence of partman configs present on install1002 that are not in Puppet.

The good news is that almost all such files are either editor backup files (ending in ~ or .bak), or files once in Puppet but since deleted from git:

cdanis@cdanis ~/gits/puppet/modules/install_server/files/autoinstall/partman % comm -1 -3 \
  <(cat <(ls *.cfg) \
        <(git log --no-renames --diff-filter=D --summary -- . :/modules/install-server/files/autoinstall/partman :files/autoinstall/partman \
          | grep ' *delete mode ' | cut -d/ -f6) \
    | sort | uniq) \
  <(ssh install1002.wikimedia.org ls '/srv/autoinstall/partman/*.cfg' | cut -d/ -f5)

labvirt-ssd-sdb.cfg

That file dates from Oct 2015. The only meaningful diff between it and labvirt-ssd.cfg is that it uses sdb instead of sda for /. There are 0 labvirt/cloudvirt machines in the fleet for which this looks required:

cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -p95 'cloudvirt*,labvirt*' "df /boot | tail -n1 | cut -f1 -d' '"
36 hosts will be targeted:
cloudvirt[2001-2003]-dev.codfw.wmnet,cloudvirt[1009,1012-1030].eqiad.wmnet,cloudvirtan[1001-1005].eqiad.wmnet,labvirt[1001-1008].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                 
(32) cloudvirt[1009,1012-1019,1021-1030].eqiad.wmnet,cloudvirtan[1001-1005].eqiad.wmnet,labvirt[1001-1008].eqiad.wmnet
----- OUTPUT of 'df /boot | tail -n1 | cut -f1 -d' '' -----
/dev/sda1
===== NODE GROUP =====                                                                                                                                                 
(3) cloudvirt[2001-2003]-dev.codfw.wmnet                                                                                                                               
----- OUTPUT of 'df /boot | tail -n1 | cut -f1 -d' '' -----
/dev/md0

(cloudvirt1020 skipped as it currently needs a reimage but should also use sda)

So, all of these files seem to be now-unnecessary cruft.

I am pretty sure we should be setting purge => true on the /srv/autoinstall File object installed by preseed_server.pp.
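
For reference, that would look roughly like the following (a sketch; the exact attributes on the existing resource in preseed_server.pp may differ, and force => true might also be needed to purge stray subdirectories):

```
file { '/srv/autoinstall':
  ensure  => directory,
  recurse => true,
  purge   => true,
}
```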

Change 491756 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] install_server: purge old files from /srv/autoinstall

https://gerrit.wikimedia.org/r/491756

Change 491756 merged by CDanis:
[operations/puppet@production] install_server: purge old files from /srv/autoinstall

https://gerrit.wikimedia.org/r/491756

Change 490404 merged by CDanis:
[operations/puppet@production] partman: grub-install on all RAID{1,10} drives

https://gerrit.wikimedia.org/r/490404

Buster failed to install on my md install for 2552d12fe15ec1 with "grub install sda sdb failed, cannot install on sda". I cannot be sure it was caused by that change, since this is buster, there is a weird hybrid software RAID + hardware RAID setup involved, and it could be a matter of using the wrong recipe because of the extra disk. But FYI.

> Buster failed to install on my md install for 2552d12fe15ec1 with "grub install sda sdb failed, cannot install on sda". I cannot be sure it is that change because buster, and because there is weird hybrid software raid + hw raid going on, and it could be a question of using the wrong recipe because of the extra disk, but FYI.

Jaime, if you have a spare host in this hardware configuration, I could try reimaging and gathering logs.

An update: as of now, 20.8% of the fleet (or 30% of hosts with software RAID enabled) have redundant bootloaders. This is just from fixing the partman configs and waiting for reimages to happen 'naturally'. That's about all the work I'm going to do on this for the time being; I figure improvements will continue as services need to move to Buster.

> Jaime, if you have a spare host in this hardware configuration, I could try reimaging and gathering logs.

@CDanis I was able to install it in the end; it was a conflict with another drive (the hw RAID, in addition to the sw one) that caused the issues, not the recipe. Sorry for the misreporting.

> An update: as of now, 20.8% of the fleet (or 30% of hosts with software RAID enabled) have redundant bootloaders. This is just from fixing the partman configs and waiting for reimages to happen 'naturally'. That's about all the work I'm going to do on this for the time being; I figure improvements will continue as services need to move to Buster.

Current stats as of April 27:

  • 1405 physical machines
  • 1 machine broken (mw1280)
  • 357/1405 (~25%) don't use software RAID, so are out of scope
  • 711 machines use software RAID, and have bootloaders on all their drive replicas (51% of fleet, 68% of SW RAID machines)
  • 336 machines use software RAID, but don't have properly replicated bootloaders (24% of fleet, 32% of SW RAID machines)

Here's a list of the bad machines:

an-coord1001.eqiad.wmnet
an-master[1001-1002].eqiad.wmnet
aqs[1004-1009].eqiad.wmnet
backup2002.codfw.wmnet
bast[1002,2002,4002,5001].wikimedia.org
cloudelastic[1001-1004].wikimedia.org
cloudnet[1003-1004].eqiad.wmnet
cloudvirt[2001-2003]-dev.codfw.wmnet
conf[2001-2003].codfw.wmnet
conf[1004-1006].eqiad.wmnet
contint1001.wikimedia.org
cumin2001.codfw.wmnet
cumin1001.eqiad.wmnet
db2093.codfw.wmnet
db1115.eqiad.wmnet
dbproxy[1003,1008,1012-1016].eqiad.wmnet
deploy2001.codfw.wmnet
deploy1001.eqiad.wmnet
druid[1001-1006].eqiad.wmnet
elastic[2025-2047,2049-2054].codfw.wmnet
elastic[1032-1038,1040-1045,1047-1052].eqiad.wmnet
eventlog1002.eqiad.wmnet
flerovium.eqiad.wmnet
furud.codfw.wmnet
ganeti[1001-1004].eqiad.wmnet
graphite2003.codfw.wmnet
graphite1004.eqiad.wmnet
helium.eqiad.wmnet
icinga[1001,2001].wikimedia.org
kubernetes[2001-2004].codfw.wmnet
kubernetes[1001-1004].eqiad.wmnet
kubestage[1001-1002].eqiad.wmnet
labweb[1001-1002].wikimedia.org
logstash[2001-2003].codfw.wmnet
logstash[1010-1012].eqiad.wmnet
maps1004.eqiad.wmnet
ms-fe[2005-2006,2008].codfw.wmnet
ms-fe[1005-1008].eqiad.wmnet
mw[2135-2147,2151-2212,2214,2262].codfw.wmnet
mwlog2001.codfw.wmnet
mwlog1001.eqiad.wmnet
mwmaint2001.codfw.wmnet
mwmaint1002.eqiad.wmnet
netmon[1002,2001].wikimedia.org
notebook[1003-1004].eqiad.wmnet
ores[2001-2009].codfw.wmnet
ores[1001-1009].eqiad.wmnet
oresrdb2002.codfw.wmnet
oresrdb[1001-1002].eqiad.wmnet
rdb[2003-2006].codfw.wmnet
rdb[1005-1006,1009-1010].eqiad.wmnet
relforge1001.eqiad.wmnet
restbase[2013,2015-2018].codfw.wmnet
restbase1016.eqiad.wmnet
scandium.eqiad.wmnet
scb[2001-2002,2005-2006].codfw.wmnet
scb[1001-1004].eqiad.wmnet
sessionstore[2001-2003].codfw.wmnet
sessionstore[1001-1003].eqiad.wmnet
snapshot[1008-1009].eqiad.wmnet
stat[1005-1007].eqiad.wmnet
thorium.eqiad.wmnet
wdqs[2001-2006].codfw.wmnet
wdqs[1003-1010].eqiad.wmnet
weblog1001.eqiad.wmnet
wtp[2001-2020].codfw.wmnet
wtp[1025-1048].eqiad.wmnet

Going to continue to let this linger; natural reimaging activity is solving the problem well.

CDanis lowered the priority of this task from Medium to Low. Apr 27 2020, 6:31 PM

@CDanis backup2002 was recently installed into buster (apparently, wrongly), but it already contains data. Would a simple:

grub-install /dev/sdb

(I am assuming sda already has it) fix the issue?

> @CDanis backup2002 was recently installed into buster (apparently, wrongly), but it already contains data. Would a simple:
>
> grub-install /dev/sdb
>
> (I am assuming sda already has it) fix the issue?

That ought to take care of it, yeah. I confirm it is sdb that is missing grub.

I'm not sure how backup2002 wound up in that state -- AFAIK the partman files should have been fixed since Buster was available, and I see that you reimaged on Apr 15... so it's strange that it happened.

> I'm not sure how backup2002 wound up in that state -- AFAIK the partman files should have been fixed since Buster was available, and I see that you reimaged on Apr 15... so it's strange that it happened.

backup* hosts don't use the new standard partman recipes :-)

Ah, I see, no-srv-format.cfg doesn't set up any SW RAID at all, which means debian-installer of course won't know to install grub on both disks. I'm guessing you set up the RAID for / manually? So then this state is expected.

no-srv-format.cfg is the (wrong) recipe we use for forcing a failure so stateful services don't get accidentally reimaged (as happened once) :-D.

Backup hosts use normally custom/backup-format.cfg when setup for the first time. This is custom because it sets up both a sw RAID and a hw RAID in the same host.

It is true, however, that I installed this one manually as we had an unrelated issue on install (not related to partman, but to dhcp server reimage).

Ah okay. It does look like backup-format.cfg contains the necessary incantations for replicated GRUB.

I've manually corrected backup2002, db1115, and db2093, and will ask Manuel about the proxies; some of those are just being decommissioned or will be reimaged soon. Arguably we will not have many servers affected, because most of ours use hw RAID with a single virtual sda, not md.

Thanks for the help!

Hello people, I found this task after dealing with a failed /dev/sda in a RAID1 array. I thought that I had to run grub-install on /dev/sdb via d-i rescue, but then I noticed that the partman recipe was already fixed and the host had been reimaged recently, so I checked the BIOS. There is an option called Hard disk failover, which was set to Disabled, preventing the host from trying another disk if the first one listed is missing. After setting it to Enabled I was able to boot correctly (before that, the host went directly to PXE every time). I am writing this here to warn other people who might get into my position in the future :D

Is there a way to audit this option and see how many hosts have it set to Disabled? After all this work it seems like something we'd want to keep enabled.

> Is there a way to audit this option and see how many hosts have it set disabled? After all this work it seems something that we'd want to keep enabled..

If there is a way to get/set the parameter via ipmitool, we could add it to Spicerack and create a cookbook. However, from a quick scan of ipmitool I couldn't see a way to get this information.

I agree we should audit it. I think that with redfish API it should be doable, adding @crusnov as they've worked on it last Q.

> I agree we should audit it. I think that with redfish API it should be doable, adding @crusnov as they've worked on it last Q.

Indeed. I don't know if there's a direct Redfish endpoint for it, but that setting does show up in the Server Configuration Profile, for example:

{ "Name": "HddFailover",
  "Value": "Disabled",
  "Set On Import": "True",
  "Comment": "Read and Write" },

This quarter I'll be working on Spicerack support for manipulating these settings, but in the mean time I have some little tools that can manipulate them if needed.

For context, the Server Configuration Profile is a Dell specific interface to setting BIOS settings across the available BIOSen in a particular box. It can be manipulated via Redfish or via other interfaces.
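
For what it's worth, the fleet-wide fix could then plausibly be an SCP import of the same fragment with the value flipped (a guess on my part; I haven't verified the import semantics of "Set On Import" here):

```
{ "Name": "HddFailover",
  "Value": "Enabled",
  "Set On Import": "True",
  "Comment": "Read and Write" },
```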

Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!

(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.)