Redundant bootloaders for software RAID
Open, Low, Public

Description

Background

72% of the fleet uses software RAID. But only approx. 7.6% of the fleet (or 10.6% of systems with software RAID) have a bootloader installed on multiple disks in the array.† So when sda fails and needs to be replaced, merely performing a disk swap is insufficient to return the system to service. This adds unnecessary toil for DCops and SRE (a recent example: T214813).

Goals

  1. Fix all partman configs that set up software RAID to also install bootloaders on multiple disks. This means machines are ‘correct’ when freshly imaged.
  2. DCops should have a method to easily re-install a bootloader after swapping a disk on a software RAID machine. This means disk swaps don’t create time bombs.
  3. (stretch goal) As a one-time fleetwide operation, install bootloaders on most RAID members where they are not already present.

Non-goals

  • A from-scratch/from-first-principles rewrite or refactor of our (admittedly incredible number of) partman configs
  • New monitoring infrastructure beyond simple one-off scripts
  • Performing deep modifications to debian-installer
  • Reimaging/reinstalling the fleet en masse
  • Fixing 100% of existing systems (there will be ones that aren't trivial to fix)
  • In general, anything that involves adding new moving parts to production

Plan: goal #1: correctness when freshly imaged

Done: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/490404/

Two of our existing partman configs, ms-be and ms-be-legacy, already set additional debian-installer preseed options that persuade it to install GRUB on both disks of the RAID1 pair:

d-i	grub-installer/bootdev		string	/dev/sda /dev/sdb
# this works around LP #1012629 / Debian #666974
# it makes grub-installer jump to step 2, where it uses bootdev
d-i	grub-installer/only_debian		boolean false

Although Debian #666974 has long been marked as closed, the workaround of setting only_debian to false is still necessary on stretch.

The plan here is straightforward: all partman configs that specify a partman-auto-raid/recipe will be updated to include the above grub-installer stanzas as well, with bootdev set to whatever physical disks are part of the RAID group for /boot or /.

Plan: goal #2: correctness after a drive replacement

The minimum requirement is that DCops has a documented procedure for restoring a bootloader after replacing a disk. I believe this should be as simple as running grub-install /dev/sdX. The plan: add these instructions as a step in the new DCops runbook.

A possibility for further work: write a script that performs the necessary mdadm invocations to begin repairing arrays, in addition to invoking grub-install.
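A rough sketch of what such a script might emit. Everything here is hypothetical: the device names, the assumption of a simple mirrored layout, and the choice of sfdisk for cloning the partition table (which modern util-linux handles for both MBR and GPT). It only builds the command strings; it runs nothing.

```python
# Hypothetical sketch: given the surviving disk, the replacement disk, and the
# md arrays with their partition numbers, emit the repair commands a DCops
# runbook (or wrapper script) would run. This is not existing WMF tooling.

def repair_commands(survivor, replacement, arrays):
    """arrays: list of (md_device, partition_number) pairs, e.g. [('/dev/md0', 1)]."""
    cmds = [
        # Clone the partition table from the healthy disk onto the new one.
        f"sfdisk -d {survivor} | sfdisk {replacement}",
    ]
    for md, part in arrays:
        # Re-add the matching partition of the new disk to each array.
        cmds.append(f"mdadm {md} --add {replacement}{part}")
    # Finally, make the new disk bootable again.
    cmds.append(f"grub-install {replacement}")
    return cmds

for cmd in repair_commands("/dev/sda", "/dev/sdb", [("/dev/md0", 1), ("/dev/md1", 2)]):
    print(cmd)
```

The real script would need to discover the arrays and partition numbers itself (e.g. from /proc/mdstat) rather than take them as arguments.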

Stretch goal: Provide a fast & easy mechanism to boot a rescue GRUB via PXE menu for cases where the only existing bootable disk for a host has failed and the host is not bootable.

Plan: goal #3: slowly fix up the fleet

In theory, all that is necessary is to invoke grub-install many times.

In practice, it seems inevitable that there will be complications in doing so. It is also an operation that somehow feels risky.

Tentative plan: eventually, execute this across the entire fleet:

If the block device backing /boot is a md device:
  For each of its member partitions sdXN, run grub-install /dev/sdX
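The pseudocode above could look roughly like this. It's a deliberate simplification that only parses /proc/mdstat-style text: it assumes sdX naming (so it ignores NVMe-style names like nvme0n1p1) and does nothing about the device-ordering caveats raised later in this task.

```python
import re

# Sketch of the fleetwide fix: find the whole disks backing a given md device
# by parsing /proc/mdstat-style text, then print the grub-install commands.
# Member entries look like "sda1[0]"; we reduce them to the disk name "sda".

def md_member_disks(mdstat_text, md="md0"):
    for line in mdstat_text.splitlines():
        if line.startswith(md + " "):
            # e.g. "md0 : active raid1 sdb1[1] sda1[0]"
            return sorted(set(re.findall(r"([a-z]+)\d+\[\d+\]", line)))
    return []

sample = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976320 blocks super 1.2 [2/2] [UU]
unused devices: <none>
"""

for disk in md_member_disks(sample):
    print(f"grub-install /dev/{disk}")
```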

We won't start with the entire fleet; instead we'll pick a few canary hosts from several different flavors of schemes (RAID1, RAID1 with LVM, RAID10 across many disks; cross-product all of that with MBR vs GPT, etc) and verify that it runs successfully and that the machines can boot off of their other drives.

I do not think it is necessary that we fix 100% of all machines -- upwards of 90% would be great. We should have a recovery process for when the bootloader has gone MIA anyway (see stretch work in goal #2).

Since the overall purpose is to save DCops and ourselves work, we should be willing to abandon this goal if it becomes too time-consuming, or to abandon fixing some subsets of machines.

Known weird stuff

  • Several wdqs hosts have partition tables that do not match their partman files -- their sda1 is a type-0x0b 'Win95 FAT32' partition. Two others have sdb1 as a NTFS partition. This isn't just partition types that don't match the actual contents; there are in fact FAT32/NTFS filesystems present. https://phabricator.wikimedia.org/P8077

Footnotes

†: Generated with: cumin -p99 'F:virtual = physical' 'test -b /dev/md0 && (echo md0; head -c512 /dev/sdb|grep -q GRUB && echo sdb || echo nope) || echo no-md0'

Event Timeline

RobH triaged this task as Medium priority.

I know very little about debian-installer, but here's a guess based on what I found in the puppet repo:

% git grep grub-installer/bootdev
modules/install_server/files/autoinstall/common.cfg:d-i grub-installer/bootdev  string  /dev/sda
modules/install_server/files/autoinstall/partman/ms-be-legacy.cfg:d-i   grub-installer/bootdev  string  /dev/sdm /dev/sdn
modules/install_server/files/autoinstall/partman/ms-be.cfg:d-i  grub-installer/bootdev  string  /dev/sda /dev/sdb
modules/install_server/files/autoinstall/virtual.cfg:d-i    grub-installer/bootdev  string default

The particular config used on thumbor2002 was raid1-lvm-ext4-srv.cfg, which -- although it sets up RAID1 between sda and sdb -- does not override the grub-installer/bootdev param from common.cfg.

Assumption 1: the partman-auto-raid directive exactly correlates with our use of Linux software RAID in production.
Assumption 2: in order to have a working grub install on each mirror, software RAID1/10 configs must override grub-installer/bootdev to list all the relevant disks.

If both of those are true, we have a lot of configs to update (35!). Only ms-be and ms-be-legacy seem to set grub-installer/bootdev.

cdanis@cdanis ~/gits/puppet/modules/install_server/files/autoinstall/partman % comm -2 -3 <(grep -l partman-auto-raid *) <(grep -l grub-installer/bootdev *)
aqs-cassandra-8ssd-2srv.cfg
cassandrahosts-3ssd-jbod.cfg
cassandrahosts-4ssd.cfg
cassandrahosts-4ssd-jbod.cfg
cassandrahosts-5ssd.cfg
cassandrahosts-5ssd-jbod.cfg
conf-lvm.cfg
cp2018.cfg
druid-4ssd-raid10.cfg
elasticsearch-raid0.cfg
ganeti-raid1.cfg
graphite.cfg
kubernetes-node.cfg
logstash.cfg
mc.cfg
mw-raid1.cfg
mw-raid1-lvm.cfg
raid0-lvm-srv.cfg
raid10-gpt.cfg
raid10-gpt-srv-ext4.cfg
raid10-gpt-srv-lvm-ext4-6disks.cfg
raid10-gpt-srv-lvm-ext4.cfg
raid10-gpt-srv-lvm-xfs.cfg
raid1-1partition.cfg
raid1-30G.cfg
raid1.cfg
raid1-gpt.cfg
raid1-lvm.cfg
raid1-lvm-conf.cfg
raid1-lvm-ext4-srv.cfg
raid1-lvm-ext4-srv-noswap.cfg
raid1-lvm-xfs-nova.cfg
raid5-gpt-lvm.cfg
varnish.cfg
varnish-oldssd.cfg

Please note this is related to T156955.

CDanis renamed this task from sw raid1 doesnt install grub on sdb to Redundant bootloaders for software RAID.Feb 13 2019, 6:26 PM
CDanis updated the task description. (Show Details)

Change 490404 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] partman: grub-install on all RAID{1,10} drives

https://gerrit.wikimedia.org/r/490404

@Joe made me aware of the existence of partman configs present on install1002 that are not in Puppet.

The good news is that almost all such files are either editor backup files (ending in ~ or .bak), or files once in Puppet but since deleted from git:

cdanis@cdanis ~/gits/puppet/modules/install_server/files/autoinstall/partman % comm -1 -3 \
  <(cat <(ls *.cfg) \
        <(git log --no-renames --diff-filter=D --summary -- . :/modules/install-server/files/autoinstall/partman :files/autoinstall/partman \
          | grep ' *delete mode ' | cut -d/ -f6) \
    | sort | uniq) \
  <(ssh install1002.wikimedia.org ls '/srv/autoinstall/partman/*.cfg' | cut -d/ -f5)

labvirt-ssd-sdb.cfg

That file dates from Oct 2015. The only meaningful diff between it and labvirt-ssd.cfg is that it uses sdb instead of sda for /. There are 0 labvirt/cloudvirt machines in the fleet for which this looks required:

cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -p95 'cloudvirt*,labvirt*' "df /boot | tail -n1 | cut -f1 -d' '"
36 hosts will be targeted:
cloudvirt[2001-2003]-dev.codfw.wmnet,cloudvirt[1009,1012-1030].eqiad.wmnet,cloudvirtan[1001-1005].eqiad.wmnet,labvirt[1001-1008].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                 
(32) cloudvirt[1009,1012-1019,1021-1030].eqiad.wmnet,cloudvirtan[1001-1005].eqiad.wmnet,labvirt[1001-1008].eqiad.wmnet
----- OUTPUT of 'df /boot | tail -n1 | cut -f1 -d' '' -----
/dev/sda1
===== NODE GROUP =====                                                                                                                                                 
(3) cloudvirt[2001-2003]-dev.codfw.wmnet                                                                                                                               
----- OUTPUT of 'df /boot | tail -n1 | cut -f1 -d' '' -----
/dev/md0

(cloudvirt1020 skipped as it currently needs a reimage but should also use sda)

So, all of these files seem to be now-unnecessary cruft.

I am pretty sure we should be setting purge => true on the /srv/autoinstall File object installed by preseed_server.pp.

Change 491756 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] install_server: purge old files from /srv/autoinstall

https://gerrit.wikimedia.org/r/491756

Change 491756 merged by CDanis:
[operations/puppet@production] install_server: purge old files from /srv/autoinstall

https://gerrit.wikimedia.org/r/491756

Change 490404 merged by CDanis:
[operations/puppet@production] partman: grub-install on all RAID{1,10} drives

https://gerrit.wikimedia.org/r/490404

Buster failed to install on my md install for 2552d12fe15ec1 with "grub install sda sdb failed, cannot install on sda". I cannot be sure it is that change, because this is buster, there is a weird hybrid software RAID + hw RAID setup going on, and it could be a question of using the wrong recipe because of the extra disk, but FYI.

> Buster failed to install on my md install for 2552d12fe15ec1 with "grub install sda sdb failed, cannot install on sda". I cannot be sure it is that change because buster, and because there is weird hybrid software raid + hw raid going on, and it could be a question of using the wrong recipe because of the extra disk, but FYI.

Jaime, if you have a spare host in this hardware configuration, I could try reimaging and gathering logs.

An update: as of now, 20.8% of the fleet (or 30% of hosts with software RAID enabled) have redundant bootloaders. This is just from fixing the partman configs and waiting for reimages to happen 'naturally'. That's about all the work I'm going to do on this for the time being; I figure improvements will continue as services need to move to Buster.

> Jaime, if you have a spare host in this hardware configuration, I could try reimaging and gathering logs.

@CDanis I was able to install it in the end; it was a conflict with another drive (hw RAID, in addition to the sw one) that caused the issues, not the recipe. Sorry for the misreporting.

Current stats as of April 27:

  • 1405 physical machines
  • 1 machine broken (mw1280)
  • 357/1405 (~25%) don't use software RAID, so are out of scope
  • 711 machines use software RAID, and have bootloaders on all their drive replicas (51% of fleet, 68% of SW RAID machines)
  • 336 machines use software RAID, but don't have properly replicated bootloaders (24% of fleet, 32% of SW RAID machines)

Here's a list of the bad machines:

an-coord1001.eqiad.wmnet
an-master[1001-1002].eqiad.wmnet
aqs[1004-1009].eqiad.wmnet
backup2002.codfw.wmnet
bast[1002,2002,4002,5001].wikimedia.org
cloudelastic[1001-1004].wikimedia.org
cloudnet[1003-1004].eqiad.wmnet
cloudvirt[2001-2003]-dev.codfw.wmnet
conf[2001-2003].codfw.wmnet
conf[1004-1006].eqiad.wmnet
contint1001.wikimedia.org
cumin2001.codfw.wmnet
cumin1001.eqiad.wmnet
db2093.codfw.wmnet
db1115.eqiad.wmnet
dbproxy[1003,1008,1012-1016].eqiad.wmnet
deploy2001.codfw.wmnet
deploy1001.eqiad.wmnet
druid[1001-1006].eqiad.wmnet
elastic[2025-2047,2049-2054].codfw.wmnet
elastic[1032-1038,1040-1045,1047-1052].eqiad.wmnet
eventlog1002.eqiad.wmnet
flerovium.eqiad.wmnet
furud.codfw.wmnet
ganeti[1001-1004].eqiad.wmnet
graphite2003.codfw.wmnet
graphite1004.eqiad.wmnet
helium.eqiad.wmnet
icinga[1001,2001].wikimedia.org
kubernetes[2001-2004].codfw.wmnet
kubernetes[1001-1004].eqiad.wmnet
kubestage[1001-1002].eqiad.wmnet
labweb[1001-1002].wikimedia.org
logstash[2001-2003].codfw.wmnet
logstash[1010-1012].eqiad.wmnet
maps1004.eqiad.wmnet
ms-fe[2005-2006,2008].codfw.wmnet
ms-fe[1005-1008].eqiad.wmnet
mw[2135-2147,2151-2212,2214,2262].codfw.wmnet
mwlog2001.codfw.wmnet
mwlog1001.eqiad.wmnet
mwmaint2001.codfw.wmnet
mwmaint1002.eqiad.wmnet
netmon[1002,2001].wikimedia.org
notebook[1003-1004].eqiad.wmnet
ores[2001-2009].codfw.wmnet
ores[1001-1009].eqiad.wmnet
oresrdb2002.codfw.wmnet
oresrdb[1001-1002].eqiad.wmnet
rdb[2003-2006].codfw.wmnet
rdb[1005-1006,1009-1010].eqiad.wmnet
relforge1001.eqiad.wmnet
restbase[2013,2015-2018].codfw.wmnet
restbase1016.eqiad.wmnet
scandium.eqiad.wmnet
scb[2001-2002,2005-2006].codfw.wmnet
scb[1001-1004].eqiad.wmnet
sessionstore[2001-2003].codfw.wmnet
sessionstore[1001-1003].eqiad.wmnet
snapshot[1008-1009].eqiad.wmnet
stat[1005-1007].eqiad.wmnet
thorium.eqiad.wmnet
wdqs[2001-2006].codfw.wmnet
wdqs[1003-1010].eqiad.wmnet
weblog1001.eqiad.wmnet
wtp[2001-2020].codfw.wmnet
wtp[1025-1048].eqiad.wmnet

Going to continue to let this linger; natural reimaging activity is solving the problem well.

CDanis lowered the priority of this task from Medium to Low.Apr 27 2020, 6:31 PM

@CDanis backup2002 was recently installed into buster (apparently, wrongly), but it already contains data. Would a simple:

grub-install /dev/sdb

(I am assuming sda already has it) fix the issue?

> @CDanis backup2002 was recently installed into buster (apparently, wrongly), but it already contains data. Would a simple:
>
> grub-install /dev/sdb
>
> (I am assuming sda already has it) fix the issue?

That ought to take care of it, yeah. I confirm it is sdb that is missing grub.

I'm not sure how backup2002 wound up in that state -- AFAIK the partman files should have been fixed since Buster was available, and I see that you reimaged on Apr 15... so it's strange that it happened.

> I'm not sure how backup2002 wound up in that state -- AFAIK the partman files should have been fixed since Buster was available, and I see that you reimaged on Apr 15... so it's strange that it happened.

backup* hosts don't use the new standard partman recipes :-)

Ah, I see, no-srv-format.cfg doesn't set up any SW RAID at all, which means debian-installer of course won't know to install grub on both disks. I'm guessing you set up the RAID for / manually? So then this state is expected.

no-srv-format.cfg is the (wrong) recipe we use for forcing a failure so stateful services don't get accidentally reimaged (as it happened once) :-D.

Backup hosts use normally custom/backup-format.cfg when setup for the first time. This is custom because it sets up both a sw RAID and a hw RAID in the same host.

It is true, however, that I installed this one manually as we had an unrelated issue on install (not related to partman, but to dhcp server reimage).

Ah okay. It does look like backup-format.cfg contains the necessary incantations for replicated GRUB.

I've manually corrected backup2002, db1115, and db2093, and will ask Manuel about the proxies; some of those are just being decommissioned or will be reimaged soon. Arguably we will not have many servers affected, because most of ours use hw RAID with a single virtual sda, not md.

Thanks for the help!

Hello people, I found this task after dealing with a failed /dev/sda in a RAID1 array. I thought that I had to run grub-install on /dev/sdb via d-i rescue, but then I noticed that the partman recipe was already fixed and the host had been reimaged recently, so I checked the BIOS. There is an option called Hard disk failover, which was set to Disabled, preventing the host from trying another disk if the first one listed is missing. After setting it to Enabled I was able to boot correctly (before that, the host went directly to PXE every time). I am writing this here to warn other people who might get into my position in the future :D

Is there a way to audit this option and see how many hosts have it set disabled? After all this work it seems something that we'd want to keep enabled..

> Is there a way to audit this option and see how many hosts have it set disabled? After all this work it seems something that we'd want to keep enabled..

If there is a way to get/set the parameter via ipmitool, we could add it to Spicerack and create a cookbook. However, from a quick scan of ipmitool I couldn't see a way to get this information.

I agree we should audit it. I think that with the Redfish API it should be doable; adding @crusnov, as they worked on it last quarter.

> I agree we should audit it. I think that with redfish API it should be doable, adding @crusnov as they've worked on it last Q.

Indeed. I don't know if there's a direct Redfish endpoint for it, but that setting does show up in the Server Configuration Profile, for example:

{ "Name": "HddFailover",
  "Value": "Disabled",
  "Set On Import": "True",
  "Comment": "Read and Write" },

This quarter I'll be working on Spicerack support for manipulating these settings, but in the mean time I have some little tools that can manipulate them if needed.

For context, the Server Configuration Profile is a Dell specific interface to setting BIOS settings across the available BIOSen in a particular box. It can be manipulated via Redfish or via other interfaces.
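For the audit itself, one approach would be to export each host's Server Configuration Profile and look up the attribute. Here is a sketch of just the parsing half, assuming the usual Dell SCP JSON shape shown above; the Redfish export mechanics and the exact BIOS component FQDD are omitted, and scp_attribute is a hypothetical helper, not Spicerack code.

```python
import json

# Hypothetical audit helper: given a Dell Server Configuration Profile (SCP)
# export as JSON text, return the value of an attribute such as "HddFailover".
# Assumed layout (an assumption, verify against a real export):
# {"SystemConfiguration": {"Components": [{"FQDD": ..., "Attributes": [...]}]}}

def scp_attribute(scp_json_text, name):
    scp = json.loads(scp_json_text)
    for component in scp["SystemConfiguration"]["Components"]:
        for attr in component.get("Attributes", []):
            if attr.get("Name") == name:
                return attr.get("Value")
    return None  # attribute not present in this profile

sample = json.dumps({
    "SystemConfiguration": {
        "Components": [
            {"FQDD": "BIOS.Setup.1-1",
             "Attributes": [{"Name": "HddFailover", "Value": "Disabled"}]}
        ]
    }
})
print(scp_attribute(sample, "HddFailover"))
```

A fleet audit would then just be: fetch the SCP from each iDRAC, run this over it, and collect the hosts where the value is "Disabled".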

Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!

(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.)

Has there been any progress toward goal #2? I didn't see where anything had been added to the mentioned runbook.

For context: We replaced sda in aqs1012 recently (T396970) and were (I believe) bit by this issue. It would seem to have been reimaged since the partman recipe was fixed, and it does not appear in the April 2020 list posted in T215183#6086396, so I'm wondering if a prior replacement didn't get the bootloader installed.

As a follow-up, I did find a device with a missing bootloader: aqs1014, which went up after its partman recipe was fixed (it has had SSDs replaced in the years since, though)

Something else worth pointing out. Given how fast-and-loose Linux plays with device ordering, this probably isn't going to work in every case.

And not all hosts are booting from a RAID1; the aqs cluster at least uses a RAID10, which, combined with the variability of device ordering, makes it trickier still to find the candidates.

> Has there been any progress toward goal #2? I didn't see where anything had been added to the mentioned runbook.

Good question, I think probably not. @RobH or @wiki_willy is this on your radar?

> For context: We replaced sda in aqs1012 recently (T396970) and were (I believe) bit by this issue. It would seem to have been reimaged since the partman recipe was fixed, and it does not appear in the April 2020 list posted in T215183#6086396, so I'm wondering if a prior replacement didn't get the bootloader installed.

@Eevans, can you check in the BIOS settings of aqs1012 to see if a setting like "Hard drive failover" exists, per T215183#6718961 ?

I also never spent much time looking at or thinking about RAID10 hosts, as you said. Honestly I don't remember what debian-installer does in the first place for RAID10 and bootloaders.

> [ ... ]
>
> > For context: We replaced sda in aqs1012 recently (T396970) and were (I believe) bit by this issue. It would seem to have been reimaged since the partman recipe was fixed, and it does not appear in the April 2020 list posted in T215183#6086396, so I'm wondering if a prior replacement didn't get the bootloader installed.
>
> @Eevans, can you check in the BIOS settings of aqs1012 to see if a setting like "Hard drive failover" exists, per T215183#6718961 ?

It does, and I set it enabled after the device had been replaced and wouldn't boot (hint: it still didn't :( ).

aqs1014 was missing a bootloader for sdb as well (I've since fixed that), so if it had subsequently lost sda (like aqs1012), I think it would have manifested exactly the same.

> [ ... ]
>
> I also never spent much time looking at or thinking about RAID10 hosts, as you said. Honestly I don't remember what debian-installer does in the first place for RAID10 and bootloaders.

This config uses modules/install_server/files/autoinstall/partman/custom/aqs-cassandra-8ssd-2srv.cfg which you fixed back in 2019, it should be using all of /dev/sd[a-d]. For most of the aqs hosts that would seem to be the case, too. Now I'm wondering what the other 6 devices (it's supposed to have 8, but one had been replaced, and another failed) looked like. You would think that even with generous reordering, chances would be very good you'd find at least one bootloader among the first 4!

> Has there been any progress toward goal #2? I didn't see where anything had been added to the mentioned runbook.
>
> For context: We replaced sda in aqs1012 recently (T396970) and were (I believe) bit by this issue. It would seem to have been reimaged since the partman recipe was fixed, and it does not appear in the April 2020 list posted in T215183#6086396, so I'm wondering if a prior replacement didn't get the bootloader installed.

Answering myself here: I see now that T220842 is a subtask, and that it has been closed as resolved (with https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Sw_raid_rebuild_directions cited as the result), but I don't see anything there with respect to installing the bootloader. I'm happy to take a stab at adding something —assuming this is the canonical location, and that we're in agreement on what should be done..

Ok, so in an attempt to summarize things:

  • It seems that goal no. 1 is complete, all partman preseeds have been updated
  • Goal no. 3 might need to be revisited with a mind toward all of the hosts that have been reimaged, but have since had disks replaced, and didn't get a bootloader (re)installed. This is Real™, I've already found examples during spot-checking. Anyone doing so should be aware that the cumin copypasta in the description is only an approximation (for example it assumes raid1, and it doesn't take into account any reordering of devices)
  • Bonus: bulk update the fleet for the BIOS setting that enables boot device failover
  • Last (but definitely not least), make sure that the runbook prominently covers the installation of a bootloader when an applicable device is replaced. It probably also makes sense to document the BIOS setting in this context (see point above)

Thinking out loud here, but I wonder if there would be any harm in putting a bootloader on all storage devices? If not, it would be pretty simple to put together a cookbook that iterated storage devices and invoked grub-install $dev for each.
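A sketch of how such a cookbook might pick its targets, reusing the footnote's heuristic (the string "GRUB" appears in the first 512 bytes of a grub-install'd MBR). Reading the real /dev/sdX devices is abstracted behind a dict so the selection logic can be reasoned about offline; none of this is existing Spicerack code, and the heuristic shares the footnote's limitations (it won't cover pure UEFI layouts, for instance).

```python
# Hypothetical cookbook sketch: decide which storage devices still need a
# bootloader, using the same b"GRUB"-in-the-first-sector heuristic as the
# cumin one-liner in the task description.

def has_grub(first_sector: bytes) -> bool:
    # grub-install leaves the string "GRUB" in the MBR boot code.
    return b"GRUB" in first_sector[:512]

def devices_needing_grub(first_sectors: dict) -> list:
    """first_sectors maps device path -> first 512 bytes read from it."""
    return sorted(dev for dev, sector in first_sectors.items()
                  if not has_grub(sector))

# Example with fake sectors: sda has GRUB installed, sdb (freshly swapped)
# does not and would get a grub-install.
sectors = {
    "/dev/sda": b"\x00" * 100 + b"GRUB" + b"\x00" * 408,
    "/dev/sdb": b"\x00" * 512,
}
for dev in devices_needing_grub(sectors):
    print(f"grub-install {dev}")
```

The cookbook proper would enumerate the disks (e.g. from lsblk), read the first sector of each, and only then shell out to grub-install for the ones this filter returns.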