Redundant bootloaders for software RAID
Open, Low, Public

Description

Background

72% of the fleet uses software RAID. But only approx. 7.6% of the fleet (or 10.6% of systems with software RAID) have a bootloader installed on multiple disks in the array.† So when sda fails and needs to be replaced, merely performing a disk swap is insufficient to return the system to service. This adds unnecessary toil for DCops and SRE (a recent example: T214813).

Goals

  1. Fix all partman configs that set up software RAID to also install bootloaders on multiple disks. This means machines are ‘correct’ when freshly imaged.
  2. DCops should have a method to easily re-install a bootloader after swapping a disk on a software RAID machine. This means disk swaps don’t create time bombs.
  3. (stretch goal) As a one-time fleetwide operation, install bootloaders on most RAID members where they are not already present.

Non-goals

  • A from-scratch/from-first-principles rewrite or refactor of our (admittedly incredible number of) partman configs
  • New monitoring infrastructure beyond simple one-off scripts
  • Performing deep modifications to debian-installer
  • Reimaging/reinstalling the fleet en masse
  • Fixing 100% of existing systems (there will be ones that aren't trivial to fix)
  • In general, anything that involves adding new moving parts to production

Plan: goal #1: correctness when freshly imaged

Done: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/490404/

Two of our existing partman configs, ms-be and ms-be-legacy, already set additional debian-installer preseed options that persuade it to install GRUB on both disks of the RAID1 pair:

d-i	grub-installer/bootdev		string	/dev/sda /dev/sdb
# this works around LP #1012629 / Debian #666974
# it makes grub-installer jump to step 2, where it uses bootdev
d-i	grub-installer/only_debian		boolean false

Although Debian #666974 has long been marked as closed, the workaround of setting only_debian to false is still necessary on stretch.

The plan here is straightforward: all partman configs that specify a partman-auto-raid/recipe will be updated to include the above grub-installer stanzas as well, with bootdev set to whatever physical disks are part of the RAID group for /boot or /.

Plan: goal #2: correctness after a drive replacement

The minimum requirement is that DCops has a documented procedure for restoring a bootloader after replacing a disk. I believe this should be as simple as running grub-install /dev/sdX. The plan: add these instructions as a step in the new DCops runbook.

A possibility for further work: write a script that performs the necessary mdadm invocations to begin repairing arrays, in addition to invoking grub-install.
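A rough sketch of what such a script might emit. Everything here is hypothetical: the device names, the assumption of a simple mirrored layout, and the choice of sfdisk for cloning the partition table (which modern util-linux handles for both MBR and GPT). It only builds the command strings; it runs nothing.

```python
# Hypothetical sketch: given the surviving disk, the replacement disk, and the
# md arrays with their partition numbers, emit the repair commands a DCops
# runbook (or wrapper script) would run. This is not existing WMF tooling.

def repair_commands(survivor, replacement, arrays):
    """arrays: list of (md_device, partition_number) pairs, e.g. [('/dev/md0', 1)]."""
    cmds = [
        # Clone the partition table from the healthy disk onto the new one.
        f"sfdisk -d {survivor} | sfdisk {replacement}",
    ]
    for md, part in arrays:
        # Re-add the matching partition of the new disk to each array.
        cmds.append(f"mdadm {md} --add {replacement}{part}")
    # Finally, make the new disk bootable again.
    cmds.append(f"grub-install {replacement}")
    return cmds

for cmd in repair_commands("/dev/sda", "/dev/sdb", [("/dev/md0", 1), ("/dev/md1", 2)]):
    print(cmd)
```

The real script would need to discover the arrays and partition numbers itself (e.g. from /proc/mdstat) rather than take them as arguments.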

Stretch goal: Provide a fast & easy mechanism to boot a rescue GRUB via PXE menu for cases where the only existing bootable disk for a host has failed and the host is not bootable.

Plan: goal #3: slowly fix up the fleet

In theory, all that is necessary is to invoke grub-install many times.

In practice, it seems inevitable that there will be complications in doing so. It is also an operation that somehow feels risky.

Tentative plan: eventually, execute this across the entire fleet:

If the block device backing /boot is a md device:
  For each of its member partitions sdXN, run grub-install /dev/sdX
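The pseudocode above could look roughly like this. It's a deliberate simplification that only parses /proc/mdstat-style text: it assumes sdX naming (so it ignores NVMe-style names like nvme0n1p1) and does nothing about the device-ordering caveats raised later in this task.

```python
import re

# Sketch of the fleetwide fix: find the whole disks backing a given md device
# by parsing /proc/mdstat-style text, then print the grub-install commands.
# Member entries look like "sda1[0]"; we reduce them to the disk name "sda".

def md_member_disks(mdstat_text, md="md0"):
    for line in mdstat_text.splitlines():
        if line.startswith(md + " "):
            # e.g. "md0 : active raid1 sdb1[1] sda1[0]"
            return sorted(set(re.findall(r"([a-z]+)\d+\[\d+\]", line)))
    return []

sample = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976320 blocks super 1.2 [2/2] [UU]
unused devices: <none>
"""

for disk in md_member_disks(sample):
    print(f"grub-install /dev/{disk}")
```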

We won't start with the entire fleet; instead we'll pick a few canary hosts from several different flavors of schemes (RAID1, RAID1 with LVM, RAID10 across many disks; cross-product all of that with MBR vs GPT, etc) and verify that it runs successfully and that the machines can boot off of their other drives.

I do not think it is necessary that we fix 100% of all machines -- upwards of 90% would be great. We should have a recovery process for when the bootloader has gone MIA anyway (see stretch work in goal #2).

Since the overall purpose is to save DCops and ourselves work, we should be willing to abandon this goal if it becomes too time-consuming, or to abandon fixing some subsets of machines.

Known weird stuff

  • Several wdqs hosts have partition tables that do not match their partman files -- their sda1 is a type-0x0b 'Win95 FAT32' partition. Two others have sdb1 as a NTFS partition. This isn't just partition types that don't match the actual contents; there are in fact FAT32/NTFS filesystems present. https://phabricator.wikimedia.org/P8077

Footnotes

†: Generated with: cumin -p99 'F:virtual = physical' 'test -b /dev/md0 && (echo md0; head -c512 /dev/sdb|grep -q GRUB && echo sdb || echo nope) || echo no-md0'

Event Timeline

RobH triaged this task as Medium priority.

I know very little about debian-installer, but here's a guess based on what I found in the puppet repo:

% git grep grub-installer/bootdev
modules/install_server/files/autoinstall/common.cfg:d-i grub-installer/bootdev  string  /dev/sda
modules/install_server/files/autoinstall/partman/ms-be-legacy.cfg:d-i   grub-installer/bootdev  string  /dev/sdm /dev/sdn
modules/install_server/files/autoinstall/partman/ms-be.cfg:d-i  grub-installer/bootdev  string  /dev/sda /dev/sdb
modules/install_server/files/autoinstall/virtual.cfg:d-i    grub-installer/bootdev  string default

The particular config used on thumbor2002 was raid1-lvm-ext4-srv.cfg, which -- although it sets up RAID1 between sda and sdb -- does not override the grub-installer/bootdev param from common.cfg.

Assumption 1: the partman-auto-raid directive exactly correlates with our use of Linux software RAID in production.
Assumption 2: in order to have a working grub install on each mirror, software RAID1/10 configs must override grub-installer/bootdev to list all the relevant disks.

If both of those are true, we have a lot of configs to update (35!). Only ms-be and ms-be-legacy seem to set grub-installer/bootdev.

cdanis@cdanis ~/gits/puppet/modules/install_server/files/autoinstall/partman % comm -2 -3 <(grep -l partman-auto-raid *) <(grep -l grub-installer/bootdev *)
aqs-cassandra-8ssd-2srv.cfg
cassandrahosts-3ssd-jbod.cfg
cassandrahosts-4ssd.cfg
cassandrahosts-4ssd-jbod.cfg
cassandrahosts-5ssd.cfg
cassandrahosts-5ssd-jbod.cfg
conf-lvm.cfg
cp2018.cfg
druid-4ssd-raid10.cfg
elasticsearch-raid0.cfg
ganeti-raid1.cfg
graphite.cfg
kubernetes-node.cfg
logstash.cfg
mc.cfg
mw-raid1.cfg
mw-raid1-lvm.cfg
raid0-lvm-srv.cfg
raid10-gpt.cfg
raid10-gpt-srv-ext4.cfg
raid10-gpt-srv-lvm-ext4-6disks.cfg
raid10-gpt-srv-lvm-ext4.cfg
raid10-gpt-srv-lvm-xfs.cfg
raid1-1partition.cfg
raid1-30G.cfg
raid1.cfg
raid1-gpt.cfg
raid1-lvm.cfg
raid1-lvm-conf.cfg
raid1-lvm-ext4-srv.cfg
raid1-lvm-ext4-srv-noswap.cfg
raid1-lvm-xfs-nova.cfg
raid5-gpt-lvm.cfg
varnish.cfg
varnish-oldssd.cfg

Please note this is related to T156955.

CDanis renamed this task from sw raid1 doesnt install grub on sdb to Redundant bootloaders for software RAID.Feb 13 2019, 6:26 PM
CDanis updated the task description. (Show Details)

Change 490404 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] partman: grub-install on all RAID{1,10} drives

https://gerrit.wikimedia.org/r/490404

@Joe made me aware of the existence of partman configs present on install1002 that are not in Puppet.

The good news is that almost all such files are either editor backup files (ending in ~ or .bak), or files once in Puppet but since deleted from git:

cdanis@cdanis ~/gits/puppet/modules/install_server/files/autoinstall/partman % comm -1 -3 \
  <(cat <(ls *.cfg) \
        <(git log --no-renames --diff-filter=D --summary -- . :/modules/install-server/files/autoinstall/partman :files/autoinstall/partman \
          | grep ' *delete mode ' | cut -d/ -f6) \
    | sort | uniq) \
  <(ssh install1002.wikimedia.org ls '/srv/autoinstall/partman/*.cfg' | cut -d/ -f5)

labvirt-ssd-sdb.cfg

That file dates from Oct 2015. The only meaningful diff between it and labvirt-ssd.cfg is that it uses sdb instead of sda for /. There are 0 labvirt/cloudvirt machines in the fleet for which this looks required:

cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -p95 'cloudvirt*,labvirt*' "df /boot | tail -n1 | cut -f1 -d' '"
36 hosts will be targeted:
cloudvirt[2001-2003]-dev.codfw.wmnet,cloudvirt[1009,1012-1030].eqiad.wmnet,cloudvirtan[1001-1005].eqiad.wmnet,labvirt[1001-1008].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                 
(32) cloudvirt[1009,1012-1019,1021-1030].eqiad.wmnet,cloudvirtan[1001-1005].eqiad.wmnet,labvirt[1001-1008].eqiad.wmnet
----- OUTPUT of 'df /boot | tail -n1 | cut -f1 -d' '' -----
/dev/sda1
===== NODE GROUP =====                                                                                                                                                 
(3) cloudvirt[2001-2003]-dev.codfw.wmnet                                                                                                                               
----- OUTPUT of 'df /boot | tail -n1 | cut -f1 -d' '' -----
/dev/md0

(cloudvirt1020 skipped as it currently needs a reimage but should also use sda)

So, all of these files seem to be now-unnecessary cruft.

I am pretty sure we should be setting purge => true on the /srv/autoinstall File object installed by preseed_server.pp.

Change 491756 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] install_server: purge old files from /srv/autoinstall

https://gerrit.wikimedia.org/r/491756

Change 491756 merged by CDanis:
[operations/puppet@production] install_server: purge old files from /srv/autoinstall

https://gerrit.wikimedia.org/r/491756

Change 490404 merged by CDanis:
[operations/puppet@production] partman: grub-install on all RAID{1,10} drives

https://gerrit.wikimedia.org/r/490404

Buster failed to install on my md install for 2552d12fe15ec1 with "grub install sda sdb failed, cannot install on sda". I cannot be sure it is that change, because this is buster, there is a weird hybrid software RAID + hw RAID setup going on, and it could be a question of using the wrong recipe because of the extra disk, but FYI.

> Buster failed to install on my md install for 2552d12fe15ec1 with "grub install sda sdb failed, cannot install on sda". I cannot be sure it is that change because buster, and because there is weird hybrid software raid + hw raid going on, and it could be a question of using the wrong recipe because of the extra disk, but FYI.

Jaime, if you have a spare host in this hardware configuration, I could try reimaging and gathering logs.

An update: as of now, 20.8% of the fleet (or 30% of hosts with software RAID enabled) have redundant bootloaders. This is just from fixing the partman configs and waiting for reimages to happen 'naturally'. That's about all the work I'm going to do on this for the time being; I figure improvements will continue as services need to move to Buster.

> Jaime, if you have a spare host in this hardware configuration, I could try reimaging and gathering logs.

@CDanis I was able to install it in the end; it was a conflict with another drive (hw RAID, in addition to the sw one) that caused the issues, not the recipe. Sorry for the misreporting.

Current stats as of April 27:

  • 1405 physical machines
  • 1 machine broken (mw1280)
  • 357/1405 (~25%) don't use software RAID, so are out of scope
  • 711 machines use software RAID, and have bootloaders on all their drive replicas (51% of fleet, 68% of SW RAID machines)
  • 336 machines use software RAID, but don't have properly replicated bootloaders (24% of fleet, 32% of SW RAID machines)

Here's a list of the bad machines:

an-coord1001.eqiad.wmnet
an-master[1001-1002].eqiad.wmnet
aqs[1004-1009].eqiad.wmnet
backup2002.codfw.wmnet
bast[1002,2002,4002,5001].wikimedia.org
cloudelastic[1001-1004].wikimedia.org
cloudnet[1003-1004].eqiad.wmnet
cloudvirt[2001-2003]-dev.codfw.wmnet
conf[2001-2003].codfw.wmnet
conf[1004-1006].eqiad.wmnet
contint1001.wikimedia.org
cumin2001.codfw.wmnet
cumin1001.eqiad.wmnet
db2093.codfw.wmnet
db1115.eqiad.wmnet
dbproxy[1003,1008,1012-1016].eqiad.wmnet
deploy2001.codfw.wmnet
deploy1001.eqiad.wmnet
druid[1001-1006].eqiad.wmnet
elastic[2025-2047,2049-2054].codfw.wmnet
elastic[1032-1038,1040-1045,1047-1052].eqiad.wmnet
eventlog1002.eqiad.wmnet
flerovium.eqiad.wmnet
furud.codfw.wmnet
ganeti[1001-1004].eqiad.wmnet
graphite2003.codfw.wmnet
graphite1004.eqiad.wmnet
helium.eqiad.wmnet
icinga[1001,2001].wikimedia.org
kubernetes[2001-2004].codfw.wmnet
kubernetes[1001-1004].eqiad.wmnet
kubestage[1001-1002].eqiad.wmnet
labweb[1001-1002].wikimedia.org
logstash[2001-2003].codfw.wmnet
logstash[1010-1012].eqiad.wmnet
maps1004.eqiad.wmnet
ms-fe[2005-2006,2008].codfw.wmnet
ms-fe[1005-1008].eqiad.wmnet
mw[2135-2147,2151-2212,2214,2262].codfw.wmnet
mwlog2001.codfw.wmnet
mwlog1001.eqiad.wmnet
mwmaint2001.codfw.wmnet
mwmaint1002.eqiad.wmnet
netmon[1002,2001].wikimedia.org
notebook[1003-1004].eqiad.wmnet
ores[2001-2009].codfw.wmnet
ores[1001-1009].eqiad.wmnet
oresrdb2002.codfw.wmnet
oresrdb[1001-1002].eqiad.wmnet
rdb[2003-2006].codfw.wmnet
rdb[1005-1006,1009-1010].eqiad.wmnet
relforge1001.eqiad.wmnet
restbase[2013,2015-2018].codfw.wmnet
restbase1016.eqiad.wmnet
scandium.eqiad.wmnet
scb[2001-2002,2005-2006].codfw.wmnet
scb[1001-1004].eqiad.wmnet
sessionstore[2001-2003].codfw.wmnet
sessionstore[1001-1003].eqiad.wmnet
snapshot[1008-1009].eqiad.wmnet
stat[1005-1007].eqiad.wmnet
thorium.eqiad.wmnet
wdqs[2001-2006].codfw.wmnet
wdqs[1003-1010].eqiad.wmnet
weblog1001.eqiad.wmnet
wtp[2001-2020].codfw.wmnet
wtp[1025-1048].eqiad.wmnet

Going to continue to let this linger; natural reimaging activity is solving the problem well.

CDanis lowered the priority of this task from Medium to Low.Apr 27 2020, 6:31 PM

@CDanis backup2002 was recently installed into buster (apparently, wrongly), but it already contains data. Would a simple:

grub-install /dev/sdb

(I am assuming sda already has it) fix the issue?

> @CDanis backup2002 was recently installed into buster (apparently, wrongly), but it already contains data. Would a simple:
>
> grub-install /dev/sdb
>
> (I am assuming sda already has it) fix the issue?

That ought to take care of it, yeah. I confirm it is sdb that is missing grub.

I'm not sure how backup2002 wound up in that state -- AFAIK the partman files should have been fixed since Buster was available, and I see that you reimaged on Apr 15... so it's strange that it happened.

> I'm not sure how backup2002 wound up in that state -- AFAIK the partman files should have been fixed since Buster was available, and I see that you reimaged on Apr 15... so it's strange that it happened.

backup* hosts don't use the new standard partman recipes :-)

Ah, I see, no-srv-format.cfg doesn't set up any SW RAID at all, which means debian-installer of course won't know to install grub on both disks. I'm guessing you set up the RAID for / manually? So then this state is expected.

no-srv-format.cfg is the (wrong) recipe we use for forcing a failure so stateful services don't get accidentally reimaged (as it happened once) :-D.

Backup hosts use normally custom/backup-format.cfg when setup for the first time. This is custom because it sets up both a sw RAID and a hw RAID in the same host.

It is true, however, that I installed this one manually as we had an unrelated issue on install (not related to partman, but to dhcp server reimage).

Ah okay. It does look like backup-format.cfg contains the necessary incantations for replicated GRUB.

I've manually corrected backup2002, db1115, and db2093, and will ask Manuel about the proxies; some of those are just being decommissioned or will be reimaged soon. Arguably we will not have many servers affected, because most of ours use hw RAID with a single virtual sda, not md.

Thanks for the help!

Hello people, I found this task after dealing with a failed /dev/sda in a RAID1 array. I thought that I had to run grub-install on /dev/sdb via d-i rescue, but then I noticed that the partman recipe was already fixed and the host had been reimaged recently, so I checked the BIOS. There is an option called Hard disk failover, which was set to Disabled, preventing the host from trying another disk if the first one listed is missing. After setting it to Enabled I was able to boot correctly (before that, the host went directly to PXE every time). I am writing this here to warn other people who might get into my position in the future :D

Is there a way to audit this option and see how many hosts have it set disabled? After all this work it seems something that we'd want to keep enabled..

> Is there a way to audit this option and see how many hosts have it set disabled? After all this work it seems something that we'd want to keep enabled..

If there is a way to get/set the parameter via ipmitool, we could add it to Spicerack and create a cookbook. However, from a quick scan of ipmitool I couldn't see a way to get this information.

I agree we should audit it. I think that with the Redfish API it should be doable; adding @crusnov, as they worked on it last quarter.

> I agree we should audit it. I think that with redfish API it should be doable, adding @crusnov as they've worked on it last Q.

Indeed. I don't know if there's a direct Redfish endpoint for it, but that setting does show up in the Server Configuration Profile, for example:

{ "Name": "HddFailover",
  "Value": "Disabled",
  "Set On Import": "True",
  "Comment": "Read and Write" },

This quarter I'll be working on Spicerack support for manipulating these settings, but in the mean time I have some little tools that can manipulate them if needed.

For context, the Server Configuration Profile is a Dell specific interface to setting BIOS settings across the available BIOSen in a particular box. It can be manipulated via Redfish or via other interfaces.
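For the audit itself, one approach would be to export each host's Server Configuration Profile and look up the attribute. Here is a sketch of just the parsing half, assuming the usual Dell SCP JSON shape shown above; the Redfish export mechanics and the exact BIOS component FQDD are omitted, and scp_attribute is a hypothetical helper, not Spicerack code.

```python
import json

# Hypothetical audit helper: given a Dell Server Configuration Profile (SCP)
# export as JSON text, return the value of an attribute such as "HddFailover".
# Assumed layout (an assumption, verify against a real export):
# {"SystemConfiguration": {"Components": [{"FQDD": ..., "Attributes": [...]}]}}

def scp_attribute(scp_json_text, name):
    scp = json.loads(scp_json_text)
    for component in scp["SystemConfiguration"]["Components"]:
        for attr in component.get("Attributes", []):
            if attr.get("Name") == name:
                return attr.get("Value")
    return None  # attribute not present in this profile

sample = json.dumps({
    "SystemConfiguration": {
        "Components": [
            {"FQDD": "BIOS.Setup.1-1",
             "Attributes": [{"Name": "HddFailover", "Value": "Disabled"}]}
        ]
    }
})
print(scp_attribute(sample, "HddFailover"))
```

A fleet audit would then just be: fetch the SCP from each iDRAC, run this over it, and collect the hosts where the value is "Disabled".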

Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!

(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.)

Has there been any progress toward goal #2? I didn't see where anything had been added to the mentioned runbook.

For context: We replaced sda in aqs1012 recently (T396970) and were (I believe) bit by this issue. It would seem to have been reimaged since the partman recipe was fixed, and it does not appear in the April 2020 list posted in T215183#6086396, so I'm wondering if a prior replacement didn't get the bootloader installed.

As a follow-up, I did find a device with a missing bootloader: aqs1014, which went up after its partman recipe was fixed (it has had SSDs replaced in the years since, though)

Something else worth pointing out. Given how fast-and-loose Linux plays with device ordering, this probably isn't going to work in every case.

And not all hosts are booting from a RAID1; the aqs cluster at least uses a RAID10, which, combined with the variability of device ordering, makes it trickier still to find the candidates.

> Has there been any progress toward goal #2? I didn't see where anything had been added to the mentioned runbook.

Good question, I think probably not. @RobH or @wiki_willy is this on your radar?

> For context: We replaced sda in aqs1012 recently (T396970) and were (I believe) bit by this issue. It would seem to have been reimaged since the partman recipe was fixed, and it does not appear in the April 2020 list posted in T215183#6086396, so I'm wondering if a prior replacement didn't get the bootloader installed.

@Eevans, can you check in the BIOS settings of aqs1012 to see if a setting like "Hard drive failover" exists, per T215183#6718961 ?

I also never spent much time looking at or thinking about RAID10 hosts, as you said. Honestly I don't remember what debian-installer does in the first place for RAID10 and bootloaders.

> [ ... ]
>
> > For context: We replaced sda in aqs1012 recently (T396970) and were (I believe) bit by this issue. It would seem to have been reimaged since the partman recipe was fixed, and it does not appear in the April 2020 list posted in T215183#6086396, so I'm wondering if a prior replacement didn't get the bootloader installed.
>
> @Eevans, can you check in the BIOS settings of aqs1012 to see if a setting like "Hard drive failover" exists, per T215183#6718961 ?

It does, and I set it enabled after the device had been replaced and wouldn't boot (hint: it still didn't :( ).

aqs1014 was missing a bootloader for sdb as well (I've since fixed that), so if it had subsequently lost sda (like aqs1012), I think it would have manifested exactly the same.

> [ ... ]
>
> I also never spent much time looking at or thinking about RAID10 hosts, as you said. Honestly I don't remember what debian-installer does in the first place for RAID10 and bootloaders.

This config uses modules/install_server/files/autoinstall/partman/custom/aqs-cassandra-8ssd-2srv.cfg which you fixed back in 2019, it should be using all of /dev/sd[a-d]. For most of the aqs hosts that would seem to be the case, too. Now I'm wondering what the other 6 devices (it's supposed to have 8, but one had been replaced, and another failed) looked like. You would think that even with generous reordering, chances would be very good you'd find at least one bootloader among the first 4!

> Has there been any progress toward goal #2? I didn't see where anything had been added to the mentioned runbook.
>
> For context: We replaced sda in aqs1012 recently (T396970) and were (I believe) bit by this issue. It would seem to have been reimaged since the partman recipe was fixed, and it does not appear in the April 2020 list posted in T215183#6086396, so I'm wondering if a prior replacement didn't get the bootloader installed.

Answering myself here: I see now that T220842 is a subtask, and that it has been closed as resolved (with https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Sw_raid_rebuild_directions cited as the result), but I don't see anything there with respect to installing the bootloader. I'm happy to take a stab at adding something —assuming this is the canonical location, and that we're in agreement on what should be done..

Ok, so in an attempt to summarize things:

  • It seems that goal no. 1 is complete, all partman preseeds have been updated
  • Goal no. 3 might need to be revisited with a mind toward all of the hosts that have been reimaged, but have since had disks replaced, and didn't get a bootloader (re)installed. This is Real™, I've already found examples during spot-checking. Anyone doing so should be aware that the cumin copypasta in the description is only an approximation (for example it assumes raid1, and it doesn't take into account any reordering of devices)
  • Bonus: bulk update the fleet for the BIOS setting that enables boot device failover
  • Last (but definitely not least), make sure that the runbook prominently covers the installation of a bootloader when an applicable device is replaced. It probably also makes sense to document the BIOS setting in this context (see point above)

Thinking out loud here, but I wonder if there would be any harm in putting a bootloader on all storage devices? If not, it would be pretty simple to put together a cookbook that iterated storage devices and invoked grub-install $dev for each.
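A sketch of how such a cookbook might pick its targets, reusing the footnote's heuristic (the string "GRUB" appears in the first 512 bytes of a grub-install'd MBR). Reading the real /dev/sdX devices is abstracted behind a dict so the selection logic can be reasoned about offline; none of this is existing Spicerack code, and the heuristic shares the footnote's limitations (it won't cover pure UEFI layouts, for instance).

```python
# Hypothetical cookbook sketch: decide which storage devices still need a
# bootloader, using the same b"GRUB"-in-the-first-sector heuristic as the
# cumin one-liner in the task description.

def has_grub(first_sector: bytes) -> bool:
    # grub-install leaves the string "GRUB" in the MBR boot code.
    return b"GRUB" in first_sector[:512]

def devices_needing_grub(first_sectors: dict) -> list:
    """first_sectors maps device path -> first 512 bytes read from it."""
    return sorted(dev for dev, sector in first_sectors.items()
                  if not has_grub(sector))

# Example with fake sectors: sda has GRUB installed, sdb (freshly swapped)
# does not and would get a grub-install.
sectors = {
    "/dev/sda": b"\x00" * 100 + b"GRUB" + b"\x00" * 408,
    "/dev/sdb": b"\x00" * 512,
}
for dev in devices_needing_grub(sectors):
    print(f"grub-install {dev}")
```

The cookbook proper would enumerate the disks (e.g. from lsblk), read the first sector of each, and only then shell out to grub-install for the ones this filter returns.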