Write new partman recipe for cloudelastic
Closed, Resolved (Public)

Description

In the last round of hardware upgrades (T341214), we decided to use RAID-0 instead of RAID-10 for cloudelastic hosts. This means we'll need to write a new partman recipe for the upcoming cloudelastic hosts (cloudelastic1007-10).

To do this, we'll need to:

  • Write a new partman recipe and save it to ./modules/install_server/files/autoinstall/partman in the puppet repo.
  • Add an entry to modules/install_server/files/autoinstall/netboot.cfg in the puppet repo.

The new recipe will need RAID-1 for the OS and RAID-0 for the remaining space.

AC:

  • New cloudelastic servers are configured with the new recipe.
  • Elastic is running stably, with acceptable performance.

Event Timeline

bking mentioned this in Unknown Object (Task). Jul 21 2023, 7:31 PM

Haven't looked closely, but I'm guessing the following recipes from modules/install_server/files/autoinstall/netboot.cfg could work (or would be the easiest to adapt):

restbase1019|restbase102[0-8]|restbase103[0-3]) echo reuse-parts.cfg  partman/custom/reuse-cassandrahosts-3ssd-jbod.cfg ;; \
restbase2009|restbase201[12]) echo reuse-parts.cfg partman/custom/reuse-cassandrahosts-4ssd-jbod.cfg ;; \
restbase201[3-9]|restbase202[0-6]) echo reuse-parts.cfg partman/custom/reuse-cassandrahosts-3ssd-jbod.cfg ;; \
bking renamed this task from Write new partman recipe for cloudelastic (jbod) to Write new partman recipe for cloudelastic (jbod) and update relevant Elastic config. Jul 24 2023, 3:37 PM
bking updated the task description.
Gehel triaged this task as Medium priority. Jul 25 2023, 3:50 PM
bking moved this task from Misc to In Progress on the Data-Platform-SRE board.

This is blocking T342538, so I'm starting on it now.

Change 960114 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: new partman recipe

https://gerrit.wikimedia.org/r/960114

bking renamed this task from Write new partman recipe for cloudelastic (jbod) and update relevant Elastic config to Write new partman recipe for cloudelastic. Sep 22 2023, 5:53 PM
bking updated the task description.

Please excuse the drive-by comment. I've worked with @Muehlenhoff on standardizing our partman recipes, and I'm wondering whether the standard raid0 recipes (i.e. raid1 for / and raid0 for /srv) would work in this case.

In other words, the same as elastic*, adjusted for the number of devices as needed:

elastic*) echo partman/standard.cfg partman/raid0.cfg partman/raid0-2dev.cfg ;; \

You must be psychic! I was just looking at the recipes again and thinking the same thing. I'll try starting with partman/raid0-3dev.cfg instead of a custom recipe and work from there.

Sounds great. Please reach out and/or send reviews if something is amiss with the standard recipes.

Change 960114 merged by Bking:

[operations/puppet@production] cloudelastic: new partman recipe

https://gerrit.wikimedia.org/r/960114

Change 961186 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: correct partman recipe

https://gerrit.wikimedia.org/r/961186

Change 961186 merged by Bking:

[operations/puppet@production] cloudelastic: correct partman recipe

https://gerrit.wikimedia.org/r/961186

Change 961245 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] partman: remove all traces of cloudelastic

https://gerrit.wikimedia.org/r/961245

Change 961245 merged by Bking:

[operations/puppet@production] partman: remove all traces of cloudelastic

https://gerrit.wikimedia.org/r/961245

Still broken. It's also possible (but extremely unlikely) that the change affected other builds; see T347434.

Confirmed that the change did not affect other image operations.

Change 961478 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: new partman recipe

https://gerrit.wikimedia.org/r/961478

Change 961478 merged by Bking:

[operations/puppet@production] cloudelastic: new partman recipe

https://gerrit.wikimedia.org/r/961478

The new recipe still fails with the message `Failed to load ldlinux.c32`. That doesn't sound like a partitioning problem. Will attempt a firmware update and report back.

OK, I got the host to boot. Now we're getting partman errors (fetched from /var/log/partman via install-console):

/lib/partman/init.d/25md-devices: *******************************************************
/lib/partman/init.d/30parted: *******************************************************
parted_server: ======= Starting the server
parted_server: main_loop: iteration 1
parted_server: Opening infifo
parted_server: Read command: PARTITIONS
parted_server: The device =dev=sda is not opened.
parted_server: Line 1418. CRITICAL ERROR!!!  EXITING.
/lib/partman/init.d/30parted: IN: OPEN =dev=md0 /dev/md0

Will compare the recipes again and adjust as necessary.

Change 963328 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] partman: fix raid0-3dev.cfg

https://gerrit.wikimedia.org/r/963328

Change 963328 merged by Bking:

[operations/puppet@production] partman: fix raid0-3dev.cfg

https://gerrit.wikimedia.org/r/963328

Change 963334 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: include raid0.cfg in netboot.cfg

https://gerrit.wikimedia.org/r/963334

Change 963334 merged by Bking:

[operations/puppet@production] cloudelastic: include raid0.cfg in netboot.cfg

https://gerrit.wikimedia.org/r/963334
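
Judging from the commit titles above and the elastic* entry quoted earlier, the cloudelastic entry presumably ends up combining the same three files, with the 3-device variant in place of raid0-2dev.cfg. Below is a minimal sketch of that selection logic as a standalone shell function. The function name, the host pattern, and the exact file list are assumptions inferred from this thread, not copied from the merged change. The real netboot.cfg entries also end in ` \` because they sit inside a larger command.

```shell
# Hypothetical helper: print the partman files selected for a hostname.
recipe_for() {
  case "$1" in
    # Assumed pattern covering cloudelastic1007 through cloudelastic1010.
    cloudelastic100[7-9]|cloudelastic1010)
      echo partman/standard.cfg partman/raid0.cfg partman/raid0-3dev.cfg ;;
    # Entry quoted earlier in this task.
    elastic*)
      echo partman/standard.cfg partman/raid0.cfg partman/raid0-2dev.cfg ;;
  esac
}

recipe_for cloudelastic1007
```

Testing the pattern this way before a reimage is cheap, since a hostname that matches no entry simply prints nothing.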

Confirmed that the partman recipe is working via install-console:

root@cloudelastic1007:~# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda       8:0    0  1.7T  0 disk
├─sda1    8:1    0  285M  0 part
├─sda2    8:2    0 74.5G  0 part
│ └─md0   9:0    0 74.4G  0 raid1 /
├─sda3    8:3    0  977M  0 part
│ └─md1   9:1    0  976M  0 raid1 [SWAP]
└─sda4    8:4    0  1.7T  0 part
  └─md2   9:2    0    5T  0 raid0 /srv
sdb       8:16   0  1.7T  0 disk
├─sdb1    8:17   0  285M  0 part
├─sdb2    8:18   0 74.5G  0 part
│ └─md0   9:0    0 74.4G  0 raid1 /
├─sdb3    8:19   0  977M  0 part
│ └─md1   9:1    0  976M  0 raid1 [SWAP]
└─sdb4    8:20   0  1.7T  0 part
  └─md2   9:2    0    5T  0 raid0 /srv
sdc       8:32   0  1.7T  0 disk
├─sdc1    8:33   0  285M  0 part
├─sdc2    8:34   0 74.5G  0 part
│ └─md0   9:0    0 74.4G  0 raid1 /
├─sdc3    8:35   0  977M  0 part
│ └─md1   9:1    0  976M  0 raid1 [SWAP]
└─sdc4    8:36   0  1.7T  0 part
  └─md2   9:2    0    5T  0 raid0 /srv
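
The layout above can also be checked mechanically rather than by eye. The sketch below counts how many lsblk rows carry a given TYPE and MOUNTPOINT. Because lsblk lists each md array once under every member partition, that count equals the number of members. The helper name is hypothetical, and it assumes the default lsblk column order shown above (TYPE second to last, MOUNTPOINT last).

```shell
# Hypothetical helper: count lsblk rows with the given TYPE and MOUNTPOINT.
# Reads lsblk output on stdin; rows without a mountpoint never match.
count_members() {
  awk -v t="$1" -v m="$2" '
    NF >= 2 && $(NF-1) == t && $NF == m { n++ }
    END { print n + 0 }
  '
}

# Example on a live host: lsblk | count_members raid0 /srv
```

Fed the output above, `count_members raid0 /srv` and `count_members raid1 /` would each print 3, matching the intended RAID-1 for the OS and RAID-0 for /srv across three devices.
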
bking moved this task from In Progress to Done on the Data-Platform-SRE board.