
partman vs cloudcephosd1012
Closed, ResolvedPublic

Description

I can't reimage cloudcephosd1012 because of partman changes that are no longer compatible with this host.

The first big batch of cloudceph nodes (cloudcephosd10[04-15]) have raid controllers, ordered by mistake. There should still be two small OS drives and a bunch of unpartitioned big drives, but the partman process fails saying that there are 10 bootable drives.

That's caused by this snippet here:

+  # Double-check that we have exactly two SCSI devices
+  num_devices=$(echo ${devices} | egrep -o '/dev/sd[a-z]' | wc -l)
+  if [ ${num_devices} -ne 2 ]
+  then
+    echo "We expected to find two boot devices, but instead found ${num_devices}."
+    exit 1
+  fi

Although by my reading it should complain about 0 bootable drives rather than 10.
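To illustrate why a RAID-controller host trips this check: if the controller exposes every disk as a SCSI device, ${devices} holds ten /dev/sdX entries and the count comes out as 10, not 2. A minimal sketch (the device list here is hypothetical):

```shell
# Hypothetical device list for a host whose RAID controller presents
# all ten disks as SCSI devices; the count check above sees 10, not 2.
devices="/dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj"
num_devices=$(echo ${devices} | grep -Eo '/dev/sd[a-z]' | wc -l)
echo "${num_devices}"
```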

Topics for discussion:

  • is the new partman recipe likely to work for all the rest of our osd nodes? Do they all conform to the two-SCSI-OS-drives/rest-non-SCSI pattern assumed here?
  • assuming 'yes,' what's a good clean way to opt these few servers out of that check?
  • dcaro seems to have already reimaged cloudcephosd1006-1011. How?
  • should we just scrap cloudcephosd10[04-15] rather than deal with this exception? They're due for refresh in 2025.

Event Timeline

is the new partman recipe likely to work for all the rest of our osd nodes? Do they all conform to the two-SCSI-OS-drives/rest-non-SCSI pattern assumed here?

The answer to that seems to be 'no'. The next batch of servers we bought, 1016-1020 (T271239), also has a raid controller and only SATA drives for both OS and storage.

Andrew triaged this task as Medium priority.Jan 15 2025, 11:22 PM

dcaro seems to have already reimaged cloudcephosd1006-1011. How?

I did not do anything special that I remember, besides having to retry a few times :/

Change #1113451 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] configure_cephosd_disks(): don't give up if scsi drive count is wrong

https://gerrit.wikimedia.org/r/1113451

Change #1113451 merged by Andrew Bogott:

[operations/puppet@production] configure_cephosd_disks(): Assume os drives are less than 1.5TB

https://gerrit.wikimedia.org/r/1113451
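The merged patch switches from counting SCSI devices to classifying drives by size. A sketch of that idea (not the actual patch; the threshold and size values here are hypothetical, and the installer would read real sizes with something like `blockdev --getsize64`):

```shell
# Sketch: treat anything under ~1.5 TB as an OS drive, assuming the
# big storage drives are all larger than that.
max_os_bytes=1500000000000

classify() {
    # $1 = device name, $2 = size in bytes (hypothetical values below)
    if [ "$2" -lt "$max_os_bytes" ]; then
        echo "$1 os"
    else
        echo "$1 storage"
    fi
}

classify /dev/sda 480103981056    # ~480 GB SSD -> os
classify /dev/sdc 4000787030016   # ~4 TB HDD   -> storage
```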

With the above patch, the partman script is now recognizing the OS drives correctly (I think). However, it now generates a totally incoherent recipe:

d-i partman-auto/disk   string /dev/sda /dev/sdb
d-i grub-installer/bootdev  string  /dev/sda /dev/sdb
# DEBUG devices is /dev/sda /dev/sdb
# DEBUG boot_parts is /dev/sda1#/dev/sdb1
# DEBUG swap_parts is /dev/sda2#/dev/sdb2
# DEBUG root_parts is /dev/sda3#/dev/sdb3
#
# Parameters are:
# <raidtype> <devcount> <sparecount> <fstype> <mountpoint> \
#   <devices> <sparedevices>
d-i partman-auto-raid/recipe string          1    2    0    ext4    /boot      dev/sdb3              .     1    2    0    lvm    -                  /dev/sda3#/d

# DEBUG and that's the end

Either there's a second pass that's garbling things, or there's a serious quoting issue in the code that produces that 'd-i partman-auto-raid' line. When I run the same script locally (with $devices pre-filled) I get something more reasonable:

d-i partman-auto-raid/recipe string          1    2    0    ext4    /boot             /dev/sda1#/dev/sdb1              .                                    1    2    0    swap    -                 /dev/sda2#/dev/sdb2              .                                    1    2    0    lvm    -                  /dev/sda3#/dev/sdb3              .

I've confirmed that the script generates the same incoherent /tmp/dynamic_disc.cfg in the Debian installer shell console.
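One hypothetical illustration of the kind of quoting bug suspected here: expanding a multi-space string unquoted lets the shell collapse the run-of-spaces column layout the recipe relies on. This is only a sketch of the failure mode, not a diagnosis of the actual script:

```shell
# The recipe line depends on runs of spaces for its column layout.
recipe="1    2    0    ext4    /boot      /dev/sda1#/dev/sdb1"

echo $recipe      # unquoted: word splitting collapses the spacing
echo "$recipe"    # quoted: layout preserved exactly
```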

Change #1113529 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] partman_early_command: fix cephosd recipe

https://gerrit.wikimedia.org/r/1113529

Change #1113529 merged by Andrew Bogott:

[operations/puppet@production] partman_early_command: fix cephosd recipe

https://gerrit.wikimedia.org/r/1113529

Now it looks right to me:

d-i partman-auto/disk   string /dev/sda /dev/sdb
d-i grub-installer/bootdev  string  /dev/sda /dev/sdb
# Parameters are:
# <raidtype> <devcount> <sparecount> <fstype> <mountpoint> \
#   <devices> <sparedevices>
d-i partman-auto-raid/recipe string  \
        1    2    0    ext4    /boot \
            /dev/sda1#/dev/sdb1      \
        .                            \
        1    2    0    swap    -     \
            /dev/sda2#/dev/sdb2      \
        .                            \
        1    2    0    lvm    -      \
            /dev/sda3#/dev/sdb3      \
        .

It gets slightly further, but still fails on RAID creation.

Still fails.

Error while setting up RAID

<3 to spend another day with partman

Change #1113595 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cephosd.cfg partman: reduce minimum partition sizes

https://gerrit.wikimedia.org/r/1113595

Change #1113595 merged by Andrew Bogott:

[operations/puppet@production] cephosd.cfg partman: reduce minimum partition sizes

https://gerrit.wikimedia.org/r/1113595
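For context on what "reduce minimum partition sizes" means in partman terms: each partition entry in a d-i recipe starts with minimum size, priority, and maximum size fields, and partman refuses to partition if the minimums don't fit the disk. A generic (hypothetical, not the cephosd.cfg contents) expert_recipe fragment showing those fields:

```
# <min MB> <priority> <max MB> <fstype> ... per partition;
# shrinking the first field is what "reduce minimum sizes" does.
d-i partman-auto/expert_recipe string \
      boot-root :: \
              300 10000 500 ext4 \
                      $primary{ } $bootable{ } \
                      method{ format } format{ } \
                      use_filesystem{ } filesystem{ ext4 } \
                      mountpoint{ /boot } \
              . \
              1000 20000 -1 ext4 \
                      method{ format } format{ } \
                      use_filesystem{ } filesystem{ ext4 } \
                      mountpoint{ / } \
              .
```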

Change #1113597 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cephosd.cfg partman: reduce minimum partition sizes, again

https://gerrit.wikimedia.org/r/1113597

Change #1113597 merged by Andrew Bogott:

[operations/puppet@production] cephosd.cfg partman: reduce minimum partition sizes, again

https://gerrit.wikimedia.org/r/1113597

After some more hatchet work this now images correctly. No idea if it'll work for other bigger osd nodes but I'm pretty sure it didn't work before either.

Thanks @Andrew!

Andrew claimed this task.