
Create partman recipe for cephosd servers
Closed, Resolved (Public)

Authored By
BTullis
Dec 7 2022, 1:48 PM

Description

We need to be able to install the cephosd servers, but at present we do not have a partman recipe ready.

This was intentional, as the storage layout is complex, but the machines are now ready for installation.

Event Timeline

BTullis changed the task status from Open to In Progress. Dec 7 2022, 1:48 PM
BTullis triaged this task as High priority.
BTullis created this task.
BTullis moved this task from Incoming (new tickets) to DaaS Work on the Data-Engineering board.

I've checked the HTTPS management interface for cephosd1001 and all looks good.

  • The 12 x 16 TB HDDs are detected first, with IDs:

Physical Disk 0:2:0
to
Physical Disk 0:2:11

  • The 8 x 3.6 TB SSDs are detected next, with IDs:

Solid State Disk 0:2:16
to
Solid State Disk 0:2:23

  • Then the 2 x 480 GB Operating System SSDs are detected, with IDs:

Solid State Disk 0:2:24
and
Solid State Disk 0:2:25

  • Finally, the 6.4 TB NVMe card is detected with the ID:

PCIe SSD in Slot 2

image.png (600×1 px, 122 KB)

I suspect that this will make the operating system see the two O/S drives as /dev/sdu and /dev/sdv.

Change 870861 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a partman recipe for the cephosd servers

https://gerrit.wikimedia.org/r/870861

Change 870861 merged by Btullis:

[operations/puppet@production] Add a partman recipe for the cephosd servers

https://gerrit.wikimedia.org/r/870861

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye

Change 870955 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Correct the filename of the partman recipe for cephosd

https://gerrit.wikimedia.org/r/870955

Change 870955 merged by Btullis:

[operations/puppet@production] Correct the filename of the partman recipe for cephosd

https://gerrit.wikimedia.org/r/870955

Given that the physical disk IDs are all sequential and well ordered according to the iDRAC card, the results from lsscsi in the Debian installer seem very unusual.

The SCSI device IDs are almost completely alternating:

[0:0:0:0]       disk    KIOXIA  KRM6VVUG3T84    BJ02
[0:0:1:0]       disk    SEAGATE ST18000NM006J   PSL9
[0:0:2:0]       disk    KIOXIA  KRM6VVUG3T84    BJ02
[0:0:3:0]       disk    SEAGATE ST18000NM006J   PSL9
[0:0:4:0]       disk    KIOXIA  KRM6VVUG3T84    BJ02
[0:0:5:0]       disk    SEAGATE ST18000NM006J   PSL9
[0:0:6:0]       disk    KIOXIA  KRM6VVUG3T84    BJ02
[0:0:7:0]       disk    SEAGATE ST18000NM006J   PSL9
[0:0:8:0]       disk    KIOXIA  KRM6VVUG3T84    BJ02
[0:0:9:0]       disk    SEAGATE ST18000NM006J   PSL9
[0:0:10:0]      disk    KIOXIA  KRM6VVUG3T84    BJ02
[0:0:11:0]      disk    SEAGATE ST18000NM006J   PSL9
[0:0:12:0]      disk    KIOXIA  KRM6VVUG3T84    BJ02
[0:0:13:0]      disk    SEAGATE ST18000NM006J   PSL9
[0:0:14:0]      disk    KIOXIA  KRM6VVUG3T84    BJ02
[0:0:15:0]      disk    SEAGATE ST18000NM006J   PSL9
[0:0:16:0]      disk    ATA     HFS480G3H2X069N DZ02
[0:0:17:0]      disk    SEAGATE ST18000NM006J   PSL9
[0:0:18:0]      disk    ATA     HFS480G3H2X069N DZ02
[0:0:19:0]      disk    SEAGATE ST18000NM006J   PSL9
[0:0:20:0]      enclosu DP      BP14G+EXP       2.52
[0:0:21:0]      disk    SEAGATE ST18000NM006J   PSL9

The KIOXIA devices are the SSDs and the SEAGATE devices are the HDDs.

I'm going to see if I can rationalize this detection order, rather than work around it.

Here is the dmesg output from the first two devices detected:

[   69.924961] mpt3sas_cm0: port enable: SUCCESS
[   69.925911] scsi 0:0:0:0: Direct-Access     KIOXIA   KRM6VVUG3T84     BJ02 PQ: 0 ANSI: 7
[   69.925920] scsi 0:0:0:0: SSP: handle(0x000a), sas_addr(0x58ce38ee21f3c50e), phy(4), device_name(0x58ce38ee21f3c50c)
[   69.925924] scsi 0:0:0:0: enclosure logical id (0x500056b31234abff), slot(16) 
[   69.925927] scsi 0:0:0:0: enclosure level(0x0001), connector name(     )
[   69.925931] scsi 0:0:0:0: qdepth(254), tagged(1), scsi_level(8), cmd_que(1)
[   69.926501]  end_device-0:0:1: add: handle(0x000a), sas_addr(0x58ce38ee21f3c50e)

[   69.927889] scsi 0:0:1:0: Direct-Access     SEAGATE  ST18000NM006J    PSL9 PQ: 0 ANSI: 7
[   69.927898] scsi 0:0:1:0: SSP: handle(0x0016), sas_addr(0x5000c500d9bb2bb5), phy(0), device_name(0x5000c500d9bb2bb4)
[   69.927901] scsi 0:0:1:0: enclosure logical id (0x500056b31234abff), slot(0) 
[   69.927904] scsi 0:0:1:0: enclosure level(0x0001), connector name(     )
[   69.927908] scsi 0:0:1:0: qdepth(254), tagged(1), scsi_level(8), cmd_que(1)
[   69.931580]  end_device-0:1:0: add: handle(0x0016), sas_addr(0x5000c500d9bb2bb5)

Note that it finds slot (16) first, followed by slot (0).

I suspect that this is an issue with the mpt3sas driver.

I found this in the changelog for version 30.00.00.00-1 of that driver.

image.png (25×1 px, 13 KB)

We're currently using version 35.100.00.00

~ # modinfo mpt3sas|grep version:
version:        35.100.00.00
srcversion:     7602D5B15707A30D4B2E3AA

A quick check of the serial numbers seems to cast doubt on this theory, since the serial number of /dev/sda is lower than that of /dev/sdb

~ # udevadm info --query=all --name=/dev/sda | grep ID_SERIAL
E: ID_SERIAL=358ce38ee21f3c50d
E: ID_SERIAL_SHORT=58ce38ee21f3c50d
~ # udevadm info --query=all --name=/dev/sdb | grep ID_SERIAL
E: ID_SERIAL=35000c500d9bb2bb7
E: ID_SERIAL_SHORT=5000c500d9bb2bb7

However, I think it's worth checking whether disabling SR-IOV in the BIOS makes any difference to the device detection order.

If it doesn't, then we might want to try a newer kernel. This post suggests that there have been several important fixes since version 38.100.00.00, which landed in kernel version 5.14.

Change 874888 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch the cephosd servers to manual partition configuration

https://gerrit.wikimedia.org/r/874888

Change 874888 merged by Btullis:

[operations/puppet@production] Switch the cephosd servers to manual partition configuration

https://gerrit.wikimedia.org/r/874888

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye executed with errors:

  • cephosd1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye

It turns out that SR-IOV was already disabled in the BIOS. I tried enabling it, but it didn't make any difference, so I reverted to having it disabled.

Booting into manual partition selection also revealed strange results.

image.png (388×713 px, 76 KB)

image.png (388×713 px, 74 KB)

image.png (388×713 px, 61 KB)

One of the HDDs is detected as SCSI bus ID (0,0,0) and allocated the device name: /dev/sda

Then all of the SSDs are detected as /dev/sdb to /dev/sdi, but with out-of-sequence SCSI bus IDs, e.g. (0,1,0), (0,3,0), (0,2,0), (0,5,0), (0,4,0), etc.

The O/S drives are /dev/sdj and /dev/sdk

Finally, the remaining HDDs are detected, once again with out-of-sequence SCSI bus IDs.

I'm going to carry out a manual installation to /dev/sdj and /dev/sdk and then see if the backported bookworm kernel behaves any differently.

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye executed with errors:

  • cephosd1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Change 875275 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the role used for the cephosd servers

https://gerrit.wikimedia.org/r/875275

Change 875275 merged by Btullis:

[operations/puppet@production] Update the role used for the cephosd servers

https://gerrit.wikimedia.org/r/875275

I've completed an installation with manual partitioning and I'm happy enough with it. Here's the output from lsblk.

btullis@cephosd1001:~$ lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda               8:0    0 447.1G  0 disk  
├─sda1            8:1    0   4.7G  0 part  
│ └─md0           9:0    0   4.7G  0 raid1 /boot
└─sda2            8:2    0 442.5G  0 part  
  └─md1           9:1    0 442.3G  0 raid1 
    ├─vg00-root 253:0    0  93.1G  0 lvm   /
    ├─vg00-var  253:1    0 186.3G  0 lvm   /var
    └─vg00-srv  253:2    0 139.7G  0 lvm   /srv
sdb               8:16   0   3.5T  0 disk  
sdc               8:32   0   3.5T  0 disk  
sdd               8:48   0   3.5T  0 disk  
sde               8:64   0   3.5T  0 disk  
sdf               8:80   0   3.5T  0 disk  
sdg               8:96   0   3.5T  0 disk  
sdh               8:112  0   3.5T  0 disk  
sdi               8:128  0   3.5T  0 disk  
sdj               8:144  0 447.1G  0 disk  
├─sdj1            8:145  0   4.7G  0 part  
│ └─md0           9:0    0   4.7G  0 raid1 /boot
└─sdj2            8:146  0 442.5G  0 part  
  └─md1           9:1    0 442.3G  0 raid1 
    ├─vg00-root 253:0    0  93.1G  0 lvm   /
    ├─vg00-var  253:1    0 186.3G  0 lvm   /var
    └─vg00-srv  253:2    0 139.7G  0 lvm   /srv
sdk               8:160  0  16.4T  0 disk  
sdl               8:176  0  16.4T  0 disk  
sdm               8:192  0  16.4T  0 disk  
sdn               8:208  0  16.4T  0 disk  
sdo               8:224  0  16.4T  0 disk  
sdp               8:240  0  16.4T  0 disk  
sdq              65:0    0  16.4T  0 disk  
sdr              65:16   0  16.4T  0 disk  
sds              65:32   0  16.4T  0 disk  
sdt              65:48   0  16.4T  0 disk  
sdu              65:64   0  16.4T  0 disk  
sdv              65:80   0  16.4T  0 disk  
nvme0n1         259:1    0   5.8T  0 disk

The O/S is installed to a software RAID 1 device spanning disks /dev/sda and /dev/sdj.

The SSDs are detected as /dev/sdb to /dev/sdi

The HDDs are detected as /dev/sdk to /dev/sdv

The reason that the devices were detected out of order previously is that only one drive of those connected to the HBA 330 Mini can be selected as the legacy option ROM boot device. Whichever drive is selected is detected first and allocated /dev/sda. With the default settings the first physical disk (Physical Disk 0:2:0) was selected, which was one of the HDDs. When I manually selected the last disk (Physical Disk 0:2:25), which is one of the two 480 GB O/S SSDs, it was detected as /dev/sda, and this is fine.
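
As a quick sanity check, it's possible to map SCSI ID 0:0:0:0 back to its SAS address with lsscsi, to confirm which physical disk the HBA promoted. A minimal sketch, assuming lsscsi is installed on the host:

# Show transport (SAS) details for whichever disk the HBA promoted to
# SCSI ID 0:0:0:0, i.e. the legacy option ROM boot device.
lsscsi --transport 0:0:0:0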

Now I'm going to try to repeat this with a partman recipe and then roll it out to the other cephosd servers.

Change 875290 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the partman recipe that is used for the cephosd servers

https://gerrit.wikimedia.org/r/875290

Change 875290 merged by Btullis:

[operations/puppet@production] Update the partman recipe that is used for the cephosd servers

https://gerrit.wikimedia.org/r/875290

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye completed:

  • cephosd1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301041235_btullis_2718565_cephosd1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

The partman recipe is largely working as expected, but I'm still finding some unpredictability regarding the device names.

Although the software RAID array is correctly assembled each time, I have seen the second drive come up as /dev/sdl and /dev/sdi on successive boots, neither of which matches what was in the partman recipe (/dev/sdj).

e.g. on one boot:

btullis@cephosd1001:~$ lsblk
NAME                              MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                 8:0    0 447.1G  0 disk  
├─sda1                              8:1    0   953M  0 part  
│ └─md0                             9:0    0   952M  0 raid1 /boot
├─sda2                              8:2    0   4.7G  0 part  
│ └─md1                             9:1    0   4.7G  0 raid1 [SWAP]
└─sda3                              8:3    0 441.5G  0 part  
  └─md2                             9:2    0 441.4G  0 raid1 
    ├─cephosd1001--vg-root        253:0    0 197.7G  0 lvm   /
    ├─cephosd1001--vg-var         253:1    0  68.4G  0 lvm   /var
    ├─cephosd1001--vg-srv         253:2    0  68.4G  0 lvm   /srv
    └─cephosd1001--vg-placeholder 253:3    0  18.6G  0 lvm   
sdb                                 8:16   0   3.5T  0 disk  
sdc                                 8:32   0   3.5T  0 disk  
sdd                                 8:48   0   3.5T  0 disk  
sde                                 8:64   0   3.5T  0 disk  
sdf                                 8:80   0   3.5T  0 disk  
sdg                                 8:96   0   3.5T  0 disk  
sdh                                 8:112  0   3.5T  0 disk  
sdi                                 8:128  0   3.5T  0 disk  
sdj                                 8:144  0  16.4T  0 disk  
sdk                                 8:160  0  16.4T  0 disk  
sdl                                 8:176  0 447.1G  0 disk  
├─sdl1                              8:177  0   953M  0 part  
│ └─md0                             9:0    0   952M  0 raid1 /boot
├─sdl2                              8:178  0   4.7G  0 part  
│ └─md1                             9:1    0   4.7G  0 raid1 [SWAP]
└─sdl3                              8:179  0 441.5G  0 part  
  └─md2                             9:2    0 441.4G  0 raid1 
    ├─cephosd1001--vg-root        253:0    0 197.7G  0 lvm   /
    ├─cephosd1001--vg-var         253:1    0  68.4G  0 lvm   /var
    ├─cephosd1001--vg-srv         253:2    0  68.4G  0 lvm   /srv
    └─cephosd1001--vg-placeholder 253:3    0  18.6G  0 lvm   
sdm                                 8:192  0  16.4T  0 disk  
sdn                                 8:208  0  16.4T  0 disk  
sdo                                 8:224  0  16.4T  0 disk  
sdp                                 8:240  0  16.4T  0 disk  
sdq                                65:0    0  16.4T  0 disk  
sdr                                65:16   0  16.4T  0 disk  
sds                                65:32   0  16.4T  0 disk  
sdt                                65:48   0  16.4T  0 disk  
sdu                                65:64   0  16.4T  0 disk  
sdv                                65:80   0  16.4T  0 disk  
nvme0n1                           259:1    0   5.8T  0 disk

...but the next boot:

btullis@cephosd1001:~$ lsblk
NAME                              MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                 8:0    0 447.1G  0 disk  
├─sda1                              8:1    0   953M  0 part  
│ └─md0                             9:0    0   952M  0 raid1 /boot
├─sda2                              8:2    0   4.7G  0 part  
│ └─md1                             9:1    0   4.7G  0 raid1 [SWAP]
└─sda3                              8:3    0 441.5G  0 part  
  └─md2                             9:2    0 441.4G  0 raid1 
    ├─cephosd1001--vg-root        253:0    0 197.7G  0 lvm   /
    ├─cephosd1001--vg-var         253:1    0  68.4G  0 lvm   /var
    ├─cephosd1001--vg-srv         253:2    0  68.4G  0 lvm   /srv
    └─cephosd1001--vg-placeholder 253:3    0  18.6G  0 lvm   
sdb                                 8:16   0   3.5T  0 disk  
sdc                                 8:32   0   3.5T  0 disk  
sdd                                 8:48   0   3.5T  0 disk  
sde                                 8:64   0   3.5T  0 disk  
sdf                                 8:80   0   3.5T  0 disk  
sdg                                 8:96   0   3.5T  0 disk  
sdh                                 8:112  0   3.5T  0 disk  
sdi                                 8:128  0 447.1G  0 disk  
├─sdi1                              8:129  0   953M  0 part  
│ └─md0                             9:0    0   952M  0 raid1 /boot
├─sdi2                              8:130  0   4.7G  0 part  
│ └─md1                             9:1    0   4.7G  0 raid1 [SWAP]
└─sdi3                              8:131  0 441.5G  0 part  
  └─md2                             9:2    0 441.4G  0 raid1 
    ├─cephosd1001--vg-root        253:0    0 197.7G  0 lvm   /
    ├─cephosd1001--vg-var         253:1    0  68.4G  0 lvm   /var
    ├─cephosd1001--vg-srv         253:2    0  68.4G  0 lvm   /srv
    └─cephosd1001--vg-placeholder 253:3    0  18.6G  0 lvm   
sdj                                 8:144  0   3.5T  0 disk  
sdk                                 8:160  0  16.4T  0 disk  
sdl                                 8:176  0  16.4T  0 disk  
sdm                                 8:192  0  16.4T  0 disk  
sdn                                 8:208  0  16.4T  0 disk  
sdo                                 8:224  0  16.4T  0 disk  
sdp                                 8:240  0  16.4T  0 disk  
sdq                                65:0    0  16.4T  0 disk  
sdr                                65:16   0  16.4T  0 disk  
sds                                65:32   0  16.4T  0 disk  
sdt                                65:48   0  16.4T  0 disk  
sdu                                65:64   0  16.4T  0 disk  
sdv                                65:80   0  16.4T  0 disk  
nvme0n1                           259:1    0   5.8T  0 disk

I'll try to find a fix for this.

I've now even seen /dev/sda and /dev/sdb swap over on a subsequent boot, with no other changes.

btullis@cephosd1001:~$ lsblk
NAME                              MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                 8:0    0   3.5T  0 disk  
sdb                                 8:16   0 447.1G  0 disk  
├─sdb1                              8:17   0   953M  0 part  
│ └─md0                             9:0    0   952M  0 raid1 /boot
├─sdb2                              8:18   0   4.7G  0 part  
│ └─md1                             9:1    0   4.7G  0 raid1 [SWAP]
└─sdb3                              8:19   0 441.5G  0 part  
  └─md2                             9:2    0 441.4G  0 raid1 
    ├─cephosd1001--vg-root        253:0    0 197.7G  0 lvm   /
    ├─cephosd1001--vg-var         253:1    0  68.4G  0 lvm   /var
    ├─cephosd1001--vg-srv         253:2    0  68.4G  0 lvm   /srv
    └─cephosd1001--vg-placeholder 253:3    0  18.6G  0 lvm   
sdc                                 8:32   0   3.5T  0 disk  
sdd                                 8:48   0   3.5T  0 disk  
sde                                 8:64   0   3.5T  0 disk  
sdf                                 8:80   0   3.5T  0 disk  
sdg                                 8:96   0   3.5T  0 disk  
sdh                                 8:112  0   3.5T  0 disk  
sdi                                 8:128  0   3.5T  0 disk  
sdj                                 8:144  0 447.1G  0 disk  
├─sdj1                              8:145  0   953M  0 part  
│ └─md0                             9:0    0   952M  0 raid1 /boot
├─sdj2                              8:146  0   4.7G  0 part  
│ └─md1                             9:1    0   4.7G  0 raid1 [SWAP]
└─sdj3                              8:147  0 441.5G  0 part  
  └─md2                             9:2    0 441.4G  0 raid1 
    ├─cephosd1001--vg-root        253:0    0 197.7G  0 lvm   /
    ├─cephosd1001--vg-var         253:1    0  68.4G  0 lvm   /var
    ├─cephosd1001--vg-srv         253:2    0  68.4G  0 lvm   /srv
    └─cephosd1001--vg-placeholder 253:3    0  18.6G  0 lvm   
sdk                                 8:160  0  16.4T  0 disk  
sdl                                 8:176  0  16.4T  0 disk  
sdm                                 8:192  0  16.4T  0 disk  
sdn                                 8:208  0  16.4T  0 disk  
sdo                                 8:224  0  16.4T  0 disk  
sdp                                 8:240  0  16.4T  0 disk  
sdq                                65:0    0  16.4T  0 disk  
sdr                                65:16   0  16.4T  0 disk  
sds                                65:32   0  16.4T  0 disk  
sdt                                65:48   0  16.4T  0 disk  
sdu                                65:64   0  16.4T  0 disk  
sdv                                65:80   0  16.4T  0 disk  
nvme0n1                           259:1    0   5.8T  0 disk

Looking more closely at lsscsi -v we can see that several of the device names are out of order, although the SCSI bus IDs are correct.

image.png (891×1 px, 362 KB)

This makes me think that the cause is possibly some sort of race condition in the driver.

I'm going to try the bullseye-backports kernel, to see if it's more predictable.

Change 875395 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the cephosd partman recipe again

https://gerrit.wikimedia.org/r/875395

Change 875395 merged by Btullis:

[operations/puppet@production] Update the cephosd partman recipe again

https://gerrit.wikimedia.org/r/875395

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye

Kernel 6.0.12 from backports is no better, unfortunately.

image.png (921×1 px, 373 KB)

This has version 42.100.00.00 of the mpt3sas driver

btullis@cephosd1001:~$ /sbin/modinfo mpt3sas|grep version
version:        42.100.00.00
srcversion:     BD0C3F71FA88C67CE030B9D
vermagic:       6.0.0-0.deb11.6-amd64 SMP preempt mod_unload modversions

The good thing is that the SCSI bus IDs are all consistently correct, so we will be able to refer to /dev/disk/by-id and /dev/disk/by-path consistently when working with Ceph.

So while consistent device names would be nice, I think we're going to have to adapt to the fact that we can't depend on them.

I see that @jbond and @MatthewVernon recently worked on a similar issue in T308677: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem

The fix that was implemented was an autoinstall/scripts/partman_early_command.sh script that ascertains the correct device names to use.

I think that I might try the same approach here. It should be OK to use /dev/disk/by-id here with the model number of the disk, e.g.

/sys/devices # ls -l /dev/disk/by-id/ata-HFS480G3H2X069N*
lrwxrwxrwx    1 root     root             9 Jan  4 18:21 /dev/disk/by-id/ata-HFS480G3H2X069N_ENB3N6461I2103U4C -> ../../sda
lrwxrwxrwx    1 root     root             9 Jan  4 18:21 /dev/disk/by-id/ata-HFS480G3H2X069N_ENB3N6461I2103U4G -> ../../sdj
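
For instance, a preseed script could resolve that model-based glob to kernel device names like this (a sketch; the HFS480G3H2X069N model string is an assumption that could change with a future hardware order):

# Resolve the two O/S disks from their by-id model glob.
# Sketch only; the model string may change on the next hardware order.
for link in /dev/disk/by-id/ata-HFS480G3H2X069N_*; do
    readlink -f "$link"
done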

Yeah, I think the takeaway is "you can (no longer) rely on device names being consistent between reboots".

For Ceph, though, this mostly isn't an issue - the device gets enough metadata on it that on boot ceph knows which OSD it is, so you never need to worry about fstab.

In terms of the installer, I think you might want to aim for something more reliable than the model number of the drive (sadness when it's slightly changed on the next hardware order); @jbond's approach (which I've done something similar to in the past) of having a small script that works out which two drives you want the installer to work on and then produces a partman recipe might be the way to go here too?

e.g. look for /sys/block/sd*/queue/rotational == 0 and size < 3T (and a sanity check this results in 2 devices)?
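
A minimal sketch of that heuristic, assuming /sys/block/*/size reports 512-byte sectors and using a 3 TiB cutoff (the merged partman_early_command.sh is authoritative and differs in detail):

#!/bin/sh
# Sketch: find the non-rotational disks smaller than 3 TiB, which should
# be the two O/S SSDs. /sys/block/*/size is in 512-byte sectors, so
# 3 TiB = 3 * 2^40 / 512 = 6442450944 sectors.
OS_DISKS=""
for dev in /sys/block/sd*; do
    if [ "$(cat "$dev/queue/rotational")" = "0" ] \
            && [ "$(cat "$dev/size")" -lt 6442450944 ]; then
        OS_DISKS="$OS_DISKS /dev/${dev##*/}"
    fi
done
# Sanity check: we expect exactly two matching devices.
set -- $OS_DISKS
[ "$#" -eq 2 ] || { echo "expected 2 O/S disks, found $#" >&2; exit 1; }
echo "O/S disks:$OS_DISKS"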

Change 876237 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Detect the correct disks for the O/S on the cephosd servers

https://gerrit.wikimedia.org/r/876237

Thanks @MatthewVernon - I've gone with your suggestion, with the only difference being that it's searching for SSDs < 1TB in size.
I've added you and @jbond as reviewers. Hope that's OK.

Change 876237 merged by Btullis:

[operations/puppet@production] Detect the correct disks for the O/S on the cephosd servers

https://gerrit.wikimedia.org/r/876237

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye executed with errors:

  • cephosd1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye

Change 878005 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Correct the units for the cephosd volumes

https://gerrit.wikimedia.org/r/878005

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye executed with errors:

  • cephosd1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Change 878005 merged by Btullis:

[operations/puppet@production] Correct the units for the cephosd volumes

https://gerrit.wikimedia.org/r/878005

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye

Change 878007 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Reduce the size of the partitions on cephosd servers

https://gerrit.wikimedia.org/r/878007

Change 878007 merged by Btullis:

[operations/puppet@production] Reduce the size of the partitions on cephosd servers

https://gerrit.wikimedia.org/r/878007

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye executed with errors:

  • cephosd1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye

@BTullis It seems you have already gone through most of the issues I went through. Some additional things to mention:

  • It's best to configure the disks as JBOD (and it looks like you have). Swift originally had all disks configured as RAID 0 arrays with one disk in each; this means the disks are created as virtual disks, and the SCSI ID is changed from 0:0:N:0 -> 2:0:N:0.
  • Virtual disks also change the WWN and serial number, so the by-id entries are not always consistent; see below. They also presented all disks as rotational.
  • Due to the discrepancies with virtual disks, for Swift we decided to use the by-path directory, which also means that when a disk is replaced it will keep the same mapping, i.e. 0:0:4:0 will always be mounted to e.g. /srv/swift/objects4.
  • I also looked at using the WWN by first querying Redfish and then building the config, but the WWN in Redfish didn't seem to match the one in Linux, so I didn't explore further.
  • On Swift we also needed a swift_disks fact for the rest of the Puppet config; from @MatthewVernon, I don't think you need this, but let me know if we need to make it a bit more generic or add something for Ceph.

Example by-id listing with a mix of virtual and direct-access disks:

ms-be1040 ~ % ls -la /dev/disk/by-id                                                                     [13:23:42]
total 0
drwxr-xr-x 2 root root 1480 Oct 21 14:14 .
drwxr-xr-x 8 root root  160 Oct 21 14:14 ..
lrwxrwxrwx 1 root root    9 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS742006W4480BGN -> ../../sdb
lrwxrwxrwx 1 root root   10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS742006W4480BGN-part1 -> ../../sdb1
lrwxrwxrwx 1 root root   10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS742006W4480BGN-part2 -> ../../sdb2
lrwxrwxrwx 1 root root   10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS742006W4480BGN-part3 -> ../../sdb3
lrwxrwxrwx 1 root root   10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS742006W4480BGN-part4 -> ../../sdb4
lrwxrwxrwx 1 root root    9 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS7420070S480BGN -> ../../sda
lrwxrwxrwx 1 root root   10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS7420070S480BGN-part1 -> ../../sda1
lrwxrwxrwx 1 root root   10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS7420070S480BGN-part2 -> ../../sda2
lrwxrwxrwx 1 root root   10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS7420070S480BGN-part3 -> ../../sda3
lrwxrwxrwx 1 root root   10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS7420070S480BGN-part4 -> ../../sda4
lrwxrwxrwx 1 root root    9 Oct 25 08:14 md-name-ms-be1040:0 -> ../../md0
lrwxrwxrwx 1 root root    9 Oct 25 08:14 md-name-ms-be1040:1 -> ../../md1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 md-uuid-224e169d:9dde1acd:adc104e4:12beafec -> ../../md1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 md-uuid-d9ca5c37:710fdf2d:34dd3357:836d649c -> ../../md0
lrwxrwxrwx 1 root root    9 Oct 25 08:14 scsi-36d09466049bccf002269cc970709d745 -> ../../sdc
lrwxrwxrwx 1 root root   10 Oct 25 08:14 scsi-36d09466049bccf002269cc970709d745-part1 -> ../../sdc1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070a17fa -> ../../sdd
lrwxrwxrwx 1 root root   10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070a17fa-part1 -> ../../sdd1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070a9650 -> ../../sde
lrwxrwxrwx 1 root root   10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070a9650-part1 -> ../../sde1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070b0b2b -> ../../sdf
lrwxrwxrwx 1 root root   10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070b0b2b-part1 -> ../../sdf1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070b8fa7 -> ../../sdg
lrwxrwxrwx 1 root root   10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070b8fa7-part1 -> ../../sdg1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070bda32 -> ../../sdh
lrwxrwxrwx 1 root root   10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070bda32-part1 -> ../../sdh1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070c5b77 -> ../../sdi
lrwxrwxrwx 1 root root   10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070c5b77-part1 -> ../../sdi1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070cbe8f -> ../../sdj
lrwxrwxrwx 1 root root   10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070cbe8f-part1 -> ../../sdj1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070d301d -> ../../sdk
lrwxrwxrwx 1 root root   10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070d301d-part1 -> ../../sdk1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070dad49 -> ../../sdl
lrwxrwxrwx 1 root root   10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070dad49-part1 -> ../../sdl1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070e2543 -> ../../sdn
lrwxrwxrwx 1 root root   10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070e2543-part1 -> ../../sdn1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070e9ba9 -> ../../sdm
lrwxrwxrwx 1 root root   10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070e9ba9-part1 -> ../../sdm1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 wwn-0x55cd2e414e33c5a2 -> ../../sdb
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x55cd2e414e33c5a2-part1 -> ../../sdb1
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x55cd2e414e33c5a2-part2 -> ../../sdb2
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x55cd2e414e33c5a2-part3 -> ../../sdb3
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x55cd2e414e33c5a2-part4 -> ../../sdb4
lrwxrwxrwx 1 root root    9 Oct 25 08:14 wwn-0x55cd2e414e33c5f8 -> ../../sda
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x55cd2e414e33c5f8-part1 -> ../../sda1
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x55cd2e414e33c5f8-part2 -> ../../sda2
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x55cd2e414e33c5f8-part3 -> ../../sda3
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x55cd2e414e33c5f8-part4 -> ../../sda4
lrwxrwxrwx 1 root root    9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc970709d745 -> ../../sdc
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc970709d745-part1 -> ../../sdc1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070a17fa -> ../../sdd
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070a17fa-part1 -> ../../sdd1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070a9650 -> ../../sde
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070a9650-part1 -> ../../sde1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070b0b2b -> ../../sdf
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070b0b2b-part1 -> ../../sdf1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070b8fa7 -> ../../sdg
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070b8fa7-part1 -> ../../sdg1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070bda32 -> ../../sdh
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070bda32-part1 -> ../../sdh1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070c5b77 -> ../../sdi
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070c5b77-part1 -> ../../sdi1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070cbe8f -> ../../sdj
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070cbe8f-part1 -> ../../sdj1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070d301d -> ../../sdk
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070d301d-part1 -> ../../sdk1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070dad49 -> ../../sdl
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070dad49-part1 -> ../../sdl1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070e2543 -> ../../sdn
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070e2543-part1 -> ../../sdn1
lrwxrwxrwx 1 root root    9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070e9ba9 -> ../../sdm
lrwxrwxrwx 1 root root   10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070e9ba9-part1 -> ../../sdm1
ms-be1040 ~ %

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye completed:

  • cephosd1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301101259_btullis_291579_cephosd1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

We had similar issues with cloudcephosd* hosts, where the device name would change on reboot, and we sometimes ended up with one of the small drives designed for the OS being used as an OSD and vice versa. I don't remember exactly how we fixed it; I will try to dig out the relevant patches.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye

> We had similar issues with cloudcephosd* hosts, where the device name would change on reboot, and we sometimes ended up with one of the small drives designed for the OS being used as an OSD and vice versa. I don't remember exactly how we fixed it; I will try to dig out the relevant patches.

Thanks @fnegri - It looks like we're going to have this issue with any server containing many disks, so it will probably help if we work together on a generic solution.
This is what we've ended up with for now: partman_early_command.sh (L29-L73)

Would it help you if we added your recipe for the cloudcephosd* hosts to this script too?

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye completed:

  • cephosd1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301111018_btullis_541003_cephosd1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1003.eqiad.wmnet with OS bullseye completed:

  • cephosd1003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301111112_btullis_553310_cephosd1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1004.eqiad.wmnet with OS bullseye completed:

  • cephosd1004 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301111210_btullis_566177_cephosd1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1005.eqiad.wmnet with OS bullseye completed:

  • cephosd1005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301111406_btullis_592039_cephosd1005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

> @BTullis It seems you have already gone through most of the issues I went through. Some additional things to mention:

Thanks @jbond - yes I'm happy enough with this approach now, so I'll resolve this ticket.

> • It's best to configure the disks as JBOD (and it looks like you have). Swift originally had all disks configured as RAID 0 arrays with one disk in each; this means the disks are created as virtual disks, and the SCSI ID is changed from 0:0:N:0 -> 2:0:N:0.

Yes, in fact I specifically requested a simple Host Bus Adapter (Dell HBA330 Mini) with these servers, so there is no RAID capability and there are no virtual disks.

Our Hadoop worker servers have also had to use the RAID0-single-disk configuration, as their controller couldn't even use JBOD mode. I thought it better to avoid this layer of abstraction for the Ceph servers, which is why I selected an HBA instead of a RAID controller.

> • Due to the discrepancies with virtual disks, for Swift we decided to use the by-path directory, which also means that when a disk is replaced it will keep the same mapping, i.e. 0:0:4:0 will always be mounted to e.g. /srv/swift/objects4.

Yes, I think that this is very likely what we will do too. The mapping of SAS expanders and their physical ports is quite clear from /dev/disk/by-path, so I'm happy to use that when we start allocating drives to Ceph.
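
For illustration, resolving a stable by-path name to whatever kernel device name it currently holds might look like this (the by-path string below is hypothetical, not copied from these hosts):

# Resolve a by-path symlink (stable across reboots) to the current
# kernel device name. The PCI address and phy number here are made up.
readlink -f /dev/disk/by-path/pci-0000:3b:00.0-sas-exp0x500056b31234abff-phy0-lun-0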

The thing that took me some time to understand about the SCSI ordering is that, in BIOS boot mode, this HBA selects one disk to present as the legacy boot device and assigns it to 0:0:0:0; everything else is detected afterwards, and is more or less assigned SCSI device LUNs in the order of the SAS expanders and ports.

We could maybe avoid this promotion of one (and only one) disk if we were to boot via UEFI, but I didn't want to get into that now. I'm fine with selecting the last physical disk (Solid State Disk 0:2:25) to be presented as the legacy boot device on 0:0:0:0, and I'll make a note about what I've done for these hosts here: https://wikitech.wikimedia.org/wiki/Raid_setup

> • On Swift we also needed a swift_disks fact for the rest of the Puppet config; from @MatthewVernon, I don't think you need this, but let me know if we need to make it a bit more generic or add something for Ceph.

Thanks. I agree, I don't think we need it yet, but I'll come back to you if it looks like we do.

@BTullis it looks like for cloudcephosd* we used the following partman recipes: cloudcephosd1*) echo partman/standard.cfg partman/raid1-2dev.cfg ;;. My understanding is that we probably got lucky, and only on a few hosts were the wrong drives selected for the OS RAID, which was probably fixed by re-partitioning them. @Andrew might remember more details here.

> Would it help you if we added your recipe for the cloudcephosd* hosts to this script too?

For the moment, I would not touch the recipe for the cloud* hosts, but we can experiment with it the next time we need to add a new host, which might happen quite soon because we should have new NVMe servers coming soon(ish).

I also wanted to add (as mentioned in #wikimedia-ceph) that there are two moments when drive ordering is important: the first is at partition time (partman, see above); the second is when adding the drives to the Ceph cluster (with the ceph CLI). For the latter, we used a hacky method that checks the output of lsblk to find out which drives are in the OS RAID and which aren't. I like your idea of using /dev/disk/by-path; much cleaner. We can revisit our cookbook and check if we can find a way to share the cookbook or the spicerack module between our teams.
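
Something like the following rough sketch illustrates that kind of lsblk-based check (hypothetical; not the actual cookbook code):

# Sketch: whole disks with no children (no partitions or RAID members)
# are candidates for Ceph OSDs; the O/S disks carry md/LVM children.
for dev in /sys/block/sd*; do
    name=${dev##*/}
    children=$(lsblk --noheadings --output NAME "/dev/$name" | tail -n +2)
    [ -z "$children" ] && echo "/dev/$name has no partitions (candidate OSD)"
done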