We need to be able to install the cephosd servers, but at present we do not have a partman recipe ready.
This was intentional, as the storage is complex, but now the machines are ready for installation.
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Gehel | T327267 Create a DSE Kubernetes cluster with support for persistent storage from Ceph |
| Resolved | | Gehel | T324660 Install Ceph Cluster for Data Platform Engineering |
| Resolved | | BTullis | T324670 Create partman recipe for cephosd servers |
I've checked the HTTPS management interface for cephosd1001 and all looks good. The drives are presented as:
- Physical Disk 0:2:0 to Physical Disk 0:2:11 (the twelve HDDs)
- Solid State Disk 0:2:16 to Solid State Disk 0:2:23 (the eight SSDs)
- Solid State Disk 0:2:24 and Solid State Disk 0:2:25 (the two O/S drives)
- PCIe SSD in Slot 2
I suspect that this will make the operating system see the two O/S drives as /dev/sdu and /dev/sdv.
Change 870861 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Add a partman recipe for the cephosd servers
Change 870861 merged by Btullis:
[operations/puppet@production] Add a partman recipe for the cephosd servers
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye
Change 870955 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Correct the filename of the partman recipe for cephosd
Change 870955 merged by Btullis:
[operations/puppet@production] Correct the filename of the partman recipe for cephosd
Given that the physical disk IDs are all sequential and well ordered according to the iDRAC card, the results from lsscsi in the Debian installer seem very unusual.
The SCSI device IDs are almost completely alternating:
[0:0:0:0] disk KIOXIA KRM6VVUG3T84 BJ02
[0:0:1:0] disk SEAGATE ST18000NM006J PSL9
[0:0:2:0] disk KIOXIA KRM6VVUG3T84 BJ02
[0:0:3:0] disk SEAGATE ST18000NM006J PSL9
[0:0:4:0] disk KIOXIA KRM6VVUG3T84 BJ02
[0:0:5:0] disk SEAGATE ST18000NM006J PSL9
[0:0:6:0] disk KIOXIA KRM6VVUG3T84 BJ02
[0:0:7:0] disk SEAGATE ST18000NM006J PSL9
[0:0:8:0] disk KIOXIA KRM6VVUG3T84 BJ02
[0:0:9:0] disk SEAGATE ST18000NM006J PSL9
[0:0:10:0] disk KIOXIA KRM6VVUG3T84 BJ02
[0:0:11:0] disk SEAGATE ST18000NM006J PSL9
[0:0:12:0] disk KIOXIA KRM6VVUG3T84 BJ02
[0:0:13:0] disk SEAGATE ST18000NM006J PSL9
[0:0:14:0] disk KIOXIA KRM6VVUG3T84 BJ02
[0:0:15:0] disk SEAGATE ST18000NM006J PSL9
[0:0:16:0] disk ATA HFS480G3H2X069N DZ02
[0:0:17:0] disk SEAGATE ST18000NM006J PSL9
[0:0:18:0] disk ATA HFS480G3H2X069N DZ02
[0:0:19:0] disk SEAGATE ST18000NM006J PSL9
[0:0:20:0] enclosu DP BP14G+EXP 2.52
[0:0:21:0] disk SEAGATE ST18000NM006J PSL9
The KIOXIA devices are the SSDs and the SEAGATE devices are the HDDs.
I'm going to see if I can rationalize this detection order, rather than work around it.
Here is the dmesg output from the first two devices detected:
[ 69.924961] mpt3sas_cm0: port enable: SUCCESS
[ 69.925911] scsi 0:0:0:0: Direct-Access KIOXIA KRM6VVUG3T84 BJ02 PQ: 0 ANSI: 7
[ 69.925920] scsi 0:0:0:0: SSP: handle(0x000a), sas_addr(0x58ce38ee21f3c50e), phy(4), device_name(0x58ce38ee21f3c50c)
[ 69.925924] scsi 0:0:0:0: enclosure logical id (0x500056b31234abff), slot(16)
[ 69.925927] scsi 0:0:0:0: enclosure level(0x0001), connector name( )
[ 69.925931] scsi 0:0:0:0: qdepth(254), tagged(1), scsi_level(8), cmd_que(1)
[ 69.926501] end_device-0:0:1: add: handle(0x000a), sas_addr(0x58ce38ee21f3c50e)
[ 69.927889] scsi 0:0:1:0: Direct-Access SEAGATE ST18000NM006J PSL9 PQ: 0 ANSI: 7
[ 69.927898] scsi 0:0:1:0: SSP: handle(0x0016), sas_addr(0x5000c500d9bb2bb5), phy(0), device_name(0x5000c500d9bb2bb4)
[ 69.927901] scsi 0:0:1:0: enclosure logical id (0x500056b31234abff), slot(0)
[ 69.927904] scsi 0:0:1:0: enclosure level(0x0001), connector name( )
[ 69.927908] scsi 0:0:1:0: qdepth(254), tagged(1), scsi_level(8), cmd_que(1)
[ 69.931580] end_device-0:1:0: add: handle(0x0016), sas_addr(0x5000c500d9bb2bb5)
Note that it finds slot (16) first, followed by slot (0)
I suspect that this is an issue with the mpt3sas driver.
I found this in the changelog for version 30.00.00.00-1 of that driver.
We're currently using version 35.100.00.00
~ # modinfo mpt3sas|grep version:
version: 35.100.00.00
srcversion: 7602D5B15707A30D4B2E3AA
A quick check of the serial numbers seems to cast doubt on this theory, since the serial number of /dev/sda is lower than that of /dev/sdb
~ # udevadm info --query=all --name=/dev/sda | grep ID_SERIAL
E: ID_SERIAL=358ce38ee21f3c50d
E: ID_SERIAL_SHORT=58ce38ee21f3c50d
~ # udevadm info --query=all --name=/dev/sdb | grep ID_SERIAL
E: ID_SERIAL=35000c500d9bb2bb7
E: ID_SERIAL_SHORT=5000c500d9bb2bb7
However, I think it's worth checking to see whether disabling SR-IOV in the BIOS makes any difference to the device detection order.
If it doesn't, then we might want to try a newer kernel. This post suggests that there have been several important fixes since version 38.100.00.00 which landed in kernel version 5.14.
Change 874888 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Switch the cephosd servers to manual partition configuration
Change 874888 merged by Btullis:
[operations/puppet@production] Switch the cephosd servers to manual partition configuration
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye
It turns out that SR-IOV was disabled in BIOS. I tried enabling it, but it didn't make any difference, so I reverted to having it disabled.
Booting into manual partition selection also revealed strange results.
One of the HDDs is detected as SCSI bus ID (0,0,0) and allocated the device name: /dev/sda
Then all of the SSDs are detected as /dev/sdb to /dev/sdi, but with out-of-sequence SCSI bus IDs, e.g. (0,1,0), (0,3,0), (0,2,0), (0,5,0), (0,4,0) etc.
The O/S drives are /dev/sdj and /dev/sdk
Finally, the remaining HDDs are detected, once again with out-of-sequence SCSI bus IDs.
I'm going to carry out a manual installation to /dev/sdj and /dev/sdk and then see if the backported bookworm kernel behaves any differently.
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye executed with errors:
Change 875275 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Update the role used for the cephosd servers
Change 875275 merged by Btullis:
[operations/puppet@production] Update the role used for the cephosd servers
I've completed an installation with manual partitioning and I'm happy enough with it. Here's the output from lsblk.
btullis@cephosd1001:~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 447.1G 0 disk
├─sda1 8:1 0 4.7G 0 part
│ └─md0 9:0 0 4.7G 0 raid1 /boot
└─sda2 8:2 0 442.5G 0 part
  └─md1 9:1 0 442.3G 0 raid1
    ├─vg00-root 253:0 0 93.1G 0 lvm /
    ├─vg00-var 253:1 0 186.3G 0 lvm /var
    └─vg00-srv 253:2 0 139.7G 0 lvm /srv
sdb 8:16 0 3.5T 0 disk
sdc 8:32 0 3.5T 0 disk
sdd 8:48 0 3.5T 0 disk
sde 8:64 0 3.5T 0 disk
sdf 8:80 0 3.5T 0 disk
sdg 8:96 0 3.5T 0 disk
sdh 8:112 0 3.5T 0 disk
sdi 8:128 0 3.5T 0 disk
sdj 8:144 0 447.1G 0 disk
├─sdj1 8:145 0 4.7G 0 part
│ └─md0 9:0 0 4.7G 0 raid1 /boot
└─sdj2 8:146 0 442.5G 0 part
  └─md1 9:1 0 442.3G 0 raid1
    ├─vg00-root 253:0 0 93.1G 0 lvm /
    ├─vg00-var 253:1 0 186.3G 0 lvm /var
    └─vg00-srv 253:2 0 139.7G 0 lvm /srv
sdk 8:160 0 16.4T 0 disk
sdl 8:176 0 16.4T 0 disk
sdm 8:192 0 16.4T 0 disk
sdn 8:208 0 16.4T 0 disk
sdo 8:224 0 16.4T 0 disk
sdp 8:240 0 16.4T 0 disk
sdq 65:0 0 16.4T 0 disk
sdr 65:16 0 16.4T 0 disk
sds 65:32 0 16.4T 0 disk
sdt 65:48 0 16.4T 0 disk
sdu 65:64 0 16.4T 0 disk
sdv 65:80 0 16.4T 0 disk
nvme0n1 259:1 0 5.8T 0 disk
The O/S is installed to a software RAID 1 device spanning disks /dev/sda and /dev/sdj.
The SSDs are detected as /dev/sdb to /dev/sdi.
The HDDs are detected as /dev/sdk to /dev/sdv.
The reason that the devices were previously detected out of order is that only one of the drives connected to the HBA 330 Mini can be selected as the legacy option ROM boot device. Whichever drive is selected is detected first and allocated /dev/sda. With the default settings the first physical disk (Physical Disk 0:2:0) was selected, which is one of the HDDs. When I manually selected the last disk (Physical Disk 0:2:25), which is one of the two 480 GB O/S SSDs, that one was detected as /dev/sda instead, which is fine.
Now I'm going to try to repeat this with a partman recipe and then roll it out to the other cephosd servers.
Change 875290 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Update the partman recipe that is used for the cephosd servers
Change 875290 merged by Btullis:
[operations/puppet@production] Update the partman recipe that is used for the cephosd servers
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye completed:
The partman recipe is largely working as expected, but I'm still finding some unpredictability regarding the device names.
Although the software RAID array is correctly assembled each time, I have seen the second drive come up as /dev/sdl and /dev/sdi on sequential boots, neither of which matches what was in the partman recipe (/dev/sdj).
e.g. on one boot:
btullis@cephosd1001:~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 447.1G 0 disk
├─sda1 8:1 0 953M 0 part
│ └─md0 9:0 0 952M 0 raid1 /boot
├─sda2 8:2 0 4.7G 0 part
│ └─md1 9:1 0 4.7G 0 raid1 [SWAP]
└─sda3 8:3 0 441.5G 0 part
  └─md2 9:2 0 441.4G 0 raid1
    ├─cephosd1001--vg-root 253:0 0 197.7G 0 lvm /
    ├─cephosd1001--vg-var 253:1 0 68.4G 0 lvm /var
    ├─cephosd1001--vg-srv 253:2 0 68.4G 0 lvm /srv
    └─cephosd1001--vg-placeholder 253:3 0 18.6G 0 lvm
sdb 8:16 0 3.5T 0 disk
sdc 8:32 0 3.5T 0 disk
sdd 8:48 0 3.5T 0 disk
sde 8:64 0 3.5T 0 disk
sdf 8:80 0 3.5T 0 disk
sdg 8:96 0 3.5T 0 disk
sdh 8:112 0 3.5T 0 disk
sdi 8:128 0 3.5T 0 disk
sdj 8:144 0 16.4T 0 disk
sdk 8:160 0 16.4T 0 disk
sdl 8:176 0 447.1G 0 disk
├─sdl1 8:177 0 953M 0 part
│ └─md0 9:0 0 952M 0 raid1 /boot
├─sdl2 8:178 0 4.7G 0 part
│ └─md1 9:1 0 4.7G 0 raid1 [SWAP]
└─sdl3 8:179 0 441.5G 0 part
  └─md2 9:2 0 441.4G 0 raid1
    ├─cephosd1001--vg-root 253:0 0 197.7G 0 lvm /
    ├─cephosd1001--vg-var 253:1 0 68.4G 0 lvm /var
    ├─cephosd1001--vg-srv 253:2 0 68.4G 0 lvm /srv
    └─cephosd1001--vg-placeholder 253:3 0 18.6G 0 lvm
sdm 8:192 0 16.4T 0 disk
sdn 8:208 0 16.4T 0 disk
sdo 8:224 0 16.4T 0 disk
sdp 8:240 0 16.4T 0 disk
sdq 65:0 0 16.4T 0 disk
sdr 65:16 0 16.4T 0 disk
sds 65:32 0 16.4T 0 disk
sdt 65:48 0 16.4T 0 disk
sdu 65:64 0 16.4T 0 disk
sdv 65:80 0 16.4T 0 disk
nvme0n1 259:1 0 5.8T 0 disk
...but the next boot:
btullis@cephosd1001:~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 447.1G 0 disk
├─sda1 8:1 0 953M 0 part
│ └─md0 9:0 0 952M 0 raid1 /boot
├─sda2 8:2 0 4.7G 0 part
│ └─md1 9:1 0 4.7G 0 raid1 [SWAP]
└─sda3 8:3 0 441.5G 0 part
  └─md2 9:2 0 441.4G 0 raid1
    ├─cephosd1001--vg-root 253:0 0 197.7G 0 lvm /
    ├─cephosd1001--vg-var 253:1 0 68.4G 0 lvm /var
    ├─cephosd1001--vg-srv 253:2 0 68.4G 0 lvm /srv
    └─cephosd1001--vg-placeholder 253:3 0 18.6G 0 lvm
sdb 8:16 0 3.5T 0 disk
sdc 8:32 0 3.5T 0 disk
sdd 8:48 0 3.5T 0 disk
sde 8:64 0 3.5T 0 disk
sdf 8:80 0 3.5T 0 disk
sdg 8:96 0 3.5T 0 disk
sdh 8:112 0 3.5T 0 disk
sdi 8:128 0 447.1G 0 disk
├─sdi1 8:129 0 953M 0 part
│ └─md0 9:0 0 952M 0 raid1 /boot
├─sdi2 8:130 0 4.7G 0 part
│ └─md1 9:1 0 4.7G 0 raid1 [SWAP]
└─sdi3 8:131 0 441.5G 0 part
  └─md2 9:2 0 441.4G 0 raid1
    ├─cephosd1001--vg-root 253:0 0 197.7G 0 lvm /
    ├─cephosd1001--vg-var 253:1 0 68.4G 0 lvm /var
    ├─cephosd1001--vg-srv 253:2 0 68.4G 0 lvm /srv
    └─cephosd1001--vg-placeholder 253:3 0 18.6G 0 lvm
sdj 8:144 0 3.5T 0 disk
sdk 8:160 0 16.4T 0 disk
sdl 8:176 0 16.4T 0 disk
sdm 8:192 0 16.4T 0 disk
sdn 8:208 0 16.4T 0 disk
sdo 8:224 0 16.4T 0 disk
sdp 8:240 0 16.4T 0 disk
sdq 65:0 0 16.4T 0 disk
sdr 65:16 0 16.4T 0 disk
sds 65:32 0 16.4T 0 disk
sdt 65:48 0 16.4T 0 disk
sdu 65:64 0 16.4T 0 disk
sdv 65:80 0 16.4T 0 disk
nvme0n1 259:1 0 5.8T 0 disk
I'll try to find a fix for this.
I've now even seen /dev/sda and /dev/sdb swap over on a subsequent boot, with no other changes.
btullis@cephosd1001:~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 3.5T 0 disk
sdb 8:16 0 447.1G 0 disk
├─sdb1 8:17 0 953M 0 part
│ └─md0 9:0 0 952M 0 raid1 /boot
├─sdb2 8:18 0 4.7G 0 part
│ └─md1 9:1 0 4.7G 0 raid1 [SWAP]
└─sdb3 8:19 0 441.5G 0 part
  └─md2 9:2 0 441.4G 0 raid1
    ├─cephosd1001--vg-root 253:0 0 197.7G 0 lvm /
    ├─cephosd1001--vg-var 253:1 0 68.4G 0 lvm /var
    ├─cephosd1001--vg-srv 253:2 0 68.4G 0 lvm /srv
    └─cephosd1001--vg-placeholder 253:3 0 18.6G 0 lvm
sdc 8:32 0 3.5T 0 disk
sdd 8:48 0 3.5T 0 disk
sde 8:64 0 3.5T 0 disk
sdf 8:80 0 3.5T 0 disk
sdg 8:96 0 3.5T 0 disk
sdh 8:112 0 3.5T 0 disk
sdi 8:128 0 3.5T 0 disk
sdj 8:144 0 447.1G 0 disk
├─sdj1 8:145 0 953M 0 part
│ └─md0 9:0 0 952M 0 raid1 /boot
├─sdj2 8:146 0 4.7G 0 part
│ └─md1 9:1 0 4.7G 0 raid1 [SWAP]
└─sdj3 8:147 0 441.5G 0 part
  └─md2 9:2 0 441.4G 0 raid1
    ├─cephosd1001--vg-root 253:0 0 197.7G 0 lvm /
    ├─cephosd1001--vg-var 253:1 0 68.4G 0 lvm /var
    ├─cephosd1001--vg-srv 253:2 0 68.4G 0 lvm /srv
    └─cephosd1001--vg-placeholder 253:3 0 18.6G 0 lvm
sdk 8:160 0 16.4T 0 disk
sdl 8:176 0 16.4T 0 disk
sdm 8:192 0 16.4T 0 disk
sdn 8:208 0 16.4T 0 disk
sdo 8:224 0 16.4T 0 disk
sdp 8:240 0 16.4T 0 disk
sdq 65:0 0 16.4T 0 disk
sdr 65:16 0 16.4T 0 disk
sds 65:32 0 16.4T 0 disk
sdt 65:48 0 16.4T 0 disk
sdu 65:64 0 16.4T 0 disk
sdv 65:80 0 16.4T 0 disk
nvme0n1 259:1 0 5.8T 0 disk
Looking more closely at lsscsi -v, we can see that several of the device names are out of order, although the SCSI bus IDs are correct.
I'm going to try the bullseye-backports kernel, to see if it's more predictable.
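For example, a quick way to eyeball that mismatch is something like the following one-liner (illustrative only, not taken from the installer logs):

```
# Print just the SCSI bus ID and the device node that lsscsi reports for each
# entry, to compare the bus ID ordering with the kernel name ordering.
lsscsi | awk '{print $1, $NF}'
```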
Change 875395 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Update the cephosd partman recipe again
Change 875395 merged by Btullis:
[operations/puppet@production] Update the cephosd partman recipe again
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye
Kernel 6.0.12 from backports is no better, unfortunately.
btullis@cephosd1001:~$ /sbin/modinfo mpt3sas|grep version
version: 42.100.00.00
srcversion: BD0C3F71FA88C67CE030B9D
vermagic: 6.0.0-0.deb11.6-amd64 SMP preempt mod_unload modversions
The good thing is that the SCSI bus IDs are all consistently correct, so we will be able to refer to /dev/disk/by-id and /dev/disk/by-path consistently when working with Ceph.
So while consistent device names would be nice, we're going to have to adapt to the fact that we can't depend on them.
I see that @jbond and @MatthewVernon recently worked on a similar issue in T308677: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem
The fix that was implemented was an autoinstall/scripts/partman_early_command.sh file, containing a script to ascertain the correct device names to use.
I think that I might try the same approach here. It should be OK to use /dev/disk/by-id with the model number of the disk, e.g.
/sys/devices # ls -l /dev/disk/by-id/ata-HFS480G3H2X069N*
lrwxrwxrwx 1 root root 9 Jan 4 18:21 /dev/disk/by-id/ata-HFS480G3H2X069N_ENB3N6461I2103U4C -> ../../sda
lrwxrwxrwx 1 root root 9 Jan 4 18:21 /dev/disk/by-id/ata-HFS480G3H2X069N_ENB3N6461I2103U4G -> ../../sdj
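For context, a script like that is wired in via the d-i partman/early_command preseed hook; a minimal, hypothetical example of how it could be invoked (the URL below is a placeholder, not our actual preseed):

```
# Hypothetical preseed snippet: fetch and run a helper script before partman
# starts, so it can work out the right devices before partitioning begins.
d-i partman/early_command string \
    wget -O /tmp/partman_early_command.sh http://apt.example.org/autoinstall/scripts/partman_early_command.sh && \
    sh /tmp/partman_early_command.sh
```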
Yeah, I think the takeaway is "you can (no longer) rely on device names being consistent between reboots".
For Ceph, though, this mostly isn't an issue - the device gets enough metadata on it that on boot ceph knows which OSD it is, so you never need to worry about fstab.
In terms of the installer, I think you might want to aim for something more reliable than the model number of the drive (sadness when it's slightly changed on the next hardware order); @jbond's approach (which I've done something similar to in the past) of having a small script that works out which two drives you want the installer to work on and then produces a partman recipe might be the way to go here too?
e.g. look for /sys/block/sd*/queue/rotational == 0 and size < 3T (and a sanity check that this results in 2 devices)?
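Something like this minimal sketch, assuming the goal is exactly two non-rotational drives under a size threshold (the threshold and the output format here are placeholders, not the script that was eventually merged):

```
#!/bin/sh
# Hypothetical sketch: pick the O/S drives by looking for non-rotational
# devices below a size threshold, then sanity-check that exactly two match.
THRESHOLD_BYTES=$((3 * 1000 * 1000 * 1000 * 1000))  # ~3 TB, as suggested above
OS_DISKS=""
COUNT=0
for dev in /sys/block/sd*; do
    [ -e "$dev/queue/rotational" ] || continue
    rotational=$(cat "$dev/queue/rotational")
    # The 'size' attribute is in 512-byte sectors.
    bytes=$(( $(cat "$dev/size") * 512 ))
    if [ "$rotational" -eq 0 ] && [ "$bytes" -lt "$THRESHOLD_BYTES" ]; then
        OS_DISKS="$OS_DISKS /dev/$(basename "$dev")"
        COUNT=$((COUNT + 1))
    fi
done
if [ "$COUNT" -ne 2 ]; then
    echo "Expected exactly 2 candidate O/S disks, found $COUNT:$OS_DISKS" >&2
    exit 1
fi
echo "O/S disks:$OS_DISKS"
```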
Change 876237 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Detect the correct disks for the O/S on the cephosd servers
Thanks @MatthewVernon - I've gone with your suggestion, with the only difference being that it's searching for SSDs < 1TB in size.
I've added you and @jbond as reviewers. Hope that's OK.
Change 876237 merged by Btullis:
[operations/puppet@production] Detect the correct disks for the O/S on the cephosd servers
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye
Change 878005 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Correct the units for the cephosd volumes
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye executed with errors:
Change 878005 merged by Btullis:
[operations/puppet@production] Correct the units for the cephosd volumes
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye
Change 878007 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Reduce the size of the partitions on cephosd servers
Change 878007 merged by Btullis:
[operations/puppet@production] Reduce the size of the partitions on cephosd servers
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye
@BTullis Seems you have already gone through most of the issues I went through. Some additional things to mention:
Here is an example /dev/disk/by-id listing from a host with a mix of virtual and direct-access disks:
ms-be1040 ~ % ls -la /dev/disk/by-id [13:23:42] total 0 drwxr-xr-x 2 root root 1480 Oct 21 14:14 . drwxr-xr-x 8 root root 160 Oct 21 14:14 .. lrwxrwxrwx 1 root root 9 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS742006W4480BGN -> ../../sdb lrwxrwxrwx 1 root root 10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS742006W4480BGN-part1 -> ../../sdb1 lrwxrwxrwx 1 root root 10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS742006W4480BGN-part2 -> ../../sdb2 lrwxrwxrwx 1 root root 10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS742006W4480BGN-part3 -> ../../sdb3 lrwxrwxrwx 1 root root 10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS742006W4480BGN-part4 -> ../../sdb4 lrwxrwxrwx 1 root root 9 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS7420070S480BGN -> ../../sda lrwxrwxrwx 1 root root 10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS7420070S480BGN-part1 -> ../../sda1 lrwxrwxrwx 1 root root 10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS7420070S480BGN-part2 -> ../../sda2 lrwxrwxrwx 1 root root 10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS7420070S480BGN-part3 -> ../../sda3 lrwxrwxrwx 1 root root 10 Oct 25 08:14 ata-SSDSC2KB480G7R_PHYS7420070S480BGN-part4 -> ../../sda4 lrwxrwxrwx 1 root root 9 Oct 25 08:14 md-name-ms-be1040:0 -> ../../md0 lrwxrwxrwx 1 root root 9 Oct 25 08:14 md-name-ms-be1040:1 -> ../../md1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 md-uuid-224e169d:9dde1acd:adc104e4:12beafec -> ../../md1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 md-uuid-d9ca5c37:710fdf2d:34dd3357:836d649c -> ../../md0 lrwxrwxrwx 1 root root 9 Oct 25 08:14 scsi-36d09466049bccf002269cc970709d745 -> ../../sdc lrwxrwxrwx 1 root root 10 Oct 25 08:14 scsi-36d09466049bccf002269cc970709d745-part1 -> ../../sdc1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070a17fa -> ../../sdd lrwxrwxrwx 1 root root 10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070a17fa-part1 -> ../../sdd1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070a9650 -> ../../sde lrwxrwxrwx 1 root root 10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070a9650-part1 -> ../../sde1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070b0b2b -> ../../sdf lrwxrwxrwx 1 root root 10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070b0b2b-part1 -> ../../sdf1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070b8fa7 -> ../../sdg lrwxrwxrwx 1 root root 10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070b8fa7-part1 -> ../../sdg1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070bda32 -> ../../sdh lrwxrwxrwx 1 root root 10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070bda32-part1 -> ../../sdh1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070c5b77 -> ../../sdi lrwxrwxrwx 1 root root 10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070c5b77-part1 -> ../../sdi1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070cbe8f -> ../../sdj lrwxrwxrwx 1 root root 10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070cbe8f-part1 -> ../../sdj1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070d301d -> ../../sdk lrwxrwxrwx 1 root root 10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070d301d-part1 -> ../../sdk1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070dad49 -> ../../sdl lrwxrwxrwx 1 root root 10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070dad49-part1 -> ../../sdl1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 scsi-36d09466049bccf002269cc97070e2543 -> ../../sdn lrwxrwxrwx 1 root root 10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070e2543-part1 -> ../../sdn1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 
scsi-36d09466049bccf002269cc97070e9ba9 -> ../../sdm lrwxrwxrwx 1 root root 10 Oct 25 08:14 scsi-36d09466049bccf002269cc97070e9ba9-part1 -> ../../sdm1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 wwn-0x55cd2e414e33c5a2 -> ../../sdb lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x55cd2e414e33c5a2-part1 -> ../../sdb1 lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x55cd2e414e33c5a2-part2 -> ../../sdb2 lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x55cd2e414e33c5a2-part3 -> ../../sdb3 lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x55cd2e414e33c5a2-part4 -> ../../sdb4 lrwxrwxrwx 1 root root 9 Oct 25 08:14 wwn-0x55cd2e414e33c5f8 -> ../../sda lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x55cd2e414e33c5f8-part1 -> ../../sda1 lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x55cd2e414e33c5f8-part2 -> ../../sda2 lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x55cd2e414e33c5f8-part3 -> ../../sda3 lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x55cd2e414e33c5f8-part4 -> ../../sda4 lrwxrwxrwx 1 root root 9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc970709d745 -> ../../sdc lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc970709d745-part1 -> ../../sdc1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070a17fa -> ../../sdd lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070a17fa-part1 -> ../../sdd1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070a9650 -> ../../sde lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070a9650-part1 -> ../../sde1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070b0b2b -> ../../sdf lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070b0b2b-part1 -> ../../sdf1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070b8fa7 -> ../../sdg lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070b8fa7-part1 -> ../../sdg1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070bda32 -> ../../sdh lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070bda32-part1 -> ../../sdh1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070c5b77 -> ../../sdi lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070c5b77-part1 -> ../../sdi1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070cbe8f -> ../../sdj lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070cbe8f-part1 -> ../../sdj1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070d301d -> ../../sdk lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070d301d-part1 -> ../../sdk1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070dad49 -> ../../sdl lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070dad49-part1 -> ../../sdl1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070e2543 -> ../../sdn lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070e2543-part1 -> ../../sdn1 lrwxrwxrwx 1 root root 9 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070e9ba9 -> ../../sdm lrwxrwxrwx 1 root root 10 Oct 25 08:14 wwn-0x6d09466049bccf002269cc97070e9ba9-part1 -> ../../sdm1 ms-be1040 ~ %
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet with OS bullseye completed:
We had similar issues with cloudcephosd* hosts, where the device name would change on reboot, and we sometimes ended up with one of the small drives designed for the OS being used as an OSD and vice versa. I don't remember exactly how we fixed it, I will try to dig out the relevant patches.
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye
Thanks @fnegri - It looks like we're going to have this issue with any server containing many disks, so it will probably help if we work together on a generic solution.
This is what we've ended up with for now: partman_early_command.sh#L29-L73
Would it help you if we added your recipe for the cloudcephosd* hosts to this script too?
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1003.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1003.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1004.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1004.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1005.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1005.eqiad.wmnet with OS bullseye completed:
Thanks @jbond - yes I'm happy enough with this approach now, so I'll resolve this ticket.
- It's best to configure the disks as JBOD (and it looks like you have). Swift originally had all disks configured as RAID 0 arrays with one disk in them; this means the disks are created as virtual disks and the SCSI ID is changed from 0:0:N:0 -> 2:0:N:0.
Yes, in fact I specifically requested a simple Host Bus Adapter (Dell HBA330 Mini) with these servers so there is no RAID capability and no virtual disks.
Our Hadoop worker servers have also had to use the RAID0-single-disk configuration, as their controller couldn't even use JBOD mode. I thought it better to try to avoid this layer of abstraction for the Ceph servers, which is why I've selected an HBA instead of a RAID controller.
- Due to the discrepancies with virtual disks for Swift, we decided to use the by-path directory, which also means that when a disk is changed it will have the same mapping, i.e. 0:0:4:0 will always be mounted to e.g. /srv/swift/objects4.
Yes, I think that this is very likely what we will do too. The mapping of SAS expanders and their physical ports is quite clear from /dev/disk/by-path so I'm happy to use that when we start allocating drives to Ceph.
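For example, something along these lines (an illustration only; the by-path name below is invented and we haven't run this yet):

```
# Hypothetical example: create an OSD against a stable by-path name rather
# than a /dev/sdX name that can change between boots.
ceph-volume lvm create --data /dev/disk/by-path/pci-0000:01:00.0-sas-exp0x500056b31234abff-phy4-lun-0
```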
The thing that took me some time to understand about the SCSI ordering is that, in BIOS boot mode, this HBA selects one disk to present as the legacy boot device and assigns it 0:0:0:0; everything else is detected afterwards and is more or less assigned SCSI IDs in the order of the SAS expanders and ports.
We could maybe avoid this promotion of one (and only one) disk if we were to use UEFI to boot, but I didn't want to get into that now. I'm fine with selecting the last physical disk (Solid State Disk 0:2:25) to be presented as the legacy boot device on 0:0:0:0, and I'll make a note of what I've done for these hosts here: https://wikitech.wikimedia.org/wiki/Raid_setup
- On Swift we also needed a swift_disks fact, from @MatthewVernon, for the rest of the Puppet config. I don't think you need this, but let me know if we need to make it a bit more generic or add something for Ceph.
Thanks. I agree, I don't think we need it yet, but I'll come back to you if it looks like we do.
@BTullis it looks like for cloudcephosd* we used the following partman recipes: cloudcephosd1*) echo partman/standard.cfg partman/raid1-2dev.cfg ;;. My understanding is that we probably got lucky and only on a few hosts were the wrong drives selected for the OS RAID, and that was probably fixed by re-partitioning them. @Andrew might remember more details here.
> Would it help you if we added your recipe for the cloudcephosd* hosts to this script too?
For the moment, I would not touch the recipe for the cloud* hosts, but we can experiment with it the next time we need to add a new host, which might happen quite soon because we should have new NVMe servers coming soon(ish).
I also wanted to add (as mentioned in #wikimedia-ceph) that there are two moments when drive ordering is important: the first is at partition time (partman, see above); the second is when adding the drives to the Ceph cluster (with the ceph CLI). For the latter, we used a hacky method that checks the output of lsblk to find out which drives are in the OS RAID and which aren't. I like your idea of using /dev/disk/by-path, which is much cleaner. We can revisit our cookbook and check whether we can find a way to share the cookbook or the spicerack module between our teams.
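Roughly, the lsblk-based check we used looks something like this (a simplified sketch, not the actual cookbook code):

```
#!/bin/sh
# Hypothetical sketch of the "check lsblk" approach: report which whole disks
# back the O/S software RAID and which are free to become Ceph OSDs.
for disk in $(lsblk -dno NAME,TYPE | awk '$2 == "disk" {print $1}'); do
    # If any descendant of this disk is an md RAID member, treat it as an O/S disk.
    if lsblk -no TYPE "/dev/$disk" | grep -q '^raid'; then
        echo "/dev/$disk is part of the O/S RAID, skipping"
    else
        echo "/dev/$disk is available for Ceph"
    fi
done
```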