
Verify that cephosd* server reimages work without adversely affecting cluster availability
Closed, Resolved · Public

Description

We need to be certain that the bootstrapping process works for our ceph servers (cephosd*) without adversely affecting the cluster availability or its configuration.

The last time we tried a reimage, the 20 OSDs that were previously known to the host were not detected and the server ended up creating 20 new OSDs.

Something went wrong with the unless condition here: https://github.com/wikimedia/operations-puppet/blob/production/modules/ceph/manifests/osd.pp#L82-L94, which means that the host could not associate its local disks with the OSDs that were already present in the cluster.

The command ceph-volume lvm list ${device} doesn't return the expected value on a newly reimaged ceph server, so another OSD is created instead.
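For illustration, the shape of that check from the shell is roughly as follows; the output format shown is an assumption based on how ceph-volume normally reports existing OSDs, not a capture from these hosts.

# On a healthy host, ceph-volume maps the device back to its existing OSD:
sudo ceph-volume lvm list /dev/sdb
# ====== osd.12 ======
#   [block]  /dev/ceph-<uuid>/osd-block-<uuid>
#   ...
# After the faulty reimage, the same command reported nothing for the
# device, so the Puppet guard concluded that no OSD existed and prepared
# a new one.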

Event Timeline

Change #1063824 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Disable wiping LVM signatures on cephosd server reimages

https://gerrit.wikimedia.org/r/1063824

BTullis triaged this task as High priority. Aug 19 2024, 2:57 PM

Change #1063824 merged by Btullis:

[operations/puppet@production] Disable wiping LVM signatures on cephosd server reimages

https://gerrit.wikimedia.org/r/1063824

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1005 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm

Something is still not quite right because the installer paused at this question, asking whether or not the following LVM volume groups and physical volumes should be removed.

[image.png (438×734 px, 65 KB): Debian installer prompt asking whether the existing LVM volume groups and physical volumes should be removed]

I selected No, but there was an error reported in /var/log/syslog:

Aug 21 12:33:35 partman-auto-raid: mdadm: cannot open /dev/sda1: Device or resource busy
Aug 21 12:33:35 partman-auto-raid: Error creating array /dev/md3

Looking at another cephosd host, I think we can see why: the host O/S also uses LVM, so the signatures on its ceph, root, and var logical volumes do need to be removed.

btullis@cephosd1001:~$ sudo pvs
  PV         VG                                        Fmt  Attr PSize   PFree  
  /dev/md2   cephosd1001-vg                            lvm2 a--  441.41g <92.18g
  /dev/sdb   ceph-cc61d327-b7d1-4de9-bb9e-2c1a98b8a34f lvm2 a--    3.49t      0 
  /dev/sdc   ceph-982d988d-0171-449c-845f-68d02e12fe1a lvm2 a--    3.49t      0 
  /dev/sdd   ceph-93a4e8e0-b331-4a93-9d9a-aa5f2afa1cbb lvm2 a--    3.49t      0 
  /dev/sde   ceph-23916282-e567-4e78-832c-1277ce19978b lvm2 a--    3.49t      0 
  /dev/sdf   ceph-832ff915-6b38-4c2c-b70a-ecb78ee2cde6 lvm2 a--    3.49t      0 
  /dev/sdg   ceph-a3495841-7ada-4cd2-a9ea-6fb8d618f48a lvm2 a--    3.49t      0 
  /dev/sdh   ceph-5c71e79b-800e-4e04-aaa7-b34ca29e42af lvm2 a--    3.49t      0 
  /dev/sdi   ceph-dcfe61d4-a6e4-4ad7-9f6e-522431414f31 lvm2 a--    3.49t      0 
  /dev/sdk   ceph-4b9fc877-6bf3-4d7f-b956-ff0174bcffb8 lvm2 a--   16.37t      0 
  /dev/sdl   ceph-4257bb2d-07d0-468f-9ba8-28a9a4dc6c69 lvm2 a--   16.37t      0 
  /dev/sdm   ceph-feb4502b-95dc-4198-aa96-a9670b644eb0 lvm2 a--   16.37t      0 
  /dev/sdn   ceph-16404e38-f89c-4245-9e7e-4ebdea6399b4 lvm2 a--   16.37t      0 
  /dev/sdo   ceph-357695fa-45f5-4977-8962-a21bb7c7883a lvm2 a--   16.37t      0 
  /dev/sdp   ceph-df28f993-342e-4d7c-bb60-80b9affcd4e7 lvm2 a--   16.37t      0 
  /dev/sdq   ceph-79272494-9fc5-4668-ad6b-7731f4f786b7 lvm2 a--   16.37t      0 
  /dev/sdr   ceph-1abdabae-0c18-4cfb-8b38-c68c88aa4b85 lvm2 a--   16.37t      0 
  /dev/sds   ceph-2a7a1121-92a0-4bc4-900f-9c0ef478090e lvm2 a--   16.37t      0 
  /dev/sdt   ceph-f6727e01-81ba-4b93-8345-be88d89c4965 lvm2 a--   16.37t      0 
  /dev/sdu   ceph-9639da0e-0f32-4120-8c5f-3875534d1d7b lvm2 a--   16.37t      0 
  /dev/sdv   ceph-f0ae6e5f-c1eb-481b-8f25-1ebf94c4c5b3 lvm2 a--   16.37t      0 

btullis@cephosd1001:~$ sudo lvs
  LV                                             VG                                        Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  osd-block-8513dc17-cdf8-4de4-9d0a-366bf4554487 ceph-16404e38-f89c-4245-9e7e-4ebdea6399b4 -wi-ao----  16.37t                                                    
  osd-block-1bf61211-7c1b-4c63-a426-364c0b6c9fdd ceph-1abdabae-0c18-4cfb-8b38-c68c88aa4b85 -wi-ao----  16.37t                                                    
  osd-block-26adc64b-e6d4-4ff1-9bf4-73846dd1d5be ceph-23916282-e567-4e78-832c-1277ce19978b -wi-ao----   3.49t                                                    
  osd-block-c19effaf-9c5d-49d6-a34a-3c567b16458e ceph-2a7a1121-92a0-4bc4-900f-9c0ef478090e -wi-ao----  16.37t                                                    
  osd-block-6d8b1a9f-86fe-4b78-8dd1-fb499f5ca0bc ceph-357695fa-45f5-4977-8962-a21bb7c7883a -wi-ao----  16.37t                                                    
  osd-block-02347158-3ce3-4a2c-88db-40d04c92cac5 ceph-4257bb2d-07d0-468f-9ba8-28a9a4dc6c69 -wi-ao----  16.37t                                                    
  osd-block-8f05dffe-b033-49f2-b895-b3fc11b397f0 ceph-4b9fc877-6bf3-4d7f-b956-ff0174bcffb8 -wi-ao----  16.37t                                                    
  osd-block-fa1a3475-f061-4470-84ae-63bd691ce152 ceph-5c71e79b-800e-4e04-aaa7-b34ca29e42af -wi-ao----   3.49t                                                    
  osd-block-c8d699b5-fa33-4596-8180-22dcf81da72a ceph-79272494-9fc5-4668-ad6b-7731f4f786b7 -wi-ao----  16.37t                                                    
  osd-block-2b2e057a-79f4-41f9-b2ec-3913be78000d ceph-832ff915-6b38-4c2c-b70a-ecb78ee2cde6 -wi-ao----   3.49t                                                    
  osd-block-51ee9dfc-44f8-4cff-afa8-88338af445f6 ceph-93a4e8e0-b331-4a93-9d9a-aa5f2afa1cbb -wi-ao----   3.49t                                                    
  osd-block-4eadaedd-79bb-4202-b45c-ccd5ecd0144a ceph-9639da0e-0f32-4120-8c5f-3875534d1d7b -wi-ao----  16.37t                                                    
  osd-block-84340a1f-3ba3-41ae-8a2d-de9bc4c5bf09 ceph-982d988d-0171-449c-845f-68d02e12fe1a -wi-ao----   3.49t                                                    
  osd-block-b35603ca-72c2-48ff-ad39-c38a34f81125 ceph-a3495841-7ada-4cd2-a9ea-6fb8d618f48a -wi-ao----   3.49t                                                    
  osd-block-ae2e27de-5cb1-440c-9c9b-ee151a13839f ceph-cc61d327-b7d1-4de9-bb9e-2c1a98b8a34f -wi-ao----   3.49t                                                    
  osd-block-5206bf20-7b00-454a-b378-2be145da0ffa ceph-dcfe61d4-a6e4-4ad7-9f6e-522431414f31 -wi-ao----   3.49t                                                    
  osd-block-9961d48f-3389-4a13-96f4-a030674a0efc ceph-df28f993-342e-4d7c-bb60-80b9affcd4e7 -wi-ao----  16.37t                                                    
  osd-block-c6cf3c83-7be5-4575-a056-19defee77baa ceph-f0ae6e5f-c1eb-481b-8f25-1ebf94c4c5b3 -wi-ao----  16.37t                                                    
  osd-block-50a4ebdd-4211-4c69-9293-bd7184c61872 ceph-f6727e01-81ba-4b93-8345-be88d89c4965 -wi-ao----  16.37t                                                    
  osd-block-ff5fa832-114e-457a-a168-4862309034ae ceph-feb4502b-95dc-4198-aa96-a9670b644eb0 -wi-ao----  16.37t                                                    
  ceph                                           cephosd1001-vg                            -wi-ao---- <93.13g                                                    
  root                                           cephosd1001-vg                            -wi-ao---- <69.85g                                                    
  var                                            cephosd1001-vg                            -wi-ao---- 186.26g

This differs from, for example, moss-be1001, which doesn't use LVM for its root and other local file systems.

btullis@moss-be1001:~$ df -h -x tmpfs -x devfs
Filesystem      Size  Used Avail Use% Mounted on
udev             63G     0   63G   0% /dev
/dev/md0        438G   11G  405G   3% /

I suppose that I could revert to using traditional partitioning for the root on these ceph servers, but I will have a look to see if there are any other possible solutions first.

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Change #1064388 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] cephosd: remove a subset of LVM signatures during reimage

https://gerrit.wikimedia.org/r/1064388

I have modified the modules/install_server/files/autoinstall/scripts/partman_early_command.sh script so that it removes the LVM signatures only on the drives required for the O/S installation.

At the moment this updated script would also apply to the cloudceph* servers, so I will check that they are happy with this approach as well before proceeding.
If not, I can update the hostname match to exclude those servers.
cc @MatthewVernon @fnegri
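
For illustration only, the hostname gating could take a shape like the sketch below in partman_early_command.sh. The remove_my_hostname_lvm name follows the discussion on this task; the actual matching logic in operations/puppet is authoritative.

# Sketch: run the targeted LVM cleanup only on ceph OSD hosts.
# remove_my_hostname_lvm is defined in the real script; its body is elided here.
case "$(hostname)" in
    cephosd*|cloudcephosd*)
        remove_my_hostname_lvm
        ;;
esac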

I think this is fine, and I actually feel more confident if we share the same scripts for partitioning cephosd and cloudcephosd, so hopefully we have a better chance of catching bugs and edge cases. :)

Something went wrong with the unless condition here: https://github.com/wikimedia/operations-puppet/blob/production/modules/ceph/manifests/osd.pp#L82-L94, which means that the host could not associate its local disks with the OSDs that were already present in the cluster.

Should we also remove this "unless" condition if device_remove_lvm is now set to false (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064388) and we're relying on the new remove_my_hostname_lvm function?

Should we also remove this "unless" condition

Or is it required to avoid running "lvm prepare" on the OS volume?

Yes, it's exactly that. We want ceph-volume lvm list ${device} to work on a newly reimaged host with existing bluestore OSDs.

We tried telling it not to remove the LVM signatures, but then the O/S installation had a problem because it found an existing signature on /dev/md2.
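
In shell terms, the guard in osd.pp amounts to something like the following paraphrase of the Puppet exec/unless pair, with /dev/sdb as a stand-in device; the exit-code behaviour of ceph-volume lvm list here is an assumption.

device=/dev/sdb
# Only prepare a fresh bluestore OSD if ceph-volume does not already
# associate this device with an existing OSD.
if ! ceph-volume lvm list "${device}" > /dev/null 2>&1; then
    ceph-volume lvm prepare --bluestore --data "${device}"
fi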

Change #1064399 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] cephosd: Update logical volume labels to match mounts

https://gerrit.wikimedia.org/r/1064399

Change #1064388 merged by Btullis:

[operations/puppet@production] cephosd: remove a subset of LVM signatures during reimage

https://gerrit.wikimedia.org/r/1064388

Change #1064399 merged by Btullis:

[operations/puppet@production] cephosd: Update logical volume labels to match mounts

https://gerrit.wikimedia.org/r/1064399

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1005 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Change #1064727 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] cephosd: Remove MD RAID metadata from devices prior to install

https://gerrit.wikimedia.org/r/1064727

Change #1064727 merged by Btullis:

[operations/puppet@production] cephosd: Remove MD RAID metadata from devices prior to install

https://gerrit.wikimedia.org/r/1064727

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm

Change #1064735 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] cephosd: Disable swap devices prior to removing MD RAID metadata

https://gerrit.wikimedia.org/r/1064735

Change #1064735 merged by Btullis:

[operations/puppet@production] cephosd: Disable swap devices prior to removing MD RAID metadata

https://gerrit.wikimedia.org/r/1064735

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Change #1064773 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] cephosd: Don't fail if /proc/mdstat doesn't exist

https://gerrit.wikimedia.org/r/1064773

Change #1064773 merged by Btullis:

[operations/puppet@production] cephosd: Don't fail if /proc/mdstat doesn't exist

https://gerrit.wikimedia.org/r/1064773

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Change #1064807 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] cephosd: Assemble the MD RAID arrays, so that they can be removed

https://gerrit.wikimedia.org/r/1064807

https://netbox.wikimedia.org/extras/scripts/results/82927/

cephosd1005 (WMF10631) Device is Active in Netbox but is missing from PuppetDB (should be ('decommissioning', 'inventory', 'offline', 'planned', 'staged', 'failed'))

I've set it to failed based on T372783#10085489. A successful re-image will automatically set it to active.

Change #1064807 merged by Btullis:

[operations/puppet@production] cephosd: Assemble the MD RAID arrays, so that they can be removed

https://gerrit.wikimedia.org/r/1064807

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm

Change #1065143 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] cephosd: Fix the grep for finding MD array members

https://gerrit.wikimedia.org/r/1065143

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Change #1065143 merged by Btullis:

[operations/puppet@production] cephosd: Fix the grep for finding MD array members

https://gerrit.wikimedia.org/r/1065143

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm

Change #1065146 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] cephosd: Do not fail if no MD RAID arrays are discovered

https://gerrit.wikimedia.org/r/1065146

Change #1065146 merged by Btullis:

[operations/puppet@production] cephosd: Do not fail if no MD RAID arrays are discovered

https://gerrit.wikimedia.org/r/1065146

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1005.eqiad.wmnet with OS bookworm completed:

  • cephosd1005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408231040_btullis_3768973_cephosd1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status failed -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1004.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1004.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1004.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1004.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1004.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1004.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Change #1065180 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] cephosd: Remove LVM signatures in addition to MD RAID metadata

https://gerrit.wikimedia.org/r/1065180

Change #1065180 merged by Btullis:

[operations/puppet@production] cephosd: Remove LVM signatures in addition to MD RAID metadata

https://gerrit.wikimedia.org/r/1065180

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1004.eqiad.wmnet with OS bookworm

I have modified the modules/install_server/files/autoinstall/scripts/partman_early_command.sh script so that it removes the LVM signatures on the drives required for the O/S installation only.

I am getting there now, after many iterations of the script and aborted reimages.

My initial attempt didn't work because, when partman_early_command.sh is executed, the MD RAID devices haven't been assembled, so /proc/mdstat isn't present and the /dev/md2 PV isn't detected.

I have had to change the logic so that it does the following (a condensed shell sketch follows the list):

  • Scan for any MD devices with mdadm --assemble --scan but do not fail if nothing is found
  • Disable any swap devices that may have been activated on MD RAID
  • Ascertain whether any PVs are using devices named /dev/md*. If there are:
    • Remove any LVs using this device
    • Remove any VG matching $(hostname)-vg
    • Remove the matching PV (/dev/md2)
  • Stop all MD arrays /dev/md/*
  • Zero out the MD signatures on all RAID member partitions
  • Continue with the installation
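
The sketch below condenses that sequence; the authoritative logic lives in partman_early_command.sh in operations/puppet, and the exact flags and error handling here are assumptions.

# Assemble any MD arrays, tolerating the case where none are found.
mdadm --assemble --scan || true

# Disable any swap devices that may have been activated on MD RAID.
swapoff -a || true

# If any PV sits on an MD device, remove the $(hostname)-vg volume group
# (which also removes its LVs) and then the PV itself. The Ceph OSD PVs
# sit on bare /dev/sd* devices, so they are left alone.
for pv in $(pvs --noheadings -o pv_name 2>/dev/null | grep '/dev/md'); do
    vg="$(hostname)-vg"
    lvremove -ff -y "$vg" 2>/dev/null || true
    vgremove -ff "$vg" 2>/dev/null || true
    pvremove -ff -y "$pv" || true
done

# Record the RAID member partitions before stopping the arrays, then stop
# all arrays and zero the MD superblocks so that partman can recreate
# /dev/md* from scratch.
members=$(grep -oE 'sd[a-z]+[0-9]+' /proc/mdstat 2>/dev/null | sort -u)
mdadm --stop --scan || true
for part in $members; do
    mdadm --zero-superblock "/dev/$part" || true
done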

What is reassuring is that the original 20 OSDs were successfully started on the reimaged host, which was the original intention.
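
As an illustrative check (not taken from the original log), the re-detected OSDs can be confirmed with standard commands; the header format that the grep matches is an assumption.

# Count the OSDs that ceph-volume re-discovered from the existing LVs;
# on these hosts the expected number is 20.
sudo ceph-volume lvm list | grep -c '^====== osd'
# Confirm that the host's OSDs have rejoined the cluster as 'up'.
sudo ceph osd tree up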

I am now testing on cephosd1004 and, once this is done, I will reimage 1003, 1002, and 1001 in sequence.

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1004.eqiad.wmnet with OS bookworm completed:

  • cephosd1004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408231208_btullis_3790944_cephosd1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1003.eqiad.wmnet with OS bookworm completed:

  • cephosd1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408231257_btullis_3802456_cephosd1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1002.eqiad.wmnet with OS bookworm completed:

  • cephosd1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408231445_btullis_3820227_cephosd1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1001.eqiad.wmnet with OS bookworm executed with errors:

  • cephosd1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cephosd1001.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephosd1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephosd1001.eqiad.wmnet with OS bookworm completed:

  • cephosd1001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408231554_btullis_3831251_cephosd1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

All five ceph servers have now been successfully reimaged using this new method and the OSDs were successfully reused.

I see one error, which is:

2024-08-23T16:30:00.000203+0000 mon.cephosd1001 (mon.0) 394 : cluster 3 Health detail: HEALTH_WARN 14 mgr modules have failed dependencies
2024-08-23T16:30:00.000250+0000 mon.cephosd1001 (mon.0) 395 : cluster 3 [WRN] MGR_MODULE_DEPENDENCY: 14 mgr modules have failed dependencies
2024-08-23T16:30:00.000263+0000 mon.cephosd1001 (mon.0) 396 : cluster 3     Module 'balancer' has failed dependency: No module named 'bcrypt'
2024-08-23T16:30:00.000324+0000 mon.cephosd1001 (mon.0) 397 : cluster 3     Module 'crash' has failed dependency: No module named 'bcrypt'
2024-08-23T16:30:00.000332+0000 mon.cephosd1001 (mon.0) 398 : cluster 3     Module 'devicehealth' has failed dependency: No module named 'bcrypt'
2024-08-23T16:30:00.000340+0000 mon.cephosd1001 (mon.0) 399 : cluster 3     Module 'iostat' has failed dependency: No module named 'bcrypt'
2024-08-23T16:30:00.000345+0000 mon.cephosd1001 (mon.0) 400 : cluster 3     Module 'nfs' has failed dependency: No module named 'bcrypt'
2024-08-23T16:30:00.000353+0000 mon.cephosd1001 (mon.0) 401 : cluster 3     Module 'orchestrator' has failed dependency: No module named 'bcrypt'
2024-08-23T16:30:00.000357+0000 mon.cephosd1001 (mon.0) 402 : cluster 3     Module 'pg_autoscaler' has failed dependency: No module named 'bcrypt'
2024-08-23T16:30:00.000369+0000 mon.cephosd1001 (mon.0) 403 : cluster 3     Module 'progress' has failed dependency: No module named 'bcrypt'
2024-08-23T16:30:00.000375+0000 mon.cephosd1001 (mon.0) 404 : cluster 3     Module 'prometheus' has failed dependency: No module named 'bcrypt'
2024-08-23T16:30:00.000381+0000 mon.cephosd1001 (mon.0) 405 : cluster 3     Module 'rbd_support' has failed dependency: No module named 'bcrypt'
2024-08-23T16:30:00.000387+0000 mon.cephosd1001 (mon.0) 406 : cluster 3     Module 'restful' has failed dependency: No module named 'bcrypt'
2024-08-23T16:30:00.000391+0000 mon.cephosd1001 (mon.0) 407 : cluster 3     Module 'status' has failed dependency: No module named 'bcrypt'
2024-08-23T16:30:00.000397+0000 mon.cephosd1001 (mon.0) 408 : cluster 3     Module 'telemetry' has failed dependency: No module named 'bcrypt'
2024-08-23T16:30:00.000401+0000 mon.cephosd1001 (mon.0) 409 : cluster 3     Module 'volumes' has failed dependency: No module named 'bcrypt'

That's the same error we saw previously in T362993#9763422.

It will be fixed when we pull in a new point release, but for now I will use the same manual fix.

btullis@cumin1002:~$ sudo cumin A:cephosd 'apt install python3-bcrypt'
btullis@cumin1002:~$ sudo cumin -b 1 -s 10 A:cephosd 'systemctl restart ceph-mgr.target'
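
To verify that the warning clears once the managers have restarted, a standard health query can be used (illustrative, not part of the original log):

sudo ceph health detail    # the MGR_MODULE_DEPENDENCY warning should be gone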

I unset the noout flag, which re-enables automatic data relocation:

btullis@cephosd1002:~$ sudo ceph osd unset noout
noout is unset
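
For completeness, the corresponding flag would have been set before taking each host down with the standard command (not captured in the log above):

sudo ceph osd set noout    # prevent down OSDs from being marked 'out', which would trigger rebalancing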

Now the cluster is all healthy again.

btullis@cephosd1002:/var/log/ceph$ sudo ceph -s
  cluster:
    id:     6d4278e1-ea45-4d29-86fe-85b44c150813
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum cephosd1001,cephosd1002,cephosd1003,cephosd1004,cephosd1005 (age 33m)
    mgr: cephosd1005(active, since 40s), standbys: cephosd1001, cephosd1003, cephosd1004, cephosd1002
    osd: 100 osds: 100 up (since 30m), 100 in (since 3M)
    rgw: 5 daemons active (5 hosts, 1 zones)
 
  data:
    pools:   10 pools, 289 pgs
    objects: 446.33k objects, 135 GiB
    usage:   28 TiB used, 1.1 PiB / 1.1 PiB avail
    pgs:     289 active+clean
 
  io:
    client:   6.3 KiB/s wr, 0 op/s rd, 0 op/s wr