Second iteration of this: T403043: [ceph] 2025-08-27 ceph outage when bringing in a big osd host all at once (cloudcephosd1048)
From IRC:
<andrewbogott> In case there are more after-effects later, here's what just happened:
<andrewbogott> 05:55:21 - I added a new ceph node, cloudcephosd1042, with the cookbook. This went haywire because (due to a race condition in puppet) 1042 was running an old version of ceph, v14 (most of the cluster is running v16)
<andrewbogott> 05:55:21 - Somehow when the v14 client tried to talk to the one v17 client (on cloudcephosd1004) it caused the osds on cloudcephosd1004 to crash
<andrewbogott> 05:55:21 - Again, 'somehow' that crash didn't just cause a rebalance, but caused a bunch of pgs to go read-only. Very weird behavior for only one node going down, but it happened.
<andrewbogott> 05:55:21 - That meant that for a few minutes ceph misbehaved badly enough that some VMs froze, and a lot of toolforge jobs flapped
<andrewbogott> 05:55:22 - As soon as I switched off 1042 and 1004, everything got better
<andrewbogott> 05:55:22 - I restarted some unhappy nfs worker nodes just in case (although I suspect they would've recovered on their own anyway)
This task is to investigate the current status of the cluster and possible next steps.
Current status:
- cloudcephosd1004:
root@cloudcephmon1004:~# ceph osd df cloudcephosd1004
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
64 ssd 1.74657 1.00000 1.7 TiB 927 GiB 922 GiB 60 MiB 4.3 GiB 862 GiB 51.80 1.03 79 up
65 ssd 1.74657 1.00000 1.7 TiB 932 GiB 928 GiB 57 MiB 3.9 GiB 857 GiB 52.09 1.04 79 up
66 ssd 1.74657 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
67 ssd 1.74657 1.00000 1.7 TiB 961 GiB 957 GiB 56 MiB 4.1 GiB 827 GiB 53.74 1.07 82 up
68 ssd 1.74657 1.00000 1.7 TiB 757 GiB 753 GiB 53 MiB 3.7 GiB 1.0 TiB 42.32 0.84 28 up
69 ssd 1.74657 1.00000 1.7 TiB 819 GiB 815 GiB 55 MiB 3.4 GiB 970 GiB 45.77 0.91 13 up
70 ssd 1.74657 1.00000 1.7 TiB 937 GiB 933 GiB 59 MiB 4.0 GiB 852 GiB 52.38 1.04 81 up
71 ssd 1.74657 1.00000 1.7 TiB 962 GiB 958 GiB 59 MiB 3.9 GiB 826 GiB 53.79 1.07 86 up
TOTAL 12 TiB 6.1 TiB 6.1 TiB 398 MiB 27 GiB 6.1 TiB 50.27
MIN/MAX VAR: 0.84/1.07 STDDEV: 4.10

root@cloudcephosd1004:~# systemctl status ceph-osd@* | grep -A 6 '^. *ceph-osd@'
● ceph-osd@69.service - Ceph object storage daemon osd.69
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
Active: active (running) since Thu 2025-08-21 10:36:56 UTC; 3min 20s ago
Process: 190780 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 69 (code=exited, status=0/SUCCESS)
Main PID: 190784 (ceph-osd)
Tasks: 76
Memory: 4.3G
--
● ceph-osd@64.service - Ceph object storage daemon osd.64
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
Active: active (running) since Thu 2025-08-21 03:42:18 UTC; 6h ago
Main PID: 13602 (ceph-osd)
Tasks: 76
Memory: 7.1G
CPU: 54min 42.002s
--
● ceph-osd@65.service - Ceph object storage daemon osd.65
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
Active: active (running) since Thu 2025-08-21 08:41:10 UTC; 1h 59min ago
Main PID: 142238 (ceph-osd)
Tasks: 76
Memory: 7.5G
CPU: 24min 42.939s
--
● ceph-osd@70.service - Ceph object storage daemon osd.70
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
Active: active (running) since Thu 2025-08-21 03:42:18 UTC; 6h ago
Main PID: 13562 (ceph-osd)
Tasks: 76
Memory: 8.3G
CPU: 54min 57.624s
--
● ceph-osd@66.service - Ceph object storage daemon osd.66
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
Active: failed (Result: signal) since Thu 2025-08-21 09:57:00 UTC; 43min ago
Process: 174549 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 66 (code=exited, status=0/SUCCESS)
Process: 174555 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 66 --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
Main PID: 174555 (code=killed, signal=ABRT)
CPU: 1min 233ms
--
● ceph-osd@67.service - Ceph object storage daemon osd.67
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
Active: active (running) since Thu 2025-08-21 10:11:31 UTC; 28min ago
Process: 180510 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 67 (code=exited, status=0/SUCCESS)
Main PID: 180516 (ceph-osd)
Tasks: 76
Memory: 6.9G
--
● ceph-osd@71.service - Ceph object storage daemon osd.71
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
Active: active (running) since Thu 2025-08-21 03:42:18 UTC; 6h ago
Main PID: 13541 (ceph-osd)
Tasks: 76
Memory: 6.8G
CPU: 44min 1.316s
--
● ceph-osd@68.service - Ceph object storage daemon osd.68
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
Active: active (running) since Thu 2025-08-21 10:33:43 UTC; 6min ago
Process: 189234 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 68 (code=exited, status=0/SUCCESS)
Main PID: 189239 (ceph-osd)
Tasks: 76
Memory: 6.5G

- cloudcephosd1042, not in the cluster
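The `ceph osd df` output above shows osd.66 down with reweight 0. The same information is available machine-readably from `ceph osd tree --format json`, which a cookbook or monitoring check could consume instead of scraping the table. A minimal sketch (the JSON field names are assumed to match the pacific-era output, where osd nodes carry a `status` field):

```python
import json

def down_osds(tree_json: str) -> list[str]:
    """Names of OSDs reported down in `ceph osd tree --format json` output."""
    tree = json.loads(tree_json)
    return [
        node["name"]
        for node in tree.get("nodes", [])
        if node.get("type") == "osd" and node.get("status") == "down"
    ]

# Trimmed-down sample of the JSON shape; on a mon host the real input would
# come from `ceph osd tree --format json`.
sample = json.dumps({
    "nodes": [
        {"id": -5, "name": "cloudcephosd1004", "type": "host"},
        {"id": 65, "name": "osd.65", "type": "osd", "status": "up"},
        {"id": 66, "name": "osd.66", "type": "osd", "status": "down"},
    ]
})
print(down_osds(sample))  # -> ['osd.66']
```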
Open questions
- Was SAL (the Server Admin Log) down during the incident?
- Why are osds on cloudcephosd1004 still failing to come up?
- Why do cloudcephosd1043/44/47 show ping and jumbo-frame packet loss?
- Why did the addition of the v14 node cause the whole cluster to halt?
- <add more>
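For the jumbo-frame question above, one quick check is an unfragmentable ICMP ping sized to a 9000-byte MTU: 8972 bytes of payload plus the 20-byte IP and 8-byte ICMP headers. If a switch or host along the path is not configured for jumbo frames, these pings are dropped while normal-sized pings succeed. A sketch (the host list is illustrative):

```shell
# Check whether 9000-byte jumbo frames survive the path to each suspect host.
# -M do forbids fragmentation; 8972 = 9000 - 20 (IP header) - 8 (ICMP header).
for host in cloudcephosd1043 cloudcephosd1044 cloudcephosd1047; do
    if ping -c 3 -W 2 -M do -s 8972 "$host" > /dev/null 2>&1; then
        echo "$host: jumbo frames OK"
    else
        echo "$host: jumbo frames LOST (or host unreachable)"
    fi
done
```

Comparing this against a plain `ping "$host"` distinguishes "jumbo frames dropped" from "host down entirely".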
Possible improvements
- Track the expected ceph version in hiera/the cookbook, so the cookbook refuses to bring up an osd running a different version
- Improve puppet so it installs the correct ceph version from the repos, instead of just ensuring the package is present (because puppet currently runs before the component repo is set up, it installs an older version)
- <add more>
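The first improvement above could take the shape of a pre-flight check in the cookbook: parse the version ceph reports on the new host, compare its major version against what the cluster is expected to run, and abort before any OSD is started. A sketch, assuming the expected major version comes from hiera/cookbook config and the host version string has the usual `ceph version X.Y.Z (...)` form:

```python
def parse_major(version_string: str) -> int:
    """Extract the major version from e.g. 'ceph version 16.2.11 (...) pacific'."""
    for token in version_string.split():
        if token[0].isdigit():
            return int(token.split(".")[0])
    raise ValueError(f"no version found in: {version_string!r}")

def check_osd_version(host_version: str, expected_major: int) -> None:
    """Refuse (raise) if the host would join the cluster on the wrong major version."""
    major = parse_major(host_version)
    if major != expected_major:
        raise RuntimeError(
            f"refusing to bring up OSDs: host runs ceph v{major}, "
            f"cluster expects v{expected_major}"
        )
```

Run early in the add-osd cookbook, a check like this would have turned the v14/v16 mismatch into a hard error instead of an outage.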


