
2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures
Closed, Resolved · Public

Description

  • Toolforge tools were not responding to HTTP requests (tools-proxy-9 was returning an error page)
  • We found that Ceph had been having intermittent issues since the previous night, after some hosts were upgraded to Bookworm
  • This caused intermittent failures in both Toolforge and Cloud VPS
  • We downgraded the six hosts that had previously been upgraded to Bookworm back to Bullseye (reimage sketch below):
    • cloudcephosd1006
    • cloudcephosd1007
    • cloudcephosd1008
    • cloudcephosd1035
    • cloudcephosd1036
    • cloudcephosd1037
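
A rough sketch of the downgrade procedure, based on the reimage cookbook entries later in this task (illustrative invocation; the exact flags used may have differed). Hosts were reimaged one at a time, letting Ceph finish rebalancing before stopping the next one:

user@cumin2002:~$ sudo cookbook sre.hosts.reimage --os bullseye cloudcephosd1006.eqiad.wmnet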

Incident doc: https://docs.google.com/document/d/1CLY_iZyXDTyJEl4fKYeU1aRSNsheO9-TZcjyW9wFyEk/edit?tab=t.0#heading=h.nz4dlhpgbsjm

Incident Report: https://wikitech.wikimedia.org/wiki/Incidents/2025-07-11_WMCS_Ceph_issues_causing_Toolforge_and_Cloud_VPS_failures

Timeline (UTC)

01:59 PROBLEM - SSH on cloudcephosd1035 is CRITICAL
02:08 RECOVERY - SSH on cloudcephosd1035 is OK
04:20 PROBLEM - SSH on cloudcephosd1036 is CRITICAL
05:02 RECOVERY - SSH on cloudcephosd1036 is OK
05:28 PROBLEM - SSH on cloudcephosd1037 is CRITICAL
05:31 RECOVERY - SSH on cloudcephosd1037 is OK
07:13 many cloud-vps hosts reported down by prometheus, but nobody noticed
07:14 PROBLEM - SSH on cloudcephosd1037 is CRITICAL
07:16 FIRING: CephSlowOps: Ceph cluster in eqiad has 1451 slow ops
07:19 FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra
07:20 cloud-vps back to normal (no hosts reported down)
07:21 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 779 slow ops
07:24 RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra
07:27 FIRING: CephSlowOps: Ceph cluster in eqiad has 908 slow ops
07:30 RECOVERY - SSH on cloudcephosd1037 is OK
07:32 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 908 slow ops
08:08 many cloud-vps hosts again reported down
08:10 FIRING: CephSlowOps: Ceph cluster in eqiad has 1678 slow ops
08:13 PROBLEM - SSH on cloudcephosd1036 is CRITICAL
08:18 wmcs-dnsleaks fails on cloudcontrol1007 (possibly unrelated)
08:18 FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra
08:19 cloud-vps back to normal (no hosts reported down)
08:20 Manuel reports switchmaster.toolforge.org is down
08:23 RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra
08:23 FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4)
08:27 lucas.werkmeister@wikimedia.de reports all tools are returning an error from tools-proxy-9
08:39 <lucaswerkmeister> I can SSH into tools-proxy-9, the only failed systemd unit is logrotate which judging by the journal has been broken for a long time, probably not related
08:44 <lucaswerkmeister> I think tools-proxy-9 times out trying to reach k8s.tools.eqiad1.wikimedia.cloud in turn
08:44 <lucaswerkmeister> I can SSH into that one too, no high load there either
08:52 Incident opened. Francesco Negri becomes IC.
08:55 Toolforge is working again. No action was taken.
08:58 RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4)
09:01 Incident is resolved (temporarily)
09:25 Francesco Negri starts wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-77, tools-k8s-worker-nfs-68, tools-k8s-worker-nfs-37, as they were alerting with “many processes in D state”
09:28 RECOVERY - SSH on cloudcephosd1036 is OK
09:39 PROBLEM - SSH on cloudcephosd1035 is CRITICAL
09:42 RECOVERY - SSH on cloudcephosd1035 is OK
10:20 PROBLEM - SSH on cloudcephosd1036 is CRITICAL
10:21 FIRING: CephSlowOps: Ceph cluster in eqiad has 847 slow ops
10:24 RECOVERY - SSH on cloudcephosd1036 is OK
10:26 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 1386 slow ops
10:28 FIRING: CephSlowOps: Ceph cluster in eqiad has 5134 slow ops
10:33 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 1272 slow ops
11:17 <andrewbogott> I'm still half asleep and haven't read the backscroll, but my emails suggest that ceph pacific + bookworm + ceph traffic is a bad combination.
11:18 <andrewbogott> So probably the fix for this is for me to downgrade those hosts back to bullseye.
11:23 <andrewbogott> The bookworm hosts are 1006-1008,1035-1037
11:26 cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037
11:35 PROBLEM - SSH on cloudcephosd1008 is CRITICAL
11:41 RECOVERY - SSH on cloudcephosd1008 is OK
11:41 FIRING: WidespreadInstanceDown
11:46 RESOLVED: WidespreadInstanceDown
12:13 PROBLEM - SSH on cloudcephosd1035 is CRITICAL
12:17 cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037
12:18 FIRING: WidespreadInstanceDown
12:20 RECOVERY - SSH on cloudcephosd1035 is OK
12:20 cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037
12:23 RESOLVED: WidespreadInstanceDown
12:44 cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037
13:11 cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host cloudcephosd1037
14:12 FIRING: WidespreadInstanceDown
14:24 Reopening the incident
14:29 <dhinus> things seem to get worse after 14:12 UTC
14:30 <andrewbogott> 1007 is frozen right now. So we /do/ have two [ceph hosts] down at once, which could maybe explain current bad behavior.
14:41 <dhinus> we have now 9 OSDs down (compared to 16 before) – [one ceph host recovered]
14:58 (Slack) https://wikipedialibrary.wmflabs.org/ is down right now too.
15:08 (Slack) Seeing failures on catalyst environments too
15:10 Many Cloud VPS VMs are down (29% of VMs in project “tools”)
15:14 <dhinus> many cloud vps VMs are still not working, and are not recovering for <reasons>
15:16 <dhinus> ceph IOPS are at about 50% of what they were this morning
15:16 <andrewbogott> ok, 1037 is up, now there's just a bit of pg shuffling to do before we can stop another host
15:16 <dhinus> do you know why we still have 1 OSD down?
15:18 <andrewbogott> I just checked, that's on 1013 which as far as I know hasn't suffered any recent maintenance. The down OSD is associated with a volume that doesn't appear in lsblk so... a mystery but /probably/ an unrelated one.
15:19 <andrewbogott> ceph is still recovering, down to 514 pgs
15:23 <dhinus> I tried manually rebooting a couple of VMs, and they do come back... but it will take a looooong time if we need to reboot all manually
15:29 <dhinus> count(up{job="node"} == 0) is finally looking good – All VMs are now reporting as healthy
15:34 <+wm-bb> <Vincent> My tool is up and running now, thank you :)
15:47 <+wm-bb> <Yetkin> My tool is up and running as well 😊
15:50 Handing off IC to Andrew Bogott
16:09 alertmanager is complaining about OpenstackAPIResponse: slow response times, only for designate-api
16:11 <andrewbogott> I'm restarting designate services
17:25 Finished reimaging of cloudcephosd1035. Remaining Bookworm OSD nodes are cloudcephosd100[6-8].
18:33 All OSDs back to Bullseye, ceph shows no misplaced objects.
18:33 Incident closed

Event Timeline


Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host cloudcephosd1037.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host cloudcephosd1037.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1037 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1037.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host cloudcephosd1037.eqiad.wmnet with OS bullseye

There is an issue at present with reimaging these cloudcephosd nodes back to bullseye.
The problem arises because of the remove_os_md() function that is executed.

The idea of this function is to remove the LVM and MD signatures that are present on the OS drives, while leaving the other LVM devices intact.
The issue at the moment is that the version of lvremove in bullseye does not support the --devices flag that we use in the script.

This causes the partman_early_command.sh script to exit with error 3 during the debian-installer phase.
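
To illustrate, a sketch of the kind of call that fails (the exact contents of remove_os_md() are not reproduced here, so treat this command line as an approximation):

# bookworm's lvm2 understands --devices, which limits the operation to the
# OS RAID device so the ceph-* VGs on the data disks are left untouched:
lvremove --devices /dev/md2 -f cloudcephosd1037-vg
# bullseye's lvm2 (as shipped in its debian-installer) rejects the unknown
# --devices flag and exits non-zero, so partman_early_command.sh aborts
# with error 3.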

As a temporary workaround, you can use the following commands to remove the LVM and MD metadata manually. This can be done while the installer is sitting at the error screen.

Open an ssh console on the host from a cumin host:

btullis@cumin1003:~$ sudo install_console cloudcephosd1037.eqiad.wmnet

Check the logical volumes:

~ # lvs
  LV                                             VG                                        Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  osd-block-ff9a6240-ef29-438c-9d12-1dde40d24c62 ceph-0c90d232-3465-4951-a24e-f0f5d4edd120 -wi-------   3.49t                                                    
  osd-block-b800bb57-7f57-44a3-9f98-9f2cde63da6f ceph-332eb7d8-19d3-496d-b85a-9a0bbd37a812 -wi-------   3.49t                                                    
  osd-block-9117b4cb-96c4-4c37-8780-457cdc4bdb50 ceph-66ddadcb-cc75-4a45-996c-71606a6e7d19 -wi-------   3.49t                                                    
  osd-block-9d064414-4dfb-43ff-b10b-5681a27a6ac9 ceph-670f9ac0-6ebd-4361-9084-d42f9f564fdc -wi-------   3.49t                                                    
  osd-block-cf1daa79-9379-41cc-a794-4bbf47e6f659 ceph-82100b83-836e-4f16-b1bd-b4b025e0422a -wi-------   3.49t                                                    
  osd-block-cf15746c-c097-41c3-af27-a9599456657f ceph-b23b7286-c0e5-4654-9592-c4813e2286fb -wi-------   3.49t                                                    
  osd-block-e86475d1-061c-4a23-a8ac-27b3063b06f1 ceph-c36226ca-cc76-4870-8e9f-5175b8a4c019 -wi-------   3.49t                                                    
  osd-block-778225bf-608b-4b13-bf66-398e58558134 ceph-cc1da8ae-4372-4b93-95e2-bd198c24da0d -wi-------   3.49t                                                    
  root                                           cloudcephosd1037-vg                       -wi------- <69.85g                                                    
  srv                                            cloudcephosd1037-vg                       -wi------- <93.13g                                                    
  var_lib_ceph                                   cloudcephosd1037-vg                       -wi------- 186.26g

Check the volume groups:

~ # vgs                       
  VG                                        #PV #LV #SN Attr   VSize   VFree  
  ceph-0c90d232-3465-4951-a24e-f0f5d4edd120   1   1   0 wz--n-   3.49t      0 
  ceph-332eb7d8-19d3-496d-b85a-9a0bbd37a812   1   1   0 wz--n-   3.49t      0 
  ceph-66ddadcb-cc75-4a45-996c-71606a6e7d19   1   1   0 wz--n-   3.49t      0 
  ceph-670f9ac0-6ebd-4361-9084-d42f9f564fdc   1   1   0 wz--n-   3.49t      0 
  ceph-82100b83-836e-4f16-b1bd-b4b025e0422a   1   1   0 wz--n-   3.49t      0 
  ceph-b23b7286-c0e5-4654-9592-c4813e2286fb   1   1   0 wz--n-   3.49t      0 
  ceph-c36226ca-cc76-4870-8e9f-5175b8a4c019   1   1   0 wz--n-   3.49t      0 
  ceph-cc1da8ae-4372-4b93-95e2-bd198c24da0d   1   1   0 wz--n-   3.49t      0 
  cloudcephosd1037-vg                         1   3   0 wz--n- 441.41g <92.18g

Check the physical volumes:

~ # pvs
  PV         VG                                        Fmt  Attr PSize   PFree  
  /dev/md2   cloudcephosd1037-vg                       lvm2 a--  441.41g <92.18g
  /dev/sdb   ceph-0c90d232-3465-4951-a24e-f0f5d4edd120 lvm2 a--    3.49t      0 
  /dev/sdc   ceph-cc1da8ae-4372-4b93-95e2-bd198c24da0d lvm2 a--    3.49t      0 
  /dev/sdd   ceph-66ddadcb-cc75-4a45-996c-71606a6e7d19 lvm2 a--    3.49t      0 
  /dev/sde   ceph-c36226ca-cc76-4870-8e9f-5175b8a4c019 lvm2 a--    3.49t      0 
  /dev/sdf   ceph-b23b7286-c0e5-4654-9592-c4813e2286fb lvm2 a--    3.49t      0 
  /dev/sdg   ceph-332eb7d8-19d3-496d-b85a-9a0bbd37a812 lvm2 a--    3.49t      0 
  /dev/sdh   ceph-82100b83-836e-4f16-b1bd-b4b025e0422a lvm2 a--    3.49t      0 
  /dev/sdi   ceph-670f9ac0-6ebd-4361-9084-d42f9f564fdc lvm2 a--    3.49t      0

Check the mdraid devices:

~ # cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sda1[0] sdj1[1]
      974848 blocks super 1.2 [2/2] [UU]
      
md1 : active raid1 sda2[0] sdj2[1]
      4877312 blocks super 1.2 [2/2] [UU]
      
md2 : active raid1 sda3[0] sdj3[1]
      462859264 blocks super 1.2 [2/2] [UU]
      bitmap: 0/4 pages [0KB], 65536KB chunk

Delete the logical volumes:

~ # lvremove cloudcephosd1037-vg/root
  Logical volume "root" successfully removed
~ # lvremove cloudcephosd1037-vg/srv
  Logical volume "srv" successfully removed
~ # lvremove cloudcephosd1037-vg/var_lib_ceph
  Logical volume "var_lib_ceph" successfully removed

Delete the volume group:

~ # vgremove cloudcephosd1037-vg
  Volume group "cloudcephosd1037-vg" successfully removed

Delete the physical volume metadata:

~ # pvremove /dev/md2
  Labels on physical volume "/dev/md2" successfully wiped.

Stop all three of the mdraid devices:

~ # mdadm --stop /dev/md0
mdadm: stopped /dev/md0
~ # mdadm --stop /dev/md1
mdadm: stopped /dev/md1
~ # mdadm --stop /dev/md2
mdadm: stopped /dev/md2

Remove the MD metadata from each of the devices included in the /proc/mdstat output shown above:

~ # mdadm --zero-superblock /dev/sda1
~ # mdadm --zero-superblock /dev/sdj1
~ # mdadm --zero-superblock /dev/sda2
~ # mdadm --zero-superblock /dev/sdj2
~ # mdadm --zero-superblock /dev/sda3
~ # mdadm --zero-superblock /dev/sdj3

Now you can go back to the d-i console and retry the last step, or restart the cookbook, if you prefer.

fnegri renamed this task from 2025-07-11 Toolforge tools not responding to 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures. Jul 11 2025, 2:53 PM
fnegri added a project: Cloud-VPS.
fnegri raised the priority of this task from High to Unbreak Now!. Jul 11 2025, 2:59 PM
fnegri updated the task description.
fnegri updated the task description.

Mentioned in SAL (#wikimedia-releng) [2025-07-11T15:04:50Z] <bd808> Hard reboot of deployment-acme-chief05 (T399281)

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host cloudcephosd1037.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1037 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507111448_btullis_1285849_cloudcephosd1037.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-releng) [2025-07-11T15:34:58Z] <bd808> Hard reboot of deployment-webperf21 (T399281)

Andrew updated the task description.
fnegri updated the task description.

Change #1168227 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Revert "cloudceph osd.yaml: update some nic names for Bookworm reimages"

https://gerrit.wikimedia.org/r/1168227

Change #1168227 merged by Andrew Bogott:

[operations/puppet@production] Revert "cloudceph osd.yaml: update some nic names for Bookworm reimages"

https://gerrit.wikimedia.org/r/1168227

dcaro lowered the priority of this task from Unbreak Now! to High. Jul 14 2025, 9:29 AM
fnegri updated the task description.

This incident is resolved. Separate follow-up tasks will be created; refer to the Incident Report at https://wikitech.wikimedia.org/wiki/Incidents/2025-07-11_WMCS_Ceph_issues_causing_Toolforge_and_Cloud_VPS_failures

Looking into the ceph crash list, the crashes were marked as acked for some reason (which is why I did not notice them when running ceph crash ls-new). There are a lot of reports from the osds; I'll try to do a summary:

root@cloudcephosd1006:~# ceph crash ls | grep 2025-07-11 | wc
     44      88    3564

root@cloudcephosd1006:~# for crash_id in $(ceph crash ls | grep 2025-07-11 | awk '{print $1}'); do ceph crash info "$crash_id" >> 2025_07_11_crashes; done

root@cloudcephosd1006:~# cat 2025_07_11_crashes  | jq '.assert_msg' | sed -e 's/thread [^ ]*/thread <tread-id>/g' -e 's/time .*\./time <timestamp>./' | sort | uniq -c
      1 null
     43 "./src/os/bluestore/BlueStore.cc: In function 'int BlueStore::_upgrade_super()' thread <tread-id> time <timestamp>.cc: 10647: FAILED ceph_assert(ondisk_format > 0)\n"

So all but one failed with the same error, ceph_assert(ondisk_format > 0). Looking into it.
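
For context, ceph crash ls-new only lists crashes that have not been archived ("acked"); once archived, a crash only shows up in ceph crash ls. The relevant commands, for reference:

# List every known crash, archived or not
ceph crash ls
# List only new (not yet archived) crashes -- empty if everything was acked
ceph crash ls-new
# Show the full report for a single crash
ceph crash info <crash-id>
# Archive one crash, or all of them at once
ceph crash archive <crash-id>
ceph crash archive-all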

For timing reference, all of these happened from ~15:09:21 UTC onwards. The ids:

root@cloudcephosd1006:~# ceph crash ls | grep 2025-07-11 
2025-07-11T07:29:55.114662Z_de9c9567-c13e-4f09-8af3-4078727d413b  osd.286       
2025-07-11_15:09:21.022943Z_f64a817b-c43e-40ee-8a2c-262947cacf7f  osd.280       
2025-07-11_15:09:21.717538Z_d92e5ae6-d0a0-4a8c-95ca-e89f73b8dfec  osd.281       
2025-07-11_15:09:22.524678Z_13eb998a-1a29-4905-a47b-2fc4aa8e3f3d  osd.280       
2025-07-11_15:09:22.560211Z_a536325f-8721-452e-a1e8-1b9c55638f80  osd.282       
2025-07-11_15:09:23.073150Z_691744bc-53b8-4464-a647-2ac24a62688f  osd.281       
2025-07-11_15:09:23.412997Z_b4cc9122-a559-4111-ba79-17d0c8875602  osd.283       
2025-07-11_15:09:24.036850Z_1623ca36-df2d-423c-ae8d-c25b7eeec9bb  osd.280       
2025-07-11_15:09:24.072548Z_94916207-f346-4726-b380-1da5097a6d6f  osd.282       
2025-07-11_15:09:24.192324Z_7f23d419-40ff-4667-9652-9cd977632260  osd.284       
2025-07-11_15:09:24.513078Z_758f71d9-d37a-4b42-8f53-270565b3483c  osd.281       
2025-07-11_15:09:25.054847Z_a5c4eef5-1ac2-48b3-94b2-feb1ad5f86e3  osd.283       
2025-07-11_15:09:25.081087Z_5f576315-6898-4b2a-ba6d-55110d2ad498  osd.285       
2025-07-11_15:09:25.408324Z_4e1be3cc-fbc5-4068-9b6f-0a8bf00567b6  osd.282       
2025-07-11_15:09:25.429016Z_a61d5dfd-5843-4875-ada7-f96ff38fdc1c  osd.280       
2025-07-11_15:09:25.560390Z_33289e9e-541d-4f95-ac28-55c5816e43e4  osd.284       
2025-07-11_15:09:25.876452Z_25f2b232-f174-4ca2-93a7-1118f9e2b6dd  osd.286       
2025-07-11_15:09:26.049898Z_60fc3256-57ea-4520-8402-d60d8567035f  osd.281       
2025-07-11_15:09:26.392617Z_269d6e4d-d4df-4511-85cd-49c381c50afb  osd.283       
2025-07-11_15:09:26.736665Z_8160c42e-4057-49ae-8d94-58e1478c9229  osd.287       
2025-07-11_15:09:26.737190Z_4858aefd-791f-4ebb-9590-43653f458d9e  osd.285       
2025-07-11_15:09:27.016340Z_dc4166ca-df37-48a9-94d9-3627de411904  osd.282       
2025-07-11_15:09:27.016644Z_e004f5bd-8592-4924-9586-636b8b719343  osd.284       
2025-07-11_15:09:27.016912Z_a7a87acf-9e47-40eb-83d6-cc32a3aa6348  osd.280       
2025-07-11_15:09:27.212580Z_f86096d3-221e-4400-ae5b-fa8b2811ce64  osd.286       
2025-07-11_15:09:27.565044Z_7b06dac9-4bc8-4e2e-a985-423082216117  osd.281       
2025-07-11_15:09:27.744515Z_8fe6ae21-80be-4b51-b296-ba986a4c052a  osd.283       
2025-07-11_15:09:28.136965Z_5c37b2bc-8a84-4a3a-a693-f98c068dcfe8  osd.285       
2025-07-11_15:09:28.144337Z_15f8502f-10c5-4024-9a91-1bd5443b599f  osd.287       
2025-07-11_15:09:28.428500Z_2362d50b-d8e3-40e0-8fda-d5a548296435  osd.284       
2025-07-11_15:09:28.456476Z_890d7d6c-7255-49e3-afc7-f1d7b920a563  osd.280       
2025-07-11_15:09:28.460155Z_149564d6-5b69-4ff0-9824-b4dfc2cf2927  osd.282       
2025-07-11_15:09:28.704552Z_80a456da-d0f6-4077-944f-e52135c2e8a6  osd.286       
2025-07-11_15:09:28.905199Z_ab6d7204-d29f-49f5-8b7a-690aeb0b74ab  osd.281       
2025-07-11_15:09:29.268514Z_18e450dd-ba92-427e-9a57-4a2c163c38fc  osd.283       
2025-07-11_15:09:29.501205Z_4bd35d5d-373b-4eeb-a820-85e1e1caa625  osd.285       
2025-07-11_15:09:29.544363Z_4524fbec-03c6-46ac-82d9-1ae02c671cd6  osd.287       
2025-07-11_15:09:29.871131Z_40e5c2e8-cbda-4e24-bfd1-9f2ec974ab8f  osd.280       
2025-07-11_15:09:29.872227Z_c26c978e-9c42-4c73-a261-e3fdee7eb863  osd.282       
2025-07-11_15:09:29.872263Z_c66e8627-322f-4c9b-98b3-e2e592eacece  osd.284       
2025-07-11_15:09:29.977108Z_76363d59-1b69-41ac-8c7e-adcfebaf2818  osd.281       
2025-07-11_15:09:30.100522Z_9d8fb96c-5b11-4111-8f6e-cf65a9e6e960  osd.286       
2025-07-11_15:09:30.376815Z_30d94198-dd91-44b2-a639-121cc596e99e  osd.283       
2025-07-11_15:09:30.849296Z_c6a47cf2-70a0-405e-ac28-009f33bdf306  osd.285

It's also interesting to note that there were only a handful of osds involved:

root@cloudcephosd1006:~# ceph crash ls | grep 2025-07-11  | awk '{print $2}' | sort | uniq -c
      7 osd.280
      7 osd.281
      6 osd.282
      6 osd.283
      5 osd.284
      5 osd.285
      5 osd.286
      3 osd.287

And all on just one node:

root@cloudcephosd1006:~# for osd in $(ceph crash ls | grep 2025-07-11  | awk '{print $2}' | sort | uniq ); do ceph osd find $osd | jq '.host'; done
"cloudcephosd1037"
"cloudcephosd1037"
"cloudcephosd1037"
"cloudcephosd1037"
"cloudcephosd1037"
"cloudcephosd1037"
"cloudcephosd1037"
"cloudcephosd1037"

Looking at the logs of that node, it had just come up, and the osds failed right away:

Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]: 2025-07-11 15:09:21.013 7f178ef7ec00 -1 bluefs _replay 0x0: stop: unrecognized op 12
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]: 2025-07-11 15:09:21.013 7f178ef7ec00 -1 bluefs mount failed to replay log: (5) Input/output error
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]: 2025-07-11 15:09:21.013 7f178ef7ec00 -1 bluefs _replay 0x0: stop: unrecognized op 12
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]: 2025-07-11 15:09:21.013 7f178ef7ec00 -1 bluefs mount failed to replay log: (5) Input/output error
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]: 2025-07-11 15:09:21.013 7f178ef7ec00 -1 bluestore(/var/lib/ceph/osd/ceph-280) _open_bluefs failed bluefs mount: (5) Input/output error
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]: 2025-07-11 15:09:21.013 7f178ef7ec00 -1 bluestore(/var/lib/ceph/osd/ceph-280) _open_bluefs failed bluefs mount: (5) Input/output error
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]: 2025-07-11 15:09:21.017 7f178ef7ec00  1 bluestore(/var/lib/ceph/osd/ceph-280) _upgrade_super from 0, latest 2
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]: ./src/os/bluestore/BlueStore.cc: In function 'int BlueStore::_upgrade_super()' thread 7f178ef7ec00 time 2025-07-11 15:09:21.018836
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]: ./src/os/bluestore/BlueStore.cc: 10647: FAILED ceph_assert(ondisk_format > 0)
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]:  ceph version 14.2.21 (5ef401921d7a88aea18ec7558f7f9374ebd8f5a6) nautilus (stable)
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12f) [0x56234df3c1b1]
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]:  2: (()+0x4b633c) [0x56234df3c33c]
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]:  3: (BlueStore::_upgrade_super()+0x51b) [0x56234e4ad39b]
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]:  4: (BlueStore::_mount(bool, bool)+0x57d) [0x56234e50d05d]
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]:  5: (OSD::init()+0x4b3) [0x56234e04fc03]
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]:  6: (main()+0x30ec) [0x56234df96c9c]
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]:  7: (__libc_start_main()+0xea) [0x7f178f1bad7a]
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]:  8: (_start()+0x2a) [0x56234dfd009a]
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]: *** Caught signal (Aborted) **
Jul 11 15:09:21 cloudcephosd1037 ceph-osd[10476]:  in thread 7f178ef7ec00 thread_name:ceph-osd

The logs also show errors connecting to the cluster while trying to post more crash reports:

Jul 11 15:13:46 cloudcephosd1037 ceph-crash[1212]: WARNING:__main__:post /var/lib/ceph/crash/2025-07-11_15:09:27.565044Z_7b06dac9-4bc8-4e2e-a985-423082216117 as client.crash failed: b'[errno 2] error connecting to the cluster\n'

After that, the osd daemons were able to start again.

Note that all of this happened after the cluster had already started misbehaving, and after reimaging back to the previous OS version.

We don't have any more system logs from the incident (the reimages deleted all traces). There are some logstash logs for the ceph daemons though; looking.

There are some interesting logs. For example, the mon already notices some slow pings at 07:06:

Jul 11, 2025 @ 07:06:33.202 mon.cloudcephmon1005 mon.0 192080 : Health check update: Slow OSD heartbeats on front (longest 1962.367ms) (OSD_SLOW_PING_TIME_FRONT)

Jul 11, 2025 @ 07:06:36.240 mon.cloudcephmon1005 mon.0 192087 : Health check update: Slow OSD heartbeats on back (longest 1954.181ms) (OSD_SLOW_PING_TIME_BACK)

The first crash puts the cluster in HEALTH_WARN:

Jul 11, 2025 @ 07:10:00.360 mon.cloudcephmon1005 mon.0 192208 : Health detail: HEALTH_WARN noout flag(s) set; Slow OSD heartbeats on back (longest 1586.743ms); Slow OSD heartbeats on front (longest 1587.302ms); 1 daemons have recently crashed

It is interesting that the noout flag was set: that prevents the cluster from marking OSDs out, which blocks proper rebalancing.

The first slow heartbeats are to cloudcephosd1007:

mon.cloudcephmon1005 mon.0 192211 :     Slow OSD heartbeats on back from osd.68 [F4] to osd.4 [C8] 1586.743 msec possibly improving
root@cloudcephosd1037:~# ceph osd find 4 | jq '.host'
"cloudcephosd1007"
root@cloudcephosd1037:~# ceph osd find 68 | jq '.host'
"cloudcephosd1004"

Then osd.42 (on cloudcephosd1006) crashed:

Jul 11, 2025 @ 07:10:00.363 mon.cloudcephmon1005 mon.0 192235 :     osd.42 crashed on host cloudcephosd1006 at 2025-07-10T15:13:04.678521Z

And slow ops started to spread. noout was set; we might want to use a more specific noout next time (see the sketch below, and https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-osd/#stopping-without-rebalancing).

I think the reason the cluster did not self-heal is that with noout set it did not rebalance, and with multiple concurrent failures it could not recover on its own.
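
For reference, a sketch of more targeted alternatives to the cluster-wide noout (standard ceph CLI; osd.42 and the host name are examples taken from this incident):

# Cluster-wide, as it was set during the incident: no OSD is ever marked out
ceph osd set noout
# Targeted: keep a single OSD from being marked out
ceph osd add-noout osd.42
# Targeted: all OSDs under one CRUSH bucket, e.g. the host being reimaged
ceph osd set-group noout cloudcephosd1037
# Undo
ceph osd rm-noout osd.42
ceph osd unset-group noout cloudcephosd1037
ceph osd unset noout
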

Change #1170341 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudceph osd.yaml: update nic names for 1006

https://gerrit.wikimedia.org/r/1170341

Change #1170342 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] prometheus-node-pinger: fix the script to return 1 on failure

https://gerrit.wikimedia.org/r/1170342

Change #1170342 merged by David Caro:

[operations/puppet@production] prometheus-node-pinger: fix the script to return 1 on failure

https://gerrit.wikimedia.org/r/1170342

Change #1170341 merged by Andrew Bogott:

[operations/puppet@production] cloudceph osd.yaml: update nic names for 1006

https://gerrit.wikimedia.org/r/1170341

Change #1170346 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudceph osd.yaml: update nic names for 1006 again

https://gerrit.wikimedia.org/r/1170346

Change #1170346 merged by Andrew Bogott:

[operations/puppet@production] cloudceph osd.yaml: update nic names for 1006 again

https://gerrit.wikimedia.org/r/1170346

Sorry for the spam, trying to dump info, I'll try to summarize later.

So, the ceph crash timeline:

2025-07-10_04:16 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1006.eqiad.wmnet with OS bookworm
2025-07-10_15:13 -> osd.120 (cloudcephosd1006, bookworm) crashes with "hit suicide timeout"
2025-07-10_19:00 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
2025-07-10_19:00 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']

2025-07-11_04:26 -> slow ops start mon.cloudcephmon1005 mon.0 184377 : Health check failed: 13 slow ops, oldest one blocked for 72 sec, osd.302-cloudcephosd1036 has slow ops (SLOW_OPS)
2025-07-11_04:26 -> mon.cloudcephmon1005 mon.0 184388 : Health check update: 23 slow ops, oldest one blocked for 89 sec, osd.299 cloudcephosd1036 has slow ops (SLOW_OPS)
2025-07-11_07:12 -> many slow ops mon.cloudcephmon1005 mon.0 193272 : Health check failed: 9 slow ops, oldest one blocked for 31 sec, daemons [osd.0,osd.100,osd.121,osd.144,osd.150,osd.153,osd.177,osd.192,osd.212,osd.220] many hosts ... have slow ops. (SLOW_OPS)

root@cloudcephosd1037:~# osds=osd.0,osd.100,osd.121,osd.144,osd.150,osd.153,osd.177,osd.192,osd.212,osd.220; for osd in $(echo "$osds" | sed -e 's/,/\n/g' | cut -d '.' -f2); do echo -n "osd: $osd - host:"; ceph osd find $osd | jq '.host'; done | sort | uniq
osd: 0 - host:"cloudcephosd1007"
osd: 100 - host:"cloudcephosd1021"
osd: 121 - host:"cloudcephosd1019"
osd: 144 - host:"cloudcephosd1022"
osd: 150 - host:"cloudcephosd1023"
osd: 153 - host:"cloudcephosd1024"
osd: 177 - host:"cloudcephosd1029"
osd: 192 - host:"cloudcephosd1039"
osd: 212 - host:"cloudcephosd1027"
osd: 220 - host:"cloudcephosd1028"

More messages and osds join, including (but not limited to):

osd: 0 - host:"cloudcephosd1007"
osd: 100 - host:"cloudcephosd1021"
osd: 106 - host:"cloudcephosd1017"
osd: 107 - host:"cloudcephosd1017"
osd: 110 - host:"cloudcephosd1017"
osd: 116 - host:"cloudcephosd1018"
osd: 117 - host:"cloudcephosd1018"
osd: 119 - host:"cloudcephosd1018"
osd: 11 - host:"cloudcephosd1008"
osd: 120 - host:"cloudcephosd1019"
osd: 122 - host:"cloudcephosd1019"
osd: 124 - host:"cloudcephosd1019"
osd: 151 - host:"cloudcephosd1023"
osd: 154 - host:"cloudcephosd1024"
osd: 158 - host:"cloudcephosd1024"
osd: 194 - host:"cloudcephosd1039"
osd: 204 - host:"cloudcephosd1026"
osd: 206 - host:"cloudcephosd1026"
osd: 214 - host:"cloudcephosd1027"
osd: 218 - host:"cloudcephosd1028"
osd: 249 - host:"cloudcephosd1035"
osd: 253 - host:"cloudcephosd1035"
osd: 254 - host:"cloudcephosd1035"
osd: 279 - host:"cloudcephosd1035"
osd: 284 - host:"cloudcephosd1037"
osd: 48 - host:"cloudcephosd1015"
This goes on until 17:20.

2025-07-11_07:29 -> osd.286 (cloudcephosd1037, bookworm) crashes without an assertion, in this function:

"/lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7f8f2b25b050]",
"(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x17c) [0x56339f0ec7bc]",
There was a big commit latency peak on that osd at the time.

2025-07-11_11:26 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
2025-07-11_12:17 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1037.eqiad.wmnet with OS bullseye
2025-07-11_12:20 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
2025-07-11_12:30 START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bookworm
2025-07-11_12:30 START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bookworm
2025-07-11_12:30 START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bookworm
2025-07-11_13:10 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bookworm
2025-07-11_13:11 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
2025-07-11_13:14 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bookworm
2025-07-11_13:20 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1050.eqiad.wmnet with OS bookworm
2025-07-11_13:41 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1037.eqiad.wmnet with OS bullseye
2025-07-11_13:48 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1037.eqiad.wmnet with OS bullseye
2025-07-11_13:48 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
2025-07-11_13:55 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
2025-07-11_14:09 sudo swapoff /dev/md1 on cloudcephosd1036 T399281
2025-07-11_14:24 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1037.eqiad.wmnet with OS bullseye
2025-07-11_14:25 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
2025-07-11_15:06 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1037.eqiad.wmnet with OS bullseye
2025-07-11_15:09:21 to 2025-07-11_15:09:30 -> osd.{280-287}, 43 crashes total (cloudcephosd1037, bullseye) crash with ./src/os/bluestore/BlueStore.cc: In function 'int BlueStore::_upgrade_super()' thread <tread-id> time <timestamp>.cc: 10647: FAILED ceph_assert(ondisk_format > 0)\n
2025-07-11_15:38 START - Cookbook sre.hosts.reimage for host cloudcephosd1036.eqiad.wmnet with OS bullseye
2025-07-11_15:56 START - Cookbook sre.hosts.reimage for host cloudcephosd1036.eqiad.wmnet with OS bullseye
2025-07-11_16:46 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1036.eqiad.wmnet with OS bullseye
2025-07-11_16:51 START - Cookbook sre.hosts.reimage for host cloudcephosd1035.eqiad.wmnet with OS bullseye
2025-07-11_16:51 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1035.eqiad.wmnet with OS bullseye
2025-07-11_17:20 Last slow op mon.cloudcephmon1005 mon.0 266549 : [WRN] SLOW_OPS: 4 slow ops, oldest one blocked for 37 sec, osd.254 has slow ops
2025-07-11_17:32 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1035.eqiad.wmnet with OS bullseye
2025-07-11_17:39 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bullseye
2025-07-11_17:45 START - Cookbook sre.hosts.reimage for host cloudcephosd1007.eqiad.wmnet with OS bullseye
2025-07-11_17:48 START - Cookbook sre.hosts.reimage for host cloudcephosd1008.eqiad.wmnet with OS bullseye
2025-07-11_18:20 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1006.eqiad.wmnet with OS bullseye
2025-07-11_18:23 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1007.eqiad.wmnet with OS bullseye
2025-07-11_18:24 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1008.eqiad.wmnet with OS bullseye

cloudcephosd1006 now has a full rebuild of all OSDs and is running pacific and bookworm again. And it is crashing again, with unbounded memory use.

This comment should probably go to the follow-up task T399858: Cloud Ceph misbehaving on Debian Bookworm.