While working on T306820: [ceph] Upgrade to v16, we observed the following:
- all cloudceph hosts were upgraded to Ceph v16, no issues
- 1 cloudcephmon and 6 cloudcephosd hosts were upgraded to Debian Bookworm, no immediate issues
- after a few hours, serious problems started: T399281: 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures
- after downgrading all 6 hosts to Bullseye, things went back to normal
This task is to investigate what caused Ceph to misbehave on the upgraded hosts.
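For the investigation, it may help to pull the per-host metrics programmatically instead of eyeballing the dashboards. Below is a minimal sketch assuming a node_exporter-backed Prometheus instance; the PROMETHEUS_URL value and the cloudcephosd instance regex are placeholders to adjust, not the actual WMCS endpoint.

```
#!/usr/bin/env python3
"""Compare CPU usage across cloudcephosd hosts via the Prometheus
HTTP API, to spot the upgraded (Bookworm) hosts standing out.

PROMETHEUS_URL and the instance regex are assumptions -- adjust them
to the real monitoring endpoint and host naming.
"""
import requests

PROMETHEUS_URL = "http://prometheus.example.org"  # assumption: replace with real endpoint

# Standard node_exporter metric: per-host CPU usage (% non-idle).
QUERY = (
    '100 * (1 - avg by (instance)'
    '(rate(node_cpu_seconds_total{mode="idle",'
    'instance=~"cloudcephosd.*"}[5m])))'
)

def cpu_usage_by_host() -> dict[str, float]:
    """Return {instance: cpu_usage_percent} for matching hosts."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=30
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return {r["metric"]["instance"]: float(r["value"][1]) for r in result}

if __name__ == "__main__":
    for host, usage in sorted(cpu_usage_by_host().items()):
        print(f"{host}: {usage:.1f}% CPU")
```

The same query shape works for memory, process counts, and disk utilization by swapping in the corresponding node_exporter metric.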
Some graphs from the incident doc:
- CPU usage on the affected hosts is high; this is visible both in the percentiles graph and in the per-host graphs.
- Memory usage is also high on the affected hosts, explaining the swap usage and the md resync (which happens on first access); a quick way to check both directly on a host is sketched after this list.
- The running-processes graph looks really unusual for the affected hosts.
- Disk utilization shows a similar pattern.
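To verify the md resync and swap symptoms on a suspect host independently of the graphs, a small script can read the standard /proc interfaces. This is a minimal sketch using only stock kernel files; no WMCS-specific paths or names are assumed.

```
#!/usr/bin/env python3
"""Check whether an md array is resyncing and how much swap is in
use -- the two symptoms called out above. Reads /proc only, so it
needs no extra packages; run it on the host being inspected.
"""
from pathlib import Path

def md_resync_active() -> bool:
    """True if /proc/mdstat reports a resync/recovery/check in progress."""
    mdstat = Path("/proc/mdstat").read_text()
    return any(word in mdstat for word in ("resync", "recover", "check"))

def swap_used_kb() -> int:
    """Swap in use (kB): SwapTotal - SwapFree from /proc/meminfo."""
    fields = {}
    for line in Path("/proc/meminfo").read_text().splitlines():
        key, _, rest = line.partition(":")
        fields[key] = int(rest.split()[0])  # values are reported in kB
    return fields["SwapTotal"] - fields["SwapFree"]

if __name__ == "__main__":
    print(f"md resync in progress: {md_resync_active()}")
    print(f"swap used: {swap_used_kb()} kB")
```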