While working on {T306820}, we observed the following:
* all cloudceph hosts were upgraded to Ceph v16, no issues
* 1 cloudcephmon and 6 cloudcephosd hosts were upgraded to Debian Bookworm, no immediate issues
* after a few hours, the cluster started misbehaving: {T399281}
* after downgrading all 6 hosts to Bullseye, things went back to normal
This task is to investigate what caused Ceph to misbehave on the upgraded hosts.
Some graphs from the [incident doc](https://docs.google.com/document/d/1CLY_iZyXDTyJEl4fKYeU1aRSNsheO9-TZcjyW9wFyEk/edit?tab=t.0):
CPU usage on the affected hosts is high. This shows both in the [[ https://grafana-rw.wikimedia.org/d/000000607/cluster-overview?forceLogin&from=now-24h&orgId=1&timezone=utc&to=now&var-cluster=wmcs&var-instance=cloudcephosd1004&var-instance=cloudcephosd1005&var-instance=cloudcephosd1006&var-instance=cloudcephosd1007&var-instance=cloudcephosd1008&var-instance=cloudcephosd1009&var-instance=cloudcephosd1010&var-instance=cloudcephosd1011&var-instance=cloudcephosd1012&var-instance=cloudcephosd1013&var-instance=cloudcephosd1014&var-instance=cloudcephosd1015&var-instance=cloudcephosd1016&var-instance=cloudcephosd1017&var-instance=cloudcephosd1018&var-instance=cloudcephosd1019&var-instance=cloudcephosd1020&var-instance=cloudcephosd1021&var-instance=cloudcephosd1022&var-instance=cloudcephosd1023&var-instance=cloudcephosd1024&var-instance=cloudcephosd1026&var-instance=cloudcephosd1027&var-instance=cloudcephosd1028&var-instance=cloudcephosd1029&var-instance=cloudcephosd1030&var-instance=cloudcephosd1031&var-instance=cloudcephosd1032&var-instance=cloudcephosd1033&var-instance=cloudcephosd1034&var-instance=cloudcephosd1035&var-instance=cloudcephosd1036&var-instance=cloudcephosd1037&var-instance=cloudcephosd1038&var-instance=cloudcephosd1039&var-instance=cloudcephosd1040&var-instance=cloudcephosd1041&var-site=eqiad&viewPanel=panel-910 | percentiles ]] graph and [[ https://grafana.wikimedia.org/goto/eT3zyesNg?orgId=1 | individually ]]:
{F65056617}
{F65056905}
[[ https://grafana.wikimedia.org/goto/CBq1E6yHg?orgId=1 | Memory usage ]] is also high on the affected hosts, which explains the swap usage and the md resync (which happens on first access):
{F65056708}
[[ https://grafana.wikimedia.org/goto/FcEv86sHR?orgId=1 | Running processes ]] look abnormal on the affected hosts:
{F65056754}
And similarly for [[ https://grafana.wikimedia.org/d/000000377/host-overview?from=2025-07-09T15:03:27.925Z&to=2025-07-11T15:03:27.925Z&timezone=utc&var-site=eqiad&var-cluster=wmcs&var-instance=cloudcephosd1004&var-instance=cloudcephosd1005&var-instance=cloudcephosd1006&var-instance=cloudcephosd1007&var-instance=cloudcephosd1008&var-instance=cloudcephosd1009&var-instance=cloudcephosd1010&var-instance=cloudcephosd1011&var-instance=cloudcephosd1012&var-instance=cloudcephosd1013&var-instance=cloudcephosd1014&var-instance=cloudcephosd1015&var-instance=cloudcephosd1016&var-instance=cloudcephosd1017&var-instance=cloudcephosd1018&var-instance=cloudcephosd1019&var-instance=cloudcephosd1020&var-instance=cloudcephosd1021&var-instance=cloudcephosd1022&var-instance=cloudcephosd1023&var-instance=cloudcephosd1024&var-instance=cloudcephosd1026&var-instance=cloudcephosd1027&var-instance=cloudcephosd1028&var-instance=cloudcephosd1029&var-instance=cloudcephosd1030&var-instance=cloudcephosd1031&var-instance=cloudcephosd1032&var-instance=cloudcephosd1033&var-instance=cloudcephosd1034&var-instance=cloudcephosd1035&var-instance=cloudcephosd1036&var-instance=cloudcephosd1037&var-instance=cloudcephosd1038&var-instance=cloudcephosd1039&var-instance=cloudcephosd1040&var-instance=cloudcephosd1041&orgId=1&var-server=cloudcephosd1035&var-datasource=000000026&refresh=5m&viewPanel=panel-6 | disk utilization ]]:
{F65056781}