- Toolforge tools were not responding to HTTP requests (tools-proxy-9 was returning an error page)
- We found that Ceph had been having intermittent issues since the previous night, after some hosts were upgraded to Bookworm
- This caused intermittent issues for both Toolforge and Cloud VPS
- We downgraded the 6 hosts that had previously been upgraded to Bookworm back to Bullseye:
- cloudcephosd1006
- cloudcephosd1007
- cloudcephosd1008
- cloudcephosd1035
- cloudcephosd1036
- cloudcephosd1037
Incident doc: https://docs.google.com/document/d/1CLY_iZyXDTyJEl4fKYeU1aRSNsheO9-TZcjyW9wFyEk/edit?tab=t.0#heading=h.nz4dlhpgbsjm
Incident Report: https://wikitech.wikimedia.org/wiki/Incidents/2025-07-11_WMCS_Ceph_issues_causing_Toolforge_and_Cloud_VPS_failures
Timeline (UTC)
01:59 PROBLEM - SSH on cloudcephosd1035 is CRITICAL
02:08 RECOVERY - SSH on cloudcephosd1035 is OK
04:20 PROBLEM - SSH on cloudcephosd1036 is CRITICAL
05:02 RECOVERY - SSH on cloudcephosd1036 is OK
05:28 PROBLEM - SSH on cloudcephosd1037 is CRITICAL
05:31 RECOVERY - SSH on cloudcephosd1037 is OK
07:13 Many Cloud VPS hosts reported down by Prometheus, but nobody notices
07:14 PROBLEM - SSH on cloudcephosd1037 is CRITICAL
07:16 FIRING: CephSlowOps: Ceph cluster in eqiad has 1451 slow ops
07:19 FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra
07:20 Cloud VPS back to normal (no hosts reported down)
07:21 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 779 slow ops
07:24 RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra
07:27 FIRING: CephSlowOps: Ceph cluster in eqiad has 908 slow ops
07:30 RECOVERY - SSH on cloudcephosd1037 is OK
07:32 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 908 slow ops
08:08 Many Cloud VPS hosts reported down again
08:10 FIRING: CephSlowOps: Ceph cluster in eqiad has 1678 slow ops
08:13 PROBLEM - SSH on cloudcephosd1036 is CRITICAL
08:18 wmcs-dnsleaks fails on cloudcontrol1007 (possibly unrelated)
08:18 FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra
08:19 Cloud VPS back to normal (no hosts reported down)
08:20 Manuel reports switchmaster.toolforge.org is down
08:23 RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra
08:23 FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4)
08:27 lucas.werkmeister@wikimedia.de reports all tools are returning an error from tools-proxy-9
08:39 <lucaswerkmeister> I can SSH into tools-proxy-9, the only failed systemd unit is logrotate which judging by the journal has been broken for a long time, probably not related
08:44 <lucaswerkmeister> I think tools-proxy-9 times out trying to reach k8s.tools.eqiad1.wikimedia.cloud in turn
08:44 <lucaswerkmeister> I can SSH into that one too, no high load there either
08:52 Incident opened. Francesco Negri becomes IC.
08:55 Toolforge is working again. No action was taken.
08:58 RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4)
09:01 Incident is resolved (temporarily)
09:25 Francesco Negri starts the wmcs.toolforge.k8s.reboot cookbook for tools-k8s-worker-nfs-77, tools-k8s-worker-nfs-68, tools-k8s-worker-nfs-37, as they were alerting with “many processes in D state”
09:28 SSH on cloudcephosd1036 is OK
09:39 PROBLEM - SSH on cloudcephosd1035 is CRITICAL
09:42 RECOVERY - SSH on cloudcephosd1035 is OK
10:20 PROBLEM - SSH on cloudcephosd1036 is CRITICAL
10:21 FIRING: CephSlowOps: Ceph cluster in eqiad has 847 slow ops
10:24 RECOVERY - SSH on cloudcephosd1036 is OK
10:26 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 1386 slow ops
10:28 FIRING: CephSlowOps: Ceph cluster in eqiad has 5134 slow ops
10:33 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 1272 slow ops
11:17 <andrewbogott> I'm still half asleep and haven't read the backscroll, but my emails suggest that ceph pacific + bookworm + ceph traffic is a bad combination.
11:18 <andrewbogott> So probably the fix for this is for me to downgrade those hosts back to bullseye.
11:23 <andrewbogott> The bookworm hosts are 1006-1008,1035-1037
11:26 cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037
11:35 SSH on cloudcephosd1008 is CRITICAL
11:41 SSH on cloudcephosd1008 is OK
11:41 FIRING: WidespreadInstanceDown
11:46 RESOLVED: WidespreadInstanceDown
12:13 SSH on cloudcephosd1035 is CRITICAL
12:17 cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1037
12:18 FIRING: WidespreadInstanceDown
12:20 SSH on cloudcephosd1035 is OK
12:20 cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037
12:23 RESOLVED: WidespreadInstanceDown
12:44 cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1037
13:11 cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host cloudcephosd1037
14:12 FIRING: WidespreadInstanceDown
14:24 Reopening the incident
14:29 <dhinus> things seem to get worse after 14:12 UTC
14:30 <andrewbogott> 1007 is frozen right now. So we /do/ have two [ceph hosts] down at once, which could maybe explain current bad behavior.
14:41 <dhinus> we have now 9 OSDs down (compared to 16 before) – [one ceph host recovered]
14:58 (Slack) https://wikipedialibrary.wmflabs.org/ is down right now too.
15:08 (Slack) Seeing failures on catalyst environments too
15:10 Many Cloud VPS VMs are down (29% of VMs in project “tools”)
15:14 <dhinus> many cloud vps VMs are still not working, and are not recovering for <reasons>
15:16 <dhinus> ceph IOPS are at about 50% of what they were this morning
15:16 <andrewbogott> ok, 1037 is up, now there's just a bit of pg shuffling to do before we can stop another host
15:16 <dhinus> do you know why we still have 1 OSD down?
15:18 <andrewbogott> I just checked, that's on 1013 which as far as I know hasn't suffered any recent maintenance. The down OSD is associated with a volume that doesn't appear in lsblk so... a mystery but /probably/ an unrelated one.
15:19 <andrewbogott> ceph is still recovering, down to 514 pgs
15:23 <dhinus> I tried manually rebooting a couple of VMs, and they do come back... but it will take a looooong time if we need to reboot all manually
15:29 <dhinus> count(up{job="node"} == 0) is finally looking good – All VMs are now reporting as healthy (see the query sketch after the timeline)
15:34 <+wm-bb> <Vincent> My tool is up and running now, thank you :)
15:47 <+wm-bb> <Yetkin> My tool is up and running as well 😊
15:50 Handing off IC to Andrew Bogott
16:09 Alertmanager is complaining about OpenstackAPIResponse; slow response times only for designate-api
16:11 <andrewbogott> I'm restarting designate services
17:25 Finished reimaging of cloudcephosd1035. Remaining Bookworm OSD nodes are cloudcephosd100[6-8].
18:33 All OSDs back to Bullseye, ceph shows no misplaced objects.
18:33 Incident closed
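
Note: the recovery check quoted at 15:29 is a plain Prometheus query. Below is a minimal sketch of running the same check against the Prometheus HTTP API; only the count(up{job="node"} == 0) query itself is taken from the timeline, and the Prometheus URL is a placeholder rather than the actual WMCS endpoint.

    # Sketch only: count node-exporter targets reporting down, mirroring the
    # count(up{job="node"} == 0) check from the 15:29 timeline entry.
    # PROMETHEUS_URL is a placeholder, not the real WMCS Prometheus address.
    import requests

    PROMETHEUS_URL = "http://prometheus.example.org:9090"

    def count_down_instances() -> int:
        resp = requests.get(
            f"{PROMETHEUS_URL}/api/v1/query",
            params={"query": 'count(up{job="node"} == 0)'},
            timeout=10,
        )
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        # An empty result vector means the count is zero, i.e. nothing is down.
        return int(float(result[0]["value"][1])) if result else 0

    if __name__ == "__main__":
        print(f"{count_down_instances()} node-exporter targets are down")

A result of 0 corresponds to the 15:29 observation that all VMs were reporting as healthy again.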