Today we got an alert about Toolforge k8s workers with many processes in D state (uninterruptible sleep):
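For reference, "D procs" are processes whose state in `/proc/<pid>/stat` is `D`. A minimal sketch of how one might count them on a worker follows. This is not the actual alert implementation, which is not shown in this task; the function names `parse_state` and `count_d_procs` are hypothetical.

```python
import os


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the LAST ')' before reading fields.
    rest = stat_line.rsplit(")", 1)[1]
    return rest.split()[0]


def count_d_procs(proc_root: str = "/proc") -> int:
    """Count processes currently in uninterruptible sleep (state D)."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are PIDs
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                if parse_state(f.read()) == "D":
                    count += 1
        except OSError:
            # The process exited between listdir() and open(), or the
            # file is not readable; skip it.
            continue
    return count


if __name__ == "__main__":
    print(count_d_procs())
```

On NFS workers a persistently high count usually points at blocked I/O rather than CPU load, which is consistent with the storage-related tasks linked below.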
Description
| Status | Assigned | Task |
|---|---|---|
| Open | None | T403043 [ceph] 2025-08-27 ceph outage when bringing in a big osd host all at once (cloudcephosd1048) |
| Resolved | dcaro | T373632 CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes |
| Resolved | aborrero | T374612 toolforge: workers with many D procs (2024-09-12 edition) |
Event Timeline
Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-12T11:37:26Z] <aborrero@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-28 (T374612)
Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-12T11:42:57Z] <aborrero@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-28 (T374612)
Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-12T11:48:47Z] <aborrero@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-23, tools-k8s-worker-16, tools-k8s-worker-nfs-33 (T374612)
Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-12T11:54:13Z] <aborrero@cloudcumin1001> END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-nfs-23, tools-k8s-worker-16, tools-k8s-worker-nfs-33 (T374612)
Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-12T11:59:42Z] <aborrero@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-16, tools-k8s-worker-nfs-33 (T374612)
Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-12T12:06:00Z] <aborrero@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-16, tools-k8s-worker-nfs-33 (T374612)
