This incident seems to have been triggered by T409800: Row C traffic outage Nov 11 2025, during which row C in eqiad lost network connectivity for around 9 minutes starting at 03:09.
Although connectivity to row C was restored within minutes, several of the Kubernetes workers in the dse-k8s-eqiad cluster have been exhibiting long delays and timeouts when launching pods ever since the incident.
As a result, these hosts have had to be cordoned, so they cannot currently accept any workload.
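For reference, taking a node out of scheduling is the standard kubectl operation; something along these lines was used (the exact invocation may have differed):

```
# Mark the node unschedulable so no new pods land on it
kubectl cordon dse-k8s-worker1003

# Optionally evict the pods already running there so they reschedule elsewhere
kubectl drain dse-k8s-worker1003 --ignore-daemonsets --delete-emptydir-data

# Reverse it once the node behaves again
kubectl uncordon dse-k8s-worker1003
```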
The table below shows the pod worker latency for several dse-k8s-worker nodes increasing dramatically after the incident.
These values drop to zero while a node is cordoned, but rise again as soon as it is uncordoned.
| Hostname | Pod worker latency (link) | Rack | Switch |
| --- | --- | --- | --- |
| dse-k8s-worker1003 | here | C4 | asw2-c4-eqiad |
| dse-k8s-worker1011 | here | C5 | asw2-c5-eqiad |
| dse-k8s-worker1019 | here | C6 | asw2-c6-eqiad |
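For anyone who wants to reproduce the linked graphs: I believe the pod worker latency panels are based on the kubelet's `kubelet_pod_worker_duration_seconds` histogram, so a query along these lines against Prometheus should show the same jump (the metric name, labels and Prometheus URL are assumptions on my part):

```
# p99 pod worker (sync) latency per node over 5m windows for the three affected workers
curl -sG 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (le, instance) (rate(kubelet_pod_worker_duration_seconds_bucket{instance=~"dse-k8s-worker10(03|11|19).*"}[5m])))'
```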
These three servers are all in row C and are all connected to the asw2-c-eqiad Juniper virtual chassis, which is due to be replaced in T405562: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs.
However, there are at least two anomalies.
- dse-k8s-worker1013 is also connected to asw2-c5-eqiad and does not show any increased pod latency.
- dse-k8s-worker1004 did not launch any pods for several hours immediately after the incident; once rebooted, it exhibited higher latency than before. Not as high as 2 minutes, but roughly three times higher than normal. This host is in rack D4, i.e. not in row C at all.
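The rack/switch placement above can also be double-checked from the hosts themselves via LLDP (assuming lldpd is running there, which I believe it is on our hardware):

```
# Shows the switch and port this host is physically cabled to
sudo lldpcli show neighbors
```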
We can also see the extended time it takes to create a pod on this node immediately after uncordoning it.
These pods perform one or more CephFS mounts at creation time, so this could be related.
However, these graphs suggest that the configmaps are also taking a long time to sync.
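To narrow down where the time actually goes (CephFS mount vs. configmap sync vs. something else), the pod's event timeline and the kubelet logs on the node are probably the quickest things to compare; roughly (namespace/pod names below are placeholders):

```
# Event timeline for one of the slow pods
kubectl -n <namespace> describe pod <slow-pod> | sed -n '/Events:/,$p'

# On the affected worker: kubelet activity around pod start
sudo journalctl -u kubelet --since "2025-11-11 03:00" | grep -Ei 'ceph|mount|configmap|timeout'
```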
I have performed a full reboot of the cephosd100[1-5] cluster, as well as reboots of dse-k8s-worker1003, 1004, 1011, and 1019, but this has not changed the behaviour.
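For anyone retracing this, the Ceph side can be sanity-checked after those reboots with the usual commands (run from a mon or any host with admin credentials):

```
# Overall cluster health and any degraded/stuck PGs
sudo ceph -s
sudo ceph health detail

# Confirm all OSDs on cephosd1001-1005 came back up and in
sudo ceph osd tree
```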
All other k8s nodes seem unaffected, apart from a bump in resource usage and latency from absorbing the workload of the four cordoned hosts.
This is the current state:
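The quickest way to view it is a plain node listing; the four cordoned workers show up with SchedulingDisabled in the STATUS column:

```
kubectl get nodes -o wide
```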