
High pod latency affecting several dse-k8s-worker nodes in eqiad C/D rows
Closed, Duplicate · Public

Assigned To
Authored By
BTullis
Nov 12 2025, 12:46 PM

Description

This incident appears to have been triggered by the brief outage described in T409800: Row C traffic outage Nov 11 2025, during which row C in eqiad suffered a network outage for around 9 minutes at 03:09.

Although connectivity to row C was restored within minutes, several of the Kubernetes workers in the dse-k8s-eqiad cluster have exhibited long delays and timeouts when launching pods ever since the incident.
As a result, these hosts have had to be cordoned, so they cannot currently accept workloads.
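
For reference, cordoning a node sets `spec.unschedulable` on the Node object, so the set of cordoned workers can be read from the node list. The following is a minimal sketch, assuming the output of `kubectl get nodes -o json` has been loaded as a dict. The function name `cordoned_nodes` and the sample node names are hypothetical and are not taken from this cluster.

```python
import json


def cordoned_nodes(node_list: dict) -> list[str]:
    """Return the sorted names of nodes marked unschedulable (cordoned)."""
    names = []
    for node in node_list.get("items", []):
        # `kubectl cordon` sets spec.unschedulable to true; the field is
        # normally absent on schedulable nodes.
        if node.get("spec", {}).get("unschedulable", False):
            names.append(node["metadata"]["name"])
    return sorted(names)


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
SAMPLE_NODES = json.loads("""
{"items": [
  {"metadata": {"name": "worker-a"}, "spec": {"unschedulable": true}},
  {"metadata": {"name": "worker-b"}, "spec": {}},
  {"metadata": {"name": "worker-c"}, "spec": {"unschedulable": true}}
]}
""")

if __name__ == "__main__":
    print(cordoned_nodes(SAMPLE_NODES))
```

Working from the JSON rather than the human-readable `kubectl get nodes` table avoids having to parse the `SchedulingDisabled` status text.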

The table below shows the pod worker latency for several dse-k8s-worker nodes increasing dramatically after the incident.
These values dropped to zero when each node was cordoned, but they rise again whenever the node is uncordoned.

These three servers are all in row C and are all connected to the asw2-c-eqiad Juniper virtual chassis, which is due to be replaced as part of T405562: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs.

However, there are at least two anomalies.

  • dse-k8s-worker1013 is also connected to asw2-c5-eqiad and does not show any increased pod latency.
    image.png (1×2 px, 659 KB)
  • dse-k8s-worker1004 did not launch any pods for several hours after the incident. After it was rebooted, it exhibited higher latency: not as high as 2 minutes, but three times higher than normal.
    image.png (1×2 px, 366 KB)
    This host is in rack D4, i.e. in row D rather than row C.

We can also see the extended time that it takes to create a pod on this node immediately after uncordoning it.

image.png (1×2 px, 762 KB)
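
As a rough cross-check of the graph above, per-pod startup time can be approximated from the pod objects themselves, as the interval between `metadata.creationTimestamp` and the `lastTransitionTime` of the `Ready` condition. The following is a minimal sketch, assuming the output of `kubectl get pods -A -o json` has been loaded as a dict. The function names and sample data are hypothetical. Note that this interval also includes scheduling and image pull time, and that `lastTransitionTime` records only the most recent transition, so it is not identical to the kubelet pod worker latency shown in the dashboards.

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2025-11-12T12:46:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def startup_seconds(pod: dict):
    """Seconds from pod creation until Ready became True, or None if not Ready."""
    created = parse_ts(pod["metadata"]["creationTimestamp"])
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready" and cond.get("status") == "True":
            return (parse_ts(cond["lastTransitionTime"]) - created).total_seconds()
    return None


def startup_by_node(pod_list: dict) -> dict:
    """Map node name -> list of startup durations for Ready pods on that node."""
    result = {}
    for pod in pod_list.get("items", []):
        node = pod.get("spec", {}).get("nodeName")
        secs = startup_seconds(pod)
        if node is not None and secs is not None:
            result.setdefault(node, []).append(secs)
    return result


# Hypothetical sample shaped like `kubectl get pods -o json` output.
SAMPLE_PODS = {"items": [
    {"metadata": {"creationTimestamp": "2025-11-12T12:00:00Z"},
     "spec": {"nodeName": "worker-a"},
     "status": {"conditions": [
         {"type": "Ready", "status": "True",
          "lastTransitionTime": "2025-11-12T12:00:05Z"}]}},
    {"metadata": {"creationTimestamp": "2025-11-12T12:00:00Z"},
     "spec": {"nodeName": "worker-b"},
     "status": {"conditions": [
         {"type": "Ready", "status": "True",
          "lastTransitionTime": "2025-11-12T12:02:10Z"}]}},
    # A pod that has not become Ready yet is skipped.
    {"metadata": {"creationTimestamp": "2025-11-12T12:00:00Z"},
     "spec": {"nodeName": "worker-b"},
     "status": {"conditions": []}},
]}

if __name__ == "__main__":
    for node, durations in sorted(startup_by_node(SAMPLE_PODS).items()):
        print(node, max(durations))
```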

These pods perform one or more CephFS mounts at creation time, so this could be related.
However, the graphs below suggest that the configmaps are also taking a long time to sync.

image.png (1×2 px, 535 KB)
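
One way to see whether volume setup is failing or timing out on particular workers is to count the kubelet's `FailedMount` events per node. The following is a minimal sketch, assuming the output of `kubectl get events -A -o json` has been loaded as a dict and that the events carry the core/v1 `source.host` and `count` fields. The function name and sample data are hypothetical, and whether the affected nodes actually emit such events has not been confirmed here.

```python
from collections import Counter


def mount_failures_by_node(event_list: dict) -> Counter:
    """Count FailedMount events per reporting host."""
    counts = Counter()
    for ev in event_list.get("items", []):
        if ev.get("reason") != "FailedMount":
            continue
        # The kubelet records its node name in source.host.
        host = (ev.get("source") or {}).get("host", "unknown")
        # `count` may be missing or null; treat that as a single occurrence.
        counts[host] += ev.get("count") or 1
    return counts


# Hypothetical sample shaped like `kubectl get events -o json` output.
SAMPLE_EVENTS = {"items": [
    {"reason": "FailedMount", "source": {"host": "worker-a"}, "count": 3},
    {"reason": "FailedMount", "source": {"host": "worker-a"}, "count": None},
    {"reason": "Scheduled", "source": {"component": "default-scheduler"}},
    {"reason": "FailedMount", "source": {"host": "worker-b"}},
]}

if __name__ == "__main__":
    for host, n in mount_failures_by_node(SAMPLE_EVENTS).most_common():
        print(host, n)
```

The event `message` field would still need to be read by hand to tell a slow configmap volume apart from a slow CephFS mount, since this sketch only groups by node.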

I have performed a full reboot of the cephosd100[1-5] cluster, as well as reboots of dse-k8s-worker10[03,04,11,19], but this has not changed the behaviour.
All other k8s nodes seem unaffected, apart from a bump in resource usage and latency from running the workload of the four cordoned hosts.

This is the current state:

image.png (544×1 px, 366 KB)