
High pod latency affecting several dse-k8s-worker nodes in eqiad C/D rows
Closed, ResolvedPublic

Assigned To
Authored By
BTullis
Nov 12 2025, 12:46 PM

Description

This incident appears to have been triggered by the brief outage T409800: Row C traffic outage Nov 11 2025, during which row C in eqiad suffered a network outage for around 9 minutes, starting at 03:09.

Although connectivity to row C was restored within minutes, several of the Kubernetes workers in the dse-k8s-eqiad cluster have been exhibiting long delays and timeouts when launching pods, ever since the incident.
The result is that these hosts have had to be cordoned from the cluster, so they cannot currently accept workload.
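For reference, the cordoned nodes and the cordon/uncordon operations themselves can be checked from the deployment server with standard kubectl, for example:

# Cordoned nodes show up with SchedulingDisabled in their status
kubectl get nodes | grep SchedulingDisabled
# Cordon / uncordon a single worker
kubectl cordon dse-k8s-worker1003.eqiad.wmnet
kubectl uncordon dse-k8s-worker1003.eqiad.wmnet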

The table below shows the pod worker latency for several dse-k8s-worker nodes increasing dramatically after the incident.
These values dropped to zero when a node was cordoned, but rose again whenever it was uncordoned.

These three servers are all in row C and are all connected to the asw2-c-eqiad Juniper virtual chassis, which is due to be replaced in T405562: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs

However, there are at least two anomalies.

  • dse-k8s-worker1013 is also connected to asw2-c5-eqiad and does not show any increased pod latency.
    image.png (1×2 px, 659 KB)
  • dse-k8s-worker1004 did not launch any pods for several hours immediately after the incident; when it was rebooted, it exhibited higher latency. Not as high as 2 minutes, but roughly three times higher than normal.
    image.png (1×2 px, 366 KB)
    This host is in rack D4.

We can also see the extended time that it takes to create a pod on this node immediately after uncordoning it.

image.png (1×2 px, 762 KB)

These pods perform one or more cephfs mounts as part of pod creation, so this could be related.
However, these graphs indicate that the configmaps may also be taking a long time to sync.

image.png (1×2 px, 535 KB)
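One way to narrow down whether it is the cephfs mounts or the configmap sync would be to compare a slow pod's event timestamps with the kubelet log on the affected node, along these lines (illustrative; pod name and namespace are placeholders, and this assumes the kubelet runs as a systemd unit named kubelet):

# Pod events carry timestamps for scheduling, volume mount and container start
kubectl -n <namespace> describe pod <pod-name>
# Kubelet-side view of slow cephfs mounts or configmap syncs since the incident
sudo journalctl -u kubelet --since "2025-11-11 03:00" | grep -Ei 'cephfs|configmap|timed out'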

I have performed a full reboot of the cephosd100[1-5] cluster, as well as reboots of dse-k8s-worker10[03,04,11,19], but this hasn't changed anything.
All other k8s nodes seem unaffected, other than having a resource and latency bump from running the workload of the four cordoned hosts.

This is the current state:

image.png (544×1 px, 366 KB)

Event Timeline

Thanks for the detailed analysis @BTullis. This is definitely very concerning.

In terms of overall network activity on, for example, dse-k8s-worker1003, the basic tests I've been able to run haven't turned up any issues so far. It seems able to talk to other hosts on the network without showing much latency:

cmooney@dse-k8s-worker1003:~$ mtr -b -w -c 10 dse-k8s-worker1008.eqiad.wmnet 
Start: 2025-11-12T13:16:11+0000
HOST: dse-k8s-worker1003                                             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (2620:0:861:103:fe00::1)  0.0%    10    0.5   0.5   0.3   0.6   0.1
  2.|-- et-0-0-31-100.ssw1-e1-eqiad.wikimedia.org (2620:0:861:fe07::2)  0.0%    10    9.8   2.3   0.7   9.8   3.0
  3.|-- irb-1037.lsw1-f3-eqiad.eqiad.wmnet (2620:0:861:10f::1)          0.0%    10    0.5   4.0   0.5  12.3   5.2
  4.|-- dse-k8s-worker1008.eqiad.wmnet (2620:0:861:10f:10:64:136:7)     0.0%    10    0.2   0.2   0.2   0.3   0.0
cmooney@dse-k8s-worker1003:~$ mtr -b -w -c 10 dse-k8s-worker1004.eqiad.wmnet 
Start: 2025-11-12T13:22:25+0000
HOST: dse-k8s-worker1003                                             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (2620:0:861:103:fe00::1)  0.0%    10    0.5   0.5   0.3   0.7   0.1
  2.|-- dse-k8s-worker1004.eqiad.wmnet (2620:0:861:107:10:64:48:52)     0.0%    10    0.3   0.3   0.2   0.3   0.0
cmooney@dse-k8s-worker1003:~$ mtr -b -w -c 10 dse-k8s-worker1013.eqiad.wmnet 
Start: 2025-11-12T13:18:49+0000
HOST: dse-k8s-worker1003                                          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- dse-k8s-worker1013.eqiad.wmnet (2620:0:861:103:10:64:32:93)  0.0%    10    0.2   0.3   0.2   0.5   0.1

But we will keep digging. Certainly it would appear the incident we had shortly after 3am on Nov 11th was the trigger for what we are seeing. I guess what I'm anxious to discover now is:

  • Did the network incident at that time trigger a problem that now persists, despite the network being ok?
  • Or is there an ongoing network problem somewhere?

If need be we can revert all the network changes done as part of the row C/D migration (see T405562 and T404609), however that is a fairly drastic move. And given that this problem did not occur immediately after the last of those changes (done last Friday), it is not clear they themselves are the root cause. Nevertheless, things were stable for a long time before we started making changes, so if it comes to it I've no problem doing that.

Tests

For now I'd like to see if we can reproduce any problematic behaviour, and try to assess what is going on at the network layer.

For the time being I have disabled one of the redundant uplinks between our new Nokia spine switches and asw2-c-eqiad, purely to see if we somehow have frames looping at layer 2 that could be playing a role (this seems unlikely, but some of the logs from the incident suggested it may be happening). Initial signs after taking that link down are that it has not made any difference to connectivity from dse-k8s-worker1003. It would be interesting to re-test some of the higher-level operations that were failing following this change, and see if the problems are still present.

We just did a test uncordoning of dse-k8s-worker1003.

Immediately upon uncordoning, lots of pods were assigned, but they got stuck in the ContainerCreating stage for tens of seconds.

image.png (614×2 px, 443 KB)

However, they do eventually run and without errors.

image.png (1×2 px, 818 KB)

This latency of 30-40 seconds does seem better than previously, where it was frequently timing out at 2 minutes.
i.e. This is the pod latency now:

image.png (1×2 px, 219 KB)

Compared with:
image.png (1×2 px, 555 KB)

However, 30-40 seconds is still very slow, compared with other nodes running identical workloads.

I don't know whether that'd be a valuable test or not, but we could try to reboot that host and uncordon it post reboot, to see if the issue persists.

> I don't know whether that'd be a valuable test or not, but we could try to reboot that host and uncordon it post reboot, to see if the issue persists.

Thanks for the suggestion. I did try that yesterday, but it looks like I didn't keep a good record of it.

btullis@dse-k8s-worker1003:~$ journalctl --list-boots
IDX BOOT ID                          FIRST ENTRY                 LAST ENTRY                 
 -1 b254226248b84211bc9b0a866752ccee Mon 2025-10-27 14:15:59 UTC Tue 2025-11-11 20:34:39 UTC
  0 db5d5c973fc34298a2520c99a27d63b4 Tue 2025-11-11 20:37:33 UTC Wed 2025-11-12 15:08:43 UTC

I think that @cmooney is going to re-enable the redundant uplink to the Nokia, to see if this makes the behaviour regress to the point it was at before.

dse-k8s-worker1003 moved to new switch stack per @cmooney's request.

Just a few more datapoints on this one.

Throughput test

For the sake of it I did an iperf test to dse-k8s-worker1003 from bast1003, and we had no problem maxing the bastion's available 1Gb/sec:

cmooney@dse-k8s-worker1003:~$ iperf -s -w512k
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 1000 KByte (WARNING: requested  500 KByte)
------------------------------------------------------------
[  1] local 10.64.32.178 port 5001 connected with 208.80.155.110 port 51976 (icwnd/mss/irtt=14/1448/161)
[  2] local 10.64.32.178 port 5001 connected with 208.80.155.110 port 51992 (icwnd/mss/irtt=14/1448/135)
[  3] local 10.64.32.178 port 5001 connected with 208.80.155.110 port 52002 (icwnd/mss/irtt=14/1448/124)
[  4] local 10.64.32.178 port 5001 connected with 208.80.155.110 port 52018 (icwnd/mss/irtt=14/1448/121)
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-30.0449 sec   843 MBytes   235 Mbits/sec
[  2] 0.0000-30.0469 sec   843 MBytes   235 Mbits/sec
[  1] 0.0000-30.0523 sec   843 MBytes   235 Mbits/sec
[  3] 0.0000-30.0483 sec   843 MBytes   235 Mbits/sec
[SUM] 0.0000-30.0519 sec  3.29 GBytes   941 Mbits/sec
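The client side on bast1003 was along these lines (reconstructed from the output above rather than copied from that session):

# Four parallel TCP streams for ~30 seconds towards the worker
iperf -c dse-k8s-worker1003.eqiad.wmnet -P 4 -w 512k -t 30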
Packet capture

We also took a pcap while @BTullis uncordoned the host. The file is a few GB in total so I can't put it here, but here are a few observations on the top conversations it contains (a CSV of them is here: P85317). The main conclusion is that everything looks ok; there is no evidence of a networking problem, or of any hosts the system is having problems communicating with. So unfortunately it doesn't get us any closer to knowing what is going on.
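For anyone else digging into the capture, the per-conversation totals below can be regenerated with something like this (filename is illustrative):

# Per-TCP-conversation byte and packet totals from the pcap
tshark -q -z conv,tcp -r dse-k8s-worker1003-uncordon.pcap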

10.64.32.85 to 10.67.27.251 - 123MB - TCP port 15020

prometheus1008 scraping istio-ingressgateway.istio-system.svc mostly seems ok, RTT also ok <2ms

10.64.48.171 to 10.67.27.251 - 123MB - TCP port 15020

prometheus1007 scraping istio-ingressgateway.istio-system.svc mostly seems ok, RTT also ok <2ms

10.2.2.73 - dse-k8s-ctrl.svc.eqiad.wmnet - 60MB TCP port 6443

The RTT seems high, up to 50ms in some cases. However the latency to this front-end IP is fine:

cmooney@dse-k8s-worker1003:~$ mtr -b -w -c 100 10.2.2.73 
Start: 2025-11-13T13:13:27+0000
HOST: dse-k8s-worker1003                                 Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (10.64.32.2)  0.0%   100    0.5   1.2   0.3  29.9   3.8
  2.|-- dse-k8s-ctrl.svc.eqiad.wmnet (10.2.2.73)            0.0%   100    0.1   0.1   0.1   0.3   0.0

If the problem were with the backend behind the load-balancer then all servers would be affected. So my guess is that this is application delay, and this RTT is normal.
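A quick way to separate network RTT from apiserver processing time against that VIP is curl's timing output, something like this (illustrative; even an unauthorised response is enough to get the timings):

# time_connect roughly reflects the network RTT of the TCP handshake;
# time_starttransfer additionally includes the apiserver's processing time
curl -sk -o /dev/null -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s\n' https://10.2.2.73:6443/healthz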

10.64.136.7 - dse-k8s-worker1008 - 47MB - TCP port 5473

This one looks rock-solid, though there are occasional RTT spikes. Again these do not present on the wire so I expect it is just when the application at the far side takes longer to respond. Zero DUPs or retransmits at all, target host is in rack F3.

cmooney@dse-k8s-worker1003:~$ mtr -b -w -c 100 dse-k8s-worker1008.eqiad.wmnet 
Start: 2025-11-13T14:41:43+0000
HOST: dse-k8s-worker1003                                             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (2620:0:861:103:fe00::1)  0.0%   100    0.4   1.0   0.3  35.0   4.1
  2.|-- et-0-0-31-100.ssw1-e1-eqiad.wikimedia.org (2620:0:861:fe07::2)  0.0%   100    0.7   3.7   0.4  33.7   6.5
  3.|-- irb-1037.lsw1-f3-eqiad.eqiad.wmnet (2620:0:861:10f::1)          0.0%   100    0.7   1.3   0.5  39.9   4.1
  4.|-- dse-k8s-worker1008.eqiad.wmnet (2620:0:861:10f:10:64:136:7)     0.0%   100    0.2   0.2   0.1   0.4   0.1

2620:0:861:102:10:64:16:86 - centrallog1002.eqiad.wmnet - 39MB - TCP port 6514

Again no signs of issues. RTT is consistently good, no DUPs/retransmits etc.

2620:0:860:102:10:192:16:35 - centrallog2002.codfw.wmnet - 37MB - TCP port 6514

Again seems ok. Higher RTT but it is going to codfw.

10.64.131.21 - cephosd1002.eqiad.wmnet - 32MB + 25MB - TCP port 6801

All the TCP 6801 traffic in the capture is to this host. Looks fine as far as I can see, no DUPs/retransmits etc.

Occasionally the RTT spikes to 40-50ms, however this also coincides with the arrival of larger segments. My reading of it is that most of the traffic is keepalive for the mount, the RTT increases when actual data is read or written, and this is probably ok. The RTT spikes don't seem visible on the wire:

cmooney@dse-k8s-worker1003:~$ mtr -4 -b -w -c 1000 cephosd1002.eqiad.wmnet. 
Start: 2025-11-14T13:13:18+0000
HOST: dse-k8s-worker1003                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (10.64.32.2)   0.0%  1000    0.3   0.7   0.2  40.7   3.0
  2.|-- et-0-0-31-100.ssw1-e1-eqiad.eqiad.wmnet (10.66.0.9)  0.0%  1000   10.8   6.0   0.6  57.6   6.7
  3.|-- irb-1032.lsw1-e2-eqiad.eqiad.wmnet (10.64.131.1)     0.0%  1000    4.2   7.5   0.5  88.4   6.0
  4.|-- cephosd1002.eqiad.wmnet (10.64.131.21)               0.0%  1000    0.2   0.1   0.1   0.3   0.1

2620:0:861:107:10:64:48:66 - kafka-logging1003.eqiad.wmnet - 9MB TCP port 9093

This also looks fine, no DUPs, retransmits etc. RTT is consistently under 0.5ms.

Postgresql-airflow - TCP 5432

There is also a bunch of traffic from the IPs of various pods running on the host to remote pods on TCP port 5432.

These destinations are all named 'postgresql-airflow-<something>'. Again no signs of issues.

an-worker - TCP 50010

There are also a bunch of connections from pod IPs to various an-worker hosts on TCP 50010. Again this seems fine, with no signs of problems.

We're still not much further forward with working out the root cause of this.

dse-k8s-worker1003 has been moved from asw2-c4-eqiad (Juniper) to lsw1-c4-eqiad (Nokia) - we then uncordoned it, but it didn't make any difference to the pod worker latency.
It's still taking 10s of seconds to launch any airflow task pods.

I think that the next step in troubleshooting will be to create some test case pods that only have configmaps loaded, but no cephfs volumes.
If we can start these pods on dse-k8s-worker1003, then we will be able to see if it is something specific to cephfs.

Perhaps it could be something else specific to the airflow task pod definition that is causing the delays.
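A minimal test pod for that could look something like the sketch below (the namespace, image and configmap name are placeholders; it pins directly to the node with spec.nodeName, which bypasses the scheduler, so it can be started even while the node is cordoned):

# Test pod with a configmap mount but no cephfs volume, pinned to the affected node
kubectl -n <test-namespace> apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pod-latency-test
spec:
  nodeName: dse-k8s-worker1003.eqiad.wmnet
  restartPolicy: Never
  containers:
  - name: sleep
    image: <small-base-image>
    command: ["sleep", "600"]
    volumeMounts:
    - name: config
      mountPath: /etc/test-config
  volumes:
  - name: config
    configMap:
      name: <existing-configmap>
EOF

If a pod like this starts quickly on dse-k8s-worker1003 while the airflow task pods remain slow, the cephfs mounts become the prime suspect.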

Here's the latest strategy on this issue.

We're going to look at uncordoning dse-k8s-worker1019 next. The reason for this is that it has 1 TB of memory and it would be useful to have this resource back in the cluster as soon as is practicable. This host is in rack C6 and it is still connected to one of the Juniper virtual chassis switches, asw2-c6-eqiad.

Whatever the outcome of this test, we are going to move on from simply cordoning the hosts, to using taints and tolerations instead.
By adding a suitable taint to the hosts that are showing this aberrant behaviour, we will be able to keep them in service, but only for certain test workloads that specify a matching toleration.
This will help us to debug the issue, without adversely affecting other workloads on the cluster.
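Concretely, that would be something along these lines (the taint key and value are placeholders we would still need to agree on):

# Taint the affected nodes so only pods with a matching toleration get scheduled there
kubectl taint nodes dse-k8s-worker1003.eqiad.wmnet dse-k8s/pod-latency-debug=true:NoSchedule
kubectl taint nodes dse-k8s-worker1004.eqiad.wmnet dse-k8s/pod-latency-debug=true:NoSchedule
# Test workloads would then carry the matching toleration in their pod spec:
#   tolerations:
#   - key: "dse-k8s/pod-latency-debug"
#     operator: "Equal"
#     value: "true"
#     effect: "NoSchedule"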

I have uncordoned dse-k8s-worker1019.

root@deploy2002:~# kubectl uncordon dse-k8s-worker1019.eqiad.wmnet
node/dse-k8s-worker1019.eqiad.wmnet uncordoned

Now keeping an eye on this graph to see if the pod worker latencies are still in the order of tens of seconds: https://grafana.wikimedia.org/goto/-zWcTFmvR?orgId=1

Latency for dse-k8s-worker1019 looks pretty good, now. Back to normal, I would say.

image.png (956×1 px, 131 KB)

Proceeding to uncordon dse-k8s-worker1011.

The latency for dse-k8s-worker1011 is also OK now.

image.png (937×1 px, 137 KB)

Proceeding to uncordon dse-k8s-worker1004, which is in rack D4.

Latency for dse-k8s-worker1004 is still considerably higher than normal, in the region of 12 to 14 seconds to start a pod.
https://grafana.wikimedia.org/goto/KprfPFmvR?orgId=1

image.png (957×1 px, 127 KB)

So it looks like it is just this host and dse-k8s-worker1003 that remain affected.

Looks like it's slowly getting better as well.

If we need to look more into it, I'd suggest running a packet capture (while filtering out as much as possible, so as not to end up with a multi-gigabyte file) during those 20s of pod creation time, to look at where the extra latency is.
Per Cathal's tests, the latency doesn't seem to be there once the TCP sessions are live, so it could be that the sessions somehow take longer to get created.
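For example, something like this (hosts and filter purely illustrative) keeps the file small while still catching the control-plane and Ceph traffic during a single pod creation:

# Capture only traffic to the k8s API VIP and one cephosd host, with a small ring buffer
sudo tcpdump -i any -w /tmp/podcreate.pcap -C 100 -W 5 'host 10.2.2.73 or host 10.64.131.21'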

Thanks @ayounsi - Yes, I think that it is generally getting better, but it's still troubling.
For example, I just drained, rebooted, then uncordoned dse-k8s-worker100[5-6]

When I uncordoned dse-k8s-worker1006, the pod latency was still unusually high.

image.png (958×1 px, 138 KB)

https://grafana.wikimedia.org/goto/rE-1RBmvg?orgId=1

I'll check back and continue the gradual rolling reboot.

I appreciate the guidance about packet captures. It might still come to that.

> If we need to look more into it, I'd suggest running a packet capture (while filtering out as much as possible, so as not to end up with a multi-gigabyte file) during those 20s of pod creation time, to look at where the extra latency is.

Yeah we already did that. Tbh I'm not sure how much we're gonna be able to glean from another packet capture. Definitely an odd issue, and confined to the rows with the Nokia switches (though not just row C where we had T409800). But I'm not convinced there is an ongoing network problem, given we haven't been able to find any fault in tests and the pcap showed the network seemed to be working well.

I really think the only way we'll work this out is if we can trace the various things that occur during pod creation, determine what exact operations are taking so long and see if there is any debugging we can do on them.
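For example, lining up the event timestamps of one slow pod against the container runtime's view on the node should show which step (scheduling, volume mount, image pull, sandbox creation) is eating the time. Something like this (illustrative; assumes containerd with crictl on the workers):

# Event timeline for one pod, oldest first
kubectl -n <namespace> get events --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp
# Container runtime's view of sandbox/container creation on the node
sudo crictl ps -a | grep <pod-name>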

Happy to do more network tests / packet captures but I fear we may have exhausted that avenue.

Tentatively closing this issue. I've checked out the pod worker latencies on all nodes and they now seem to be relatively stable.