
Socket leaking on some dse-k8s row C & D hosts
Open, Medium, Public

Description

Context:

Since November 24th / 25th we have observed an increasing number of TCP in-use sockets being reported on eqiad hosts in rows C & D:

Other dse-k8s-eqiad hosts in rows C & D are unaffected:

All the connections are in FIN_WAIT or CLOSING status, and all are directed to cephosd hosts.

brouberol@dse-k8s-worker1019:~$ sudo netstat -laputen | grep -Pe "(FIN_WAIT|CLOSING)"  | awk '{ print $5 }' | cut -d: -f 1 | sort | uniq  -c
   2125 10.64.130.13
   1549 10.64.131.21
   1592 10.64.132.23
   1518 10.64.134.12
   1695 10.64.135.21
brouberol@dse-k8s-worker1019:~$ sudo netstat -laputen | grep CLOSING |head  | awk '{ print $5 }' | cut -d: -f 1 | sort | uniq | xargs -n1 host
13.130.64.10.in-addr.arpa domain name pointer cephosd1001.eqiad.wmnet.
21.131.64.10.in-addr.arpa domain name pointer cephosd1002.eqiad.wmnet.
23.132.64.10.in-addr.arpa domain name pointer cephosd1003.eqiad.wmnet.
12.134.64.10.in-addr.arpa domain name pointer cephosd1004.eqiad.wmnet.
21.135.64.10.in-addr.arpa domain name pointer cephosd1005.eqiad.wmnet.

It is interesting to note that for host dse-k8s-worker1010.eqiad.wmnet a reboot has not solved the issue.

image.png (1×2 px, 171 KB)

Event Timeline

Huh yeah this is quite odd alright.

Taking dse-k8s-worker1011 and dse-k8s-worker1013 as two example hosts to test, as they are both in rack C5 on the same vlan. They were both moved on the evening of November 24th to the new switches.

Looking at the number of TCP retransmits on dse-k8s-worker1013 it does seem to have had a marked increase, whereas dse-k8s-worker1011 does not:

https://grafana.wikimedia.org/goto/O8qzk64vg?orgId=1

I also see packet loss from 1013 to a test ceph host, but not from 1011:

cmooney@dse-k8s-worker1013:~$ mtr -4 -b -w -c 1000 cephosd1001.eqiad.wmnet
Start: 2026-01-13T18:27:00+0000
HOST: dse-k8s-worker1013                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (10.64.32.2)   0.8%  1000    0.3   0.8   0.2  39.2   3.2
  2.|-- et-0-0-31-100.ssw1-e1-eqiad.eqiad.wmnet (10.66.0.9)  0.4%  1000    7.2   5.9   0.5  73.7   7.2
  3.|-- cephosd1001.eqiad.wmnet (10.64.130.13)               0.9%  1000    0.3   0.1   0.1   3.2   0.2
cmooney@dse-k8s-worker1011:~$ mtr -4 -b -w -c 1000 cephosd1001.eqiad.wmnet
Start: 2026-01-13T18:42:21+0000
HOST: dse-k8s-worker1011                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (10.64.32.2)   0.0%  1000    0.5   0.7   0.2  47.4   3.1
  2.|-- et-0-0-31-100.ssw1-e1-eqiad.eqiad.wmnet (10.66.0.9)  0.0%  1000    4.5   5.5   0.5  60.6   5.9
  3.|-- cephosd1001.eqiad.wmnet (10.64.130.13)               0.0%  1000    0.2   0.1   0.1   4.4   0.3

Looking at the specific uplinks I see no reported errors on the host interfaces themselves, the switch ports they are connected to, or on the links between lsw1-c5-eqiad and the spine switches, nor from the spine switches to the CR routers. So I'm somewhat at a loss to account for this.

It is kind of clutching at straws, but I think it might be worthwhile to replace the DAC cable connecting dse-k8s-worker1013 to lsw1-c5-eqiad, just to see if that brings any improvement. Perhaps the new switch is less happy with this module than the old one, or simply the act of moving it caused some wear and tear that is causing this.

I took a quick look at the state of sockets on dse-k8s-worker1010, since FIN_WAIT_1 is not supposed to stick around for longer than a minute or two. Increasingly-complicated ss flags showed that there's a busy: field available, reporting a number of milliseconds. This lets us do some dating on the sockets that are still around (each row below is a count of surviving sockets per estimated open date):

1 1745 2025-12-04
2 846 2026-01-02
3 615 2025-12-03
4 579 2025-12-05
5 504 2026-01-03
6 421 2025-12-11
7 419 2025-12-17
8 408 2025-12-10
9 368 2025-12-16
10 345 2025-12-21
11 340 2025-12-07
12 338 2025-12-12
13 283 2026-01-05
14 267 2025-12-19
15 262 2025-12-24
16 250 2025-12-28
17 250 2025-12-18
18 247 2026-01-01
19 232 2025-12-29
20 223 2025-12-30
21 222 2025-12-31
22 214 2025-12-25
23 206 2025-12-15
24 205 2026-01-04
25 204 2025-12-08
26 201 2025-12-22
27 193 2025-12-26
28 191 2025-12-23
29 189 2025-12-14
30 177 2025-12-09
31 173 2026-01-06
32 171 2025-12-01
33 160 2025-12-02
34 159 2026-01-10
35 157 2026-01-12
36 157 2026-01-09
37 155 2026-01-11
38 151 2026-01-07
39 147 2026-01-08
40 125 2025-12-13
41 112 2026-01-13
42 107 2025-12-20
43 93 2025-12-27
44 28 2025-12-06

An hourly breakdown is available in P87477 and a no-good very horrible methodology is available at P87479.
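The gist of that dating trick can be sketched roughly like this. This is an illustrative Python sketch, not the actual contents of P87479; it only assumes the busy:&lt;N&gt;ms field shown by `ss -OiTtpor`:

```python
# Illustrative sketch: estimate when a lingering socket was opened by
# subtracting the busy:<N>ms value from the current time, then count
# surviving sockets per estimated date.
import re
from datetime import datetime, timedelta
from collections import Counter

BUSY_RE = re.compile(r"busy:(\d+)ms")

def socket_open_date(ss_line, now):
    """Return YYYY-MM-DD for when this socket last started being 'busy',
    or None if the line has no busy: field."""
    m = BUSY_RE.search(ss_line)
    if m is None:
        return None
    return (now - timedelta(milliseconds=int(m.group(1)))).strftime("%Y-%m-%d")

def histogram(ss_lines, now):
    """Count surviving sockets per estimated open date, busiest first."""
    return Counter(
        d for d in (socket_open_date(l, now) for l in ss_lines) if d
    ).most_common()
```

Fed the output of `sudo ss -OiTtpor state fin-wait-1`, this produces a table like the one above.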

The spike a few days after the start of the month is interesting -- it almost feels like these correlate with when there's lots of activity on the cluster?

Ceph clients are in-kernel, right?

Sockets aren't supposed to hang around in FIN_WAIT_1 forever (and I think the userspace API doesn't even make that possible?) -- but this kernel commit describes a similar 'leak' of FIN_WAIT_1 sockets from within the in-kernel CIFS client. Maybe something similar is happening here?

(Oh, as one small bit of corroboration, there were no timers listed in the ss output for the sockets in this state, whereas there were for the ESTAB ones currently open towards cephosd*)
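That timer check could be automated along these lines (a hedged sketch; it simply looks for the timer:(...) field that `ss -o` prints for sockets with a pending kernel timer):

```python
# Sketch: flag FIN-WAIT-1 sockets in `ss -tno` output that have no
# timer:(...) field, i.e. sockets where nothing is scheduled to fire
# to retransmit the FIN or tear the connection down.
def has_pending_timer(ss_line):
    """True if ss printed a timer:(...) field for this socket."""
    return "timer:(" in ss_line

def stuck_sockets(ss_lines, state="FIN-WAIT-1"):
    """Lines in the given state with no kernel timer pending."""
    return [l for l in ss_lines if state in l and not has_pending_timer(l)]
```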

Thanks @CDanis, yeah in terms of the TCP state machine I wasn't quite sure how the apparent packet loss translated to the increase in those.

But that is indeed a bit crazy. I should have looked; I assumed they were all "recent". There are almost 2,000 sockets in FIN_WAIT state since Dec 4th? I'm really confused now.

The spike a few days after the start of the month is interesting -- it almost feels like these correlate with when there's lots of activity on the cluster?

The correlation is not really strong: activity peaks on the cluster right at the beginning of the month (1st) and then decreases. See this pods per rack chart.

cmooney added a subscriber: VRiley-WMF.

@VRiley-WMF I'll ping you on irc but we want to go ahead and replace the DAC on dse-k8s-worker1013 in rack C5 when you are on-site thanks. The node is drained so we can go ahead whenever you are ready.

Hmm, so I was going to see if there was any difference if I did a trace to the ceph node from this host on a different vlan. However, to verify, a short time ago I re-ran the same mtr as I did yesterday (now that the host is cordoned off), and it seems the loss is gone.

cmooney@dse-k8s-worker1013:~$ mtr -4 -b -w -c 1000 cephosd1001.eqiad.wmnet
Start: 2026-01-14T11:45:35+0000
HOST: dse-k8s-worker1013                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (10.64.32.2)   0.0%  1000    0.4   1.1   0.3  42.6   4.0
  2.|-- et-0-0-31-100.ssw1-e1-eqiad.eqiad.wmnet (10.66.0.9)  0.0%  1000   54.2   6.0   0.7  96.9   7.9
  3.|-- cephosd1001.eqiad.wmnet (10.64.130.13)               0.0%  1000    0.3   0.2   0.1   2.1   0.1

So I'm not sure how much this will tell us. It might make sense to uncordon the host and re-test to see if the loss returns.

Also @VRiley-WMF it seems this is actually a 1G RJ45 link. So let's swap the copper SFP-T in the switch first, as that is most likely to be the problem if it's something local to the link.

Hmm, so with the node un-cordoned the loss has not returned either; well, one drop at the first hop, but it seems insignificant:

cmooney@dse-k8s-worker1013:~$ mtr -4 -b -w -c 1000 cephosd1001.eqiad.wmnet
Start: 2026-01-14T12:35:28+0000
HOST: dse-k8s-worker1013                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (10.64.32.2)   0.1%  1000    0.5   0.9   0.3  41.6   3.9
  2.|-- et-0-0-31-100.ssw1-e1-eqiad.eqiad.wmnet (10.66.0.9)  0.0%  1000    1.4   6.1   0.5  96.2   8.0
  3.|-- cephosd1001.eqiad.wmnet (10.64.130.13)               0.0%  1000    0.3   0.1   0.1   1.3   0.1

FIN_WAIT_1 is not supposed to stick around for longer than a minute or two.

I noticed all these connections have a send-queue with '1' in it:

cmooney@dse-k8s-worker1013:~$ sudo ss -tulpna  | egrep "FIN-WAIT|^Netid"
Netid State      Recv-Q Send-Q                Local Address:Port                    Peer Address:Port Process                                                     
tcp   FIN-WAIT-1 0      1                       10.64.32.93:41280                   10.64.132.23:6909                                                             
tcp   FIN-WAIT-1 0      1                       10.64.32.93:37596                   10.64.132.23:6955                                                             
tcp   FIN-WAIT-1 0      1                       10.64.32.93:56894                   10.64.132.23:6955                                                             
tcp   FIN-WAIT-1 0      1                       10.64.32.93:55668                   10.64.130.13:6803

They all show "unacked:1" in the more detailed output too:

cmooney@dse-k8s-worker1013:~$ sudo ss -OiTtpor state fin-wait-1 | head -2
Recv-Q Send-Q                  Local Address:Port             Peer Address:PortProcess                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
0      1      dse-k8s-worker1013.eqiad.wmnet:41280 cephosd1003.eqiad.wmnet:6909 ts sack cubic wscale:9,9 rto:208 rtt:5.369/10.393 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:1160 bytes_acked:1161 bytes_received:10882 segs_out:12 segs_in:15 data_segs_out:6 data_segs_in:11 send 21.6Mbps lastsnd:2404661724 lastrcv:2404661936 lastack:2404661684 pacing_rate 43.1Mbps delivery_rate 152Mbps delivered:7 app_limited busy:2404590928ms unacked:1 rcv_space:14480 rcv_ssthresh:57708 minrtt:0.126 snd_wnd:43520
cmooney@dse-k8s-worker1013:~$ sudo ss -OiTtpor state fin-wait-1 | wc -l 
1469
cmooney@dse-k8s-worker1013:~$ sudo ss -OiTtpor state fin-wait-1 | grep -v "unacked:1"
Recv-Q Send-Q                  Local Address:Port             Peer Address:PortProcess                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
cmooney@dse-k8s-worker1013:~$

So to an extent that might make sense. The k8s host sent a FIN to the remote side but due to the packet-loss issue the remote side didn't get it, or it did and the ACK for it wasn't received. Which explains it not moving to FIN-WAIT-2, however surely it should try to resend the FIN, and if this state persists eventually just delete the connection?

however surely it should try to resend the FIN, and if this state persists eventually just delete the connection?

Digging more into what those stats mean: "rto" is the retransmit timeout. This should start decrementing from the time the last packet was sent, and when it hits zero the system should re-send it. But it seems in this case it is stuck; checking a given connection, the rto stays at 208 constantly, while lastsnd continues to increment (indicating no packet has been sent).

So the normal retransmit mechanism seems to be stuck. Normally it should retry until it hits net.ipv4.tcp_retries2 or net.ipv4.tcp_orphan_retries and then remove the connection state. But because it is not retrying these just stay in limbo.
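To put a number on how long these sockets *should* survive if retransmission worked: assuming the usual exponential backoff (RTO doubling per attempt, each attempt capped at TCP_RTO_MAX of 120s) and net.ipv4.tcp_orphan_retries at its default of 0, which the kernel treats as 8 attempts, a rough sketch:

```python
# Back-of-the-envelope sketch (assumptions hedged above): total time until
# the kernel should give up retransmitting and destroy an orphaned socket.
def max_finwait_lifetime(rto_ms, retries=8, rto_max_ms=120_000):
    """Seconds of cumulative retransmit backoff before give-up."""
    total, rto = 0, rto_ms
    for _ in range(retries):
        total += min(rto, rto_max_ms)  # each individual RTO is capped
        rto *= 2                       # exponential backoff
    return total / 1000.0
```

With the observed rto of 208ms this comes out at roughly 53 seconds, so sockets sitting in FIN-WAIT-1 for weeks mean the retransmit timer is simply never firing.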

The switch move/bad SFP-T module may be resulting in slightly higher packet loss than we'd like here, hence this bug is being hit more often since the move, but it's fairly clear there is a bug on the system preventing normal cleanup from happening.

The k8s host sent a FIN to the remote side but due to the packet-loss issue the remote side didn't get it, or it did and the ACK for it wasn't received. Which explains it not moving to FIN-WAIT-2, however surely it should try to resend the FIN, and if this state persists eventually just delete the connection?

That's why I'm suggesting the kernel bug angle here -- the socket got closed in such a way that it deleted the state transition timers, so this never happens.

cmooney removed a subscriber: VRiley-WMF.

The SFP module in port 14 of lsw1-c5-eqiad has been swapped out now. So we can observe over the next while if it makes any difference.

Ok currently seeing no loss (though that was the case when we were cordoned before the swap).

cmooney@dse-k8s-worker1013:~$ mtr -4 -b -w -c 1000 cephosd1001.eqiad.wmnet
Start: 2026-01-14T15:58:29+0000
HOST: dse-k8s-worker1013                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (10.64.32.2)   0.0%  1000    0.4   0.8   0.3  39.9   3.5
  2.|-- et-0-0-31-100.ssw1-e1-eqiad.eqiad.wmnet (10.66.0.9)  0.0%  1000    3.0   5.6   0.6  51.3   6.2
  3.|-- cephosd1001.eqiad.wmnet (10.64.130.13)               0.0%  1000    0.2   0.2   0.1   1.9   0.1
root@dse-k8s-worker1013:~# mtr -4 -b -w -c 1000 cephosd1001.eqiad.wmnet 
Start: 2026-01-14T16:09:31+0000
HOST: dse-k8s-worker1013                                 Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- irb0-1069.lsw1-c5-eqiad.eqiad.wmnet (10.64.171.1)   0.0%  1000    0.2   0.2   0.2   1.6   0.1
  2.|-- ???                                                100.0  1000    0.0   0.0   0.0   0.0   0.0
  3.|-- et-0-0-29.ssw1-f1-eqiad.eqiad.wmnet (10.64.147.11)  0.0%  1000    0.9   5.4   0.6  78.5   5.7
  4.|-- irb-1031.lsw1-e1-eqiad.eqiad.wmnet (10.64.130.1)    0.0%  1000    7.7   5.9   0.5 121.4   7.9
  5.|-- cephosd1001.eqiad.wmnet (10.64.130.13)              0.0%  1000    0.2   0.1   0.1   3.9   0.2

We can probably uncordon and run these again see if the loss starts showing up again.

Host dse-k8s-worker1013.eqiad.wmnet rebooted by brouberol@cumin1003 with reason: Getting a clean slate post networking adapter replacement

Happy to help with this. Let us know if there is anything else we can help with.

Thanks @VRiley. Happy to say we aren't seeing any loss as of yet after the node was uncordoned:

cmooney@dse-k8s-worker1013:~$ mtr -4 -b -w -c 1000 cephosd1001.eqiad.wmnet
Start: 2026-01-14T17:06:56+0000
HOST: dse-k8s-worker1013                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (10.64.32.2)   0.0%  1000    0.3   1.0   0.2  81.4   4.2
  2.|-- et-0-0-31-100.ssw1-e1-eqiad.eqiad.wmnet (10.66.0.9)  0.0%  1000    4.8   5.9   0.6  81.4   7.1
  3.|-- cephosd1001.eqiad.wmnet (10.64.130.13)               0.0%  1000    0.3   0.1   0.1   1.4   0.1

dse-k8s-worker1013 seems fairly happy in terms of the original problem since we made the change yesterday. In the graphs the "Socket: utilization" is fairly steady, and on the host I see zero stuck in the fin-wait-1 state:

cmooney@dse-k8s-worker1013:~$ sudo ss -tpon state fin-wait-1
Recv-Q      Send-Q           Local Address:Port           Peer Address:Port     Process      
cmooney@dse-k8s-worker1013:~$

Still getting 0% packet loss in the MTR too. So it seems our actions have removed the packet-loss, the result being all FIN packets sent to the Ceph hosts make it and are ACKed. Which means we do not trigger the kernel TCP retransmit bug that leaves the connections in a stuck state.

Next steps

I am not entirely happy with the situation though. If - as it seems - the issue is the SFP-T modules in the switches then:

  1. It is very troubling that over 50% of this set of hosts were cabled to bad SFP modules
  2. More worrying is that we apparently have dropped packets, but we have no error counters or other metrics alerting us to the issue

SFPs do occasionally go bad, particularly these non-optical modules. But that is a high failure rate, so I was wondering if maybe there is some other explanation. Might the reboot have cleared something up? Were all the hosts rebooted previously?

We can certainly replace all the other SFPs, that's fine. My wider worry is whether this "invisible" packet loss is also happening on 50%+ of the 1G servers we moved, without us detecting it on those. We only noticed it here because it triggered the bug. So I want to be 100% sure our apparent fix wasn't due to anything else before I go into full panic mode over that.

The k8s host sent a FIN to the remote side but due to the packet-loss issue the remote side didn't get it, or it did and the ACK for it wasn't received. Which explains it not moving to FIN-WAIT-2, however surely it should try to resend the FIN, and if this state persists eventually just delete the connection?

That's why I'm suggesting the kernel bug angle here -- the socket got closed in such a way that it deleted the state transition timers, so this never happens.

OK, I think that this is definitely worth investigating.

We've got two different kernel interfaces to the ceph services, which are loaded on all dse-k8s-workers.

  • ceph - which provides the cephfs filesystem services
  • rbd - which provides the rados block devices
btullis@dse-k8s-worker1013:~$ sudo lsmod|egrep '(ceph|rbd)'
ceph                  667648  6
netfs                 569344  1 ceph
rbd                   131072  0
libceph               540672  2 ceph,rbd
libcrc32c              12288  6 nf_conntrack,nf_nat,raid456,libceph,ip_vs,sctp

My assumption is that this is more likely related to the cephfs interface, than to the rbd interface.
That's because we mainly use the block devices for postgresql, which doesn't have a high pod churn rate like the airflow tasks.

Each Airflow task uses several cephfs volumes:

  • /opt/airflow/dags (ro)
  • /tmp/airflow_krb5_ccache (ro)

Any pods running dumps_v1 also use the following cephfs volume.

  • /mnt/dumpsdata (rw)

I'll have a scan through the cephfs kernel bugs, to see if I can find anything relevant.

My assumption is that this is more likely related to the cephfs interface, than to the rbd interface.
That's because we mainly use the block devices for postgresql, which doesn't have a high pod churn rate like the airflow tasks.

Seems smart -- the trigger for the bug in the CIFS client was the netns being deleted with the in-kernel socket still open.

If a quick scan through the bugs doesn't find anything, we can maybe cook up some eBPF to find stack traces.

I have also made the following ticket regarding upgrading the 1 Gbps network connections: {T414787}

Unfortunately, the problem is not solved, as shown in this grafana graph.

Yeah I was worried we'd see the same pattern as the graph in the task description. After a reboot it's steady for ~24h then it starts to increase again. Which indeed is what dse-k8s-worker1013 shows.

I can also see the loss shown in mtr - which went to zero after the SFP swap / reboot last week - is now evident again:

cmooney@dse-k8s-worker1013:~$ mtr -4 -b -w -c 1000 cephosd1001.eqiad.wmnet
Start: 2026-01-19T15:24:48+0000
HOST: dse-k8s-worker1013                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (10.64.32.2)   2.2%  1000    0.4   0.7   0.2  34.5   2.9
  2.|-- et-0-0-31-100.ssw1-e1-eqiad.eqiad.wmnet (10.66.0.9)  2.3%  1000    6.6   5.7   0.5  72.1   6.6
  3.|-- cephosd1001.eqiad.wmnet (10.64.130.13)               1.8%  1000    0.2   0.1   0.1   1.6   0.1

This could be due to the SFP heating up with usage, with the issue not being present immediately after a swap. But that seems unusual given that multiple modules have been tried, and still zero errors showing. Another 1G host, connected in the same block of ports on the same switch (so if the issue were thermal I'd expect to see similar issues on it), shows no loss even with 10,000 pings:

cmooney@wikikube-worker1137:~$ mtr -b -w -c 10000 -4 cephosd1001.eqiad.wmnet 
Start: 2026-01-19T19:03:56+0000
HOST: wikikube-worker1137                                 Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (10.64.32.2)   0.0% 10000    0.3   0.3   0.2  74.8   3.8
  2.|-- et-0-0-31-100.ssw1-e1-eqiad.eqiad.wmnet (10.66.0.9)  0.0% 10000    3.7   3.7   0.5  95.9   7.8
  3.|-- cephosd1001.eqiad.wmnet (10.64.130.13)               0.0% 10000    0.2   0.1   0.1   5.2   0.2

All-in-all I'm scratching my head as to why we see packet loss from these hosts, after a few days of operation but not immediately on a reboot. It makes me wonder if there is another explanation like something on the host side. I'm not sure if it might be possible to reimage one of these hosts into our basic "insetup" role, and then observe if we also see packet loss after a few days like this? The difference there would be the regular up-to-date kernel would be in use. It does occur to me that whatever is responsible for the kernel TCP-handling bug might also be affecting the rest of the network stack.

I guess the one element that doesn't line up there is the start seemingly corresponding to when they moved switch, but right now I'm clutching at straws to find the network issue.
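If we want ongoing visibility into this "invisible" loss rather than ad-hoc mtr runs, one option is to scrape report output periodically. A minimal sketch, assuming the `mtr -w` report layout pasted above (the 0.5% threshold is arbitrary):

```python
# Sketch: parse per-hop loss out of an `mtr -w` report so it could be
# collected periodically (e.g. from cron) and alerted on.
import re

# Matches lines like:
#   1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (10.64.32.2)   0.8%  1000 ...
HOP_RE = re.compile(
    r"^\s*\d+\.\|--\s+(?P<host>\S+).*?\s(?P<loss>[\d.]+)%\s+(?P<snt>\d+)"
)

def parse_mtr_report(text):
    """Return [(hop_hostname, loss_percent), ...] for an mtr -w report."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append((m.group("host"), float(m.group("loss"))))
    return hops

def lossy_hops(text, threshold=0.5):
    """Hops at or above the loss threshold, for alerting."""
    return [(h, l) for h, l in parse_mtr_report(text) if l >= threshold]
```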

Change #1229074 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Bump the 6.12 backport for Bookworm to 6.12.57

https://gerrit.wikimedia.org/r/1229074

I suggest we first move to the latest 6.12 backport to rule that this isn't a kernel issue already fixed: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1229074

Change #1229074 merged by Muehlenhoff:

[operations/puppet@production] Bump the 6.12 backport for Bookworm to 6.12.57

https://gerrit.wikimedia.org/r/1229074

I suggest we first move to the latest 6.12 backport to rule that this isn't a kernel issue already fixed: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1229074

New kernel is installed on the DSE workers now after forcing a Puppet run: https://debmonitor.wikimedia.org/packages/linux-image-6.12.57+deb12-amd64

I suggest we first move to the latest 6.12 backport to rule that this isn't a kernel issue already fixed: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1229074

New kernel is installed on the DSE workers now after forcing a Puppet run: https://debmonitor.wikimedia.org/packages/linux-image-6.12.57+deb12-amd64

Great, thanks. I will kick off a rolling reboot of the nodes.

Roll-reboot of nodes in dse-eqiad cluster started by btullis:

  • dse-k8s-worker[1001-1019].eqiad.wmnet

Roll-reboot of nodes in dse-eqiad cluster started by btullis:

  • dse-k8s-worker[1002-1019].eqiad.wmnet

Roll-reboot of nodes in dse-eqiad cluster started by btullis:

  • dse-k8s-worker[1006-1019].eqiad.wmnet

Roll-reboot of nodes in dse-eqiad cluster started by btullis:

  • dse-k8s-worker[1006-1019].eqiad.wmnet

Roll-reboot of nodes in dse-eqiad cluster started by btullis:

  • dse-k8s-worker[1007-1019].eqiad.wmnet

Roll-reboot of nodes in dse-eqiad cluster started by btullis:

  • dse-k8s-worker[1008-1019].eqiad.wmnet

Roll-reboot of nodes in dse-eqiad cluster started by btullis:

  • dse-k8s-worker[1008-1019].eqiad.wmnet

Roll-reboot of nodes in dse-eqiad cluster started by btullis:

  • dse-k8s-worker[1008-1019].eqiad.wmnet

Host dse-k8s-worker1008.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to investigate network issue

Roll-reboot of nodes in dse-eqiad cluster started by btullis:

  • dse-k8s-worker[1009-1019].eqiad.wmnet

Roll-reboot of nodes in dse-eqiad cluster started by btullis:

  • dse-k8s-worker[1009-1019].eqiad.wmnet

this time it has worked for all hosts (or at least it seems so from network graphs)

Roll-reboot of nodes in dse-eqiad cluster started by btullis:

  • dse-k8s-worker1006.eqiad.wmnet

Roll-reboot of nodes in dse-eqiad cluster started by btullis:

  • dse-k8s-worker[1009-1019].eqiad.wmnet

this time it has worked for all hosts (or at least it seems so from network graphs)

Only dse-k8s-worker1006 remaining, which I'm doing now.
Unfortunately, the sre.k8s.reboot-nodes cookbook isn't very reliable for our cluster, due to the way we deploy Airflow task pods without a corresponding Deployment object.
I'll create a ticket to see if we can make this more reliable.

So looking at dse-k8s-worker1013 it has now been up for 1 day 18 hours, yet we still see no packet loss (previously based on graphs the socket increase would start pretty much 24h after the reset/reboot, so I think this is a valid time to measure).

cmooney@dse-k8s-worker1013:~$ mtr -4 -b -w -c 1000 cephosd1001.eqiad.wmnet
Start: 2026-01-23T11:42:41+0000
HOST: dse-k8s-worker1013                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (10.64.32.2)   0.0%  1000    0.4   0.7   0.2  38.9   3.0
  2.|-- et-0-0-31-100.ssw1-e1-eqiad.eqiad.wmnet (10.66.0.9)  0.0%  1000    8.1   6.3   0.6 106.9   9.1
  3.|-- cephosd1001.eqiad.wmnet (10.64.130.13)               0.0%  1000    0.2   0.1   0.1   2.4   0.2

So I'm beginning to think the previous loss was potentially due to something on the host, and the kernel change has resolved it. But that said, I cannot explain why we apparently saw an uptick in this when hosts in row C/D were moved, nor why we have not observed anything similar on 1G dse-k8s-worker hosts in rows A and B.

It seems that the dse-k8s-worker1019 still has the problem:

image.png (1×2 px, 160 KB)

With the various investigations that have happened around Airflow, do we now have a better understanding of this specific issue?

cmooney triaged this task as Medium priority. Mon, Feb 9, 4:17 PM

With the various investigations that have happened around Airflow, do we now have a better understanding of this specific issue?

I guess two things jump to mind here. Firstly we still have seen a rising number of open sockets on dse-k8s-worker1019, though it sometimes levels off a little:

image.png (525×997 px, 47 KB)

The obvious question there is: if this host is running the same kernel version as the others, and from what we can tell that version does not have the TCP-stack bug that was causing this, why do we see this pattern?

As regards potential packet loss on the network, which may have resulted in this bug being triggered on other hosts, it seems the loss disappeared on the other hosts when the kernel got upgraded (i.e. we can't reproduce my mtr results from above on them any more). So that suggests the loss wasn't network-side but related to the kernel's receipt of packets. Right now there appears to be no loss from dse-k8s-worker1019 to an example ceph host when I check either:

cmooney@dse-k8s-worker1019:~$ mtr -4 -b -w -c 1000 cephosd1001.eqiad.wmnet
Start: 2026-02-09T15:43:49+0000
HOST: dse-k8s-worker1019                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- et-1-0-5-1019.cr1-eqiad.wikimedia.org (10.64.32.2)   0.0%  1000    0.3   0.8   0.2  39.5   3.7
  2.|-- et-0-0-31-100.ssw1-e1-eqiad.eqiad.wmnet (10.66.0.9)  0.0%  1000    5.5   5.9   0.5  64.5   7.2
  3.|-- cephosd1001.eqiad.wmnet (10.64.130.13)               0.0%  1000    0.2   0.1   0.1   1.8   0.1

So overall I'm not 100% sure the root cause here is the network. What would be good to understand first, I think, is why the kernel change that seemed to fix the bug on the other hosts does not seem to have done it on this one.