Context:
- This seems related to the work done in T405562: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs
- We don't think this problem has an impact on our services as of now.
- This issue was discovered while investigating T412003: Airflow-main scheduler loop sometimes slows down markedly (T409924, T409800, T411988)
Since November 24th / 25th we have observed an increasing number of in-use TCP sockets being reported on some eqiad hosts in rows C & D:
- dse-k8s-worker1010.eqiad.wmnet (row-d)
- dse-k8s-worker1013.eqiad.wmnet (row-c)
- dse-k8s-worker1018.eqiad.wmnet (row-d)
- dse-k8s-worker1019.eqiad.wmnet (row-c)
Other dse-k8s-eqiad hosts in rows C & D are unaffected:
- dse-k8s-worker1003.eqiad.wmnet (row-c)
- dse-k8s-worker1004.eqiad.wmnet (row-d)
- dse-k8s-worker1011.eqiad.wmnet (row-c)
On the affected hosts, all of these connections are in a FIN_WAIT or CLOSING state, and all are directed to cephosd hosts.
brouberol@dse-k8s-worker1019:~$ sudo netstat -laputen | grep -Pe "(FIN_WAIT|CLOSING)" | awk '{ print $5 }' | cut -d: -f 1 | sort | uniq -c
2125 10.64.130.13
1549 10.64.131.21
1592 10.64.132.23
1518 10.64.134.12
1695 10.64.135.21
brouberol@dse-k8s-worker1019:~$ sudo netstat -laputen | grep CLOSING |head | awk '{ print $5 }' | cut -d: -f 1 | sort | uniq | xargs -n1 host
13.130.64.10.in-addr.arpa domain name pointer cephosd1001.eqiad.wmnet.
21.131.64.10.in-addr.arpa domain name pointer cephosd1002.eqiad.wmnet.
23.132.64.10.in-addr.arpa domain name pointer cephosd1003.eqiad.wmnet.
12.134.64.10.in-addr.arpa domain name pointer cephosd1004.eqiad.wmnet.
21.135.64.10.in-addr.arpa domain name pointer cephosd1005.eqiad.wmnet.
It is interesting to note that rebooting dse-k8s-worker1010.eqiad.wmnet did not resolve the issue.
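
To compare affected and unaffected hosts more easily, the per-peer tally from the netstat pipelines above could also be broken down by TCP state. The following is only a rough sketch: the function name `count_stuck_peers` is hypothetical, and it assumes the usual `netstat -laputen` column order on Linux (protocol first, foreign address fifth, state sixth), which is the same assumption the `awk '{ print $5 }'` step makes.

```python
from collections import Counter

# States of interest; FIN_WAIT matches both FIN_WAIT1 and FIN_WAIT2,
# like the grep pattern used in the shell pipelines.
STUCK_PREFIXES = ("FIN_WAIT", "CLOSING")


def count_stuck_peers(netstat_lines):
    """Count TCP sockets in FIN_WAIT*/CLOSING per (remote IP, state).

    `netstat_lines` is any iterable of text lines as printed by
    `netstat -laputen` (assumed column layout, see above).
    """
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        # Skip headers, UDP sockets and anything too short to carry a state.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        state = fields[5]
        if not state.startswith(STUCK_PREFIXES):
            continue
        # Strip the port; rsplit keeps IPv6 addresses intact.
        remote_ip = fields[4].rsplit(":", 1)[0]
        counts[(remote_ip, state)] += 1
    return counts
```

One possible way to use it is to pipe `sudo netstat -laputen` into a small script that passes `sys.stdin` to `count_stuck_peers` and prints the sorted result, then run that on both an affected and an unaffected worker.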


