
MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100
Closed, ResolvedPublic

Description

Common information

  • alertname: MaxConntrack
  • cluster: wmcs
  • instance: cloudvirt1067:9100
  • job: node
  • prometheus: ops
  • severity: critical
  • site: eqiad
  • source: prometheus
  • team: wmcs

Firing alerts


Event Timeline

This is again diffscan02 (172.16.3.44), similar to this issue from one year ago: T355222: MaxConntrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1043:9100.

root@cloudvirt1067:~# conntrack -L |grep 172.16.3.44 |wc -l
conntrack v1.4.7 (conntrack-tools): 365091 flow entries have been shown.
283593

Screenshot 2025-07-09 at 15.51.23.png (1×1 px, 480 KB)
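The 95.11% in the alert name is the ratio of tracked entries to nf_conntrack_max. A quick sketch of the arithmetic (the entry count here is back-calculated from the percentage, not a live reading; 524288 was the effective limit at the time):

```shell
# Reproduce the alert percentage: entries / nf_conntrack_max.
# Sample numbers, not live reads from the host.
entries=498663
max=524288
awk -v e="$entries" -v m="$max" 'BEGIN { printf "%.2f%%\n", 100 * e / m }'
# prints 95.11%
```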

fnegri triaged this task as High priority. Jul 9 2025, 2:11 PM

Maybe there are other VMs that are also creating more connections than usual. Top ones (hat tip @dcaro for the Bash one-liner!):

root@cloudvirt1067:~# conntrack -L | grep -o 'src=[^ ]*' | sort | uniq -c | sort -n | tail -n 10
conntrack v1.4.7 (conntrack-tools): 377292 flow entries have been shown.
  10721 src=116.202.109.224 (s1.dsdchosting.gr.)
  11294 src=10.100.2.222 (not found)
  11400 src=172.20.2.3 (cloudnet1006.private.eqiad.wikimedia.cloud.)
  12452 src=10.100.132.90 (not found)
  14368 src=10.100.13.130 (not found)
  14439 src=10.100.13.129 (not found)
  17240 src=172.20.3.20 (cloudvirt1062.private.eqiad.wikimedia.cloud.)
  20285 src=172.16.17.39 (paws-127b-rpchztfjt2jb-node-0.paws.eqiad1.wikimedia.cloud.)
  29063 src=172.20.4.21 (cloudvirt1067.private.eqiad.wikimedia.cloud.)
 283819 src=172.16.3.44 (diffscan02.automation-framework.eqiad1.wikimedia.cloud.)
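The hostname annotations next to each IP above were added by hand; a small sketch of how they could be generated with a reverse lookup (the loop is hypothetical, and the sample lines stand in for the real conntrack pipeline):

```shell
# Annotate "count src=IP" lines with the PTR name, if any.
# `getent hosts` prints nothing when there is no reverse record,
# in which case we fall back to "not found".
printf '%s\n' '  11400 src=172.20.2.3' '  10721 src=116.202.109.224' |
while read -r count src; do
  ip=${src#src=}
  name=$(getent hosts "$ip" | awk '{print $2}')
  echo "$count $src (${name:-not found})"
done
```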

It looks like the pattern is the same as on previous days, but since last night there are an additional ~50K connections that pushed the total just above the alert threshold.

Screenshot 2025-07-09 at 16.49.42.png (1×1 px, 322 KB)

Maybe the PAWS worker listed above?

Actually the traffic increase (about 50k additional connections) seems to match the sum of the values for 172.20.4.21 (cloudvirt1067.private) and 172.20.3.20 (cloudvirt1062.private). This is traffic flowing between those two hosts on port 4789.

Some samples:

root@cloudvirt1067:~# conntrack -L |grep 172.20.3.20 |head -n 10
udp      17 15 src=172.20.4.21 dst=172.20.3.20 sport=36666 dport=4789 [UNREPLIED] src=172.20.3.20 dst=172.20.4.21 sport=4789 dport=36666 mark=0 use=1
udp      17 12 src=172.20.3.20 dst=172.20.4.21 sport=53982 dport=4789 [UNREPLIED] src=172.20.4.21 dst=172.20.3.20 sport=4789 dport=53982 mark=0 use=1
udp      17 5 src=172.20.4.21 dst=172.20.3.20 sport=33432 dport=4789 [UNREPLIED] src=172.20.3.20 dst=172.20.4.21 sport=4789 dport=33432 mark=0 use=1
udp      17 2 src=172.20.4.21 dst=172.20.3.20 sport=48639 dport=4789 [UNREPLIED] src=172.20.3.20 dst=172.20.4.21 sport=4789 dport=48639 mark=0 use=1
udp      17 16 src=172.20.4.21 dst=172.20.3.20 sport=57646 dport=4789 [UNREPLIED] src=172.20.3.20 dst=172.20.4.21 sport=4789 dport=57646 mark=0 use=1
udp      17 26 src=172.20.4.21 dst=172.20.3.20 sport=39935 dport=4789 [UNREPLIED] src=172.20.3.20 dst=172.20.4.21 sport=4789 dport=39935 mark=0 use=1
udp      17 29 src=172.20.3.20 dst=172.20.4.21 sport=58463 dport=4789 [UNREPLIED] src=172.20.4.21 dst=172.20.3.20 sport=4789 dport=58463 mark=0 use=1
udp      17 4 src=172.20.4.21 dst=172.20.3.20 sport=41420 dport=4789 [UNREPLIED] src=172.20.3.20 dst=172.20.4.21 sport=4789 dport=41420 mark=0 use=2
udp      17 27 src=172.20.4.21 dst=172.20.3.20 sport=56892 dport=4789 [UNREPLIED] src=172.20.3.20 dst=172.20.4.21 sport=4789 dport=56892 mark=0 use=1
udp      17 26 src=172.20.3.20 dst=172.20.4.21 sport=51512 dport=4789 [UNREPLIED] src=172.20.4.21 dst=172.20.3.20 sport=4789 dport=51512 mark=0 use=1

This is probably encapsulated VXLAN traffic: https://en.wikipedia.org/wiki/Virtual_Extensible_LAN
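If the port-4789 hypothesis is right, tallying conntrack entries by destination port should show that port dominating. A sketch of that tally, with shortened sample lines standing in for the real `conntrack -L` output:

```shell
# Count conntrack entries per destination port (original direction only;
# the sample lines are trimmed so each carries a single dport field).
printf '%s\n' \
  'udp 17 src=172.20.4.21 dst=172.20.3.20 sport=36666 dport=4789' \
  'udp 17 src=172.20.3.20 dst=172.20.4.21 sport=53982 dport=4789' \
  'tcp  6 src=10.0.0.1    dst=10.0.0.2    sport=51234 dport=443' |
grep -o 'dport=[0-9]*' | sort | uniq -c | sort -rn
# top line: 2 dport=4789
```

Real conntrack output carries a second dport in the reply tuple, so each line would match twice; trimming each line to its first tuple before the grep avoids double-counting.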

fnegri changed the task status from Open to In Progress. Jul 9 2025, 3:49 PM
fnegri claimed this task.

Whatever that is, 50k connections is only about 10% of the limit, so we should focus on what's causing the remaining 90%.

I think it's diffscan02, but to verify that assumption we need to wait until the next spike. The tallest spikes occur every 24 hours, between 00:00 and 01:00 UTC. I started a tmux session on cloudvirt1067 running the following command, which samples the top sources every 10 minutes:

while true; do date; conntrack -L | grep -o "src=[^ ]*" | sort | uniq -c | sort -n | tail -n 10; sleep 600; done

Tomorrow I should be able to verify whether the midnight spike is indeed caused by diffscan02.

<scope creep warning>might be something we can put in prometheus xd </scope creep warning>
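For what it's worth, the table-wide ratio is already exported: node_exporter exposes the standard conntrack metrics, and an alert-style expression could look like the following (whether the actual MaxConntrack rule uses exactly this form is an assumption on my part, and the per-source breakdown above is not available from node_exporter):

```promql
# Fraction of the conntrack table in use, per instance
node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.95
```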

It's indeed diffscan02 (172.16.3.44) causing the spike around 00:00 UTC:

Wed Jul  9 11:57:44 PM UTC 2025
conntrack v1.4.7 (conntrack-tools): 13424 flow entries have been shown.
    510 src=172.16.5.11
    537 src=10.100.132.94
    602 src=10.100.47.65
    641 src=172.20.255.1
    735 src=172.20.2.3
    977 src=172.16.19.232
   1412 src=172.20.4.21
   1522 src=172.16.17.39
   2495 src=172.16.3.246
   4456 src=10.64.149.23
Thu Jul 10 12:07:44 AM UTC 2025
conntrack v1.4.7 (conntrack-tools): 420878 flow entries have been shown.
   6052 src=172.20.2.32
   6063 src=185.15.56.237
   6063 src=185.15.56.252
   6072 src=172.20.3.23
   6078 src=185.15.56.248
   6078 src=185.15.57.25
   6079 src=185.15.56.161
   6079 src=185.15.56.249
   6094 src=185.15.57.21
 406827 src=172.16.3.44

The alert started firing daily on 2025-06-27 because the limit changed:

Screenshot 2025-07-10 at 11.59.40.png (1×1 px, 217 KB)

The value for nf_conntrack_max was increased in T355222: MaxConntrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1043:9100 and again in T373816: Cloud VPS: investigate conntrack table usage on cloudvirt1050. The current value set in modules/profile/manifests/openstack/base/nova/compute/service.pp is 33554432, but if I check /proc/sys/net/netfilter/nf_conntrack_max I see a different value:

root@cloudvirt1067:~# cat /etc/sysctl.d/70-nova_conntrack.conf
# sysctl parameters managed by Puppet.
net.netfilter.nf_conntrack_buckets = 8388608
net.netfilter.nf_conntrack_max = 33554432
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 65

root@cloudvirt1067:~# cat /proc/sys/net/netfilter/nf_conntrack_max
524288
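A quick way to spot this drift is to compare the Puppet-managed value with the live kernel value. A sketch, with sample data standing in for the real files (reading them needs root on the actual host):

```shell
# Compare configured vs live nf_conntrack_max; warn on mismatch.
# Sample values mirror what cloudvirt1067 showed above.
conf='net.netfilter.nf_conntrack_max = 33554432'   # from /etc/sysctl.d/70-nova_conntrack.conf
live=524288                                        # from /proc/sys/net/netfilter/nf_conntrack_max
want=${conf##*= }
[ "$want" = "$live" ] || echo "mismatch: configured=$want live=$live"
# prints: mismatch: configured=33554432 live=524288
```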

Grafana shows the value changed on all cloudvirts, but not at the same time. I think the setting failed to be reapplied after each cloudvirt's latest reboot:

root@cloudvirt1067:~# journalctl -t systemd-sysctl
Jun 26 02:54:47 cloudvirt1067 systemd-sysctl[951]: Couldn't write '8388608' to 'net/netfilter/nf_conntrack_buckets', ignoring: No such file or directory
Jun 26 02:54:47 cloudvirt1067 systemd-sysctl[951]: Couldn't write '33554432' to 'net/netfilter/nf_conntrack_max', ignoring: No such file or directory
Jun 26 02:54:47 cloudvirt1067 systemd-sysctl[951]: Couldn't write '65' to 'net/netfilter/nf_conntrack_tcp_timeout_time_wait', ignoring: No such file or directory

This is probably https://github.com/systemd/systemd/issues/1113
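The failure mode in that issue is ordering: systemd-sysctl runs before the nf_conntrack module is loaded, so the net.netfilter.* keys don't exist yet. A common mitigation is to force the module to load early via systemd-modules-load; a sketch (this path and approach are an assumption, not what Puppet actually does here):

```
# /etc/modules-load.d/nf_conntrack.conf (hypothetical)
# Load nf_conntrack at boot so the net.netfilter.* sysctl keys
# exist by the time systemd-sysctl.service applies /etc/sysctl.d/.
nf_conntrack
```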

The last change to the value was actually in T387179: MaxConntrack Max conntrack at 90.6% on cloudvirt1039:9100 with patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124821 setting the value to 33554432, which is what I see in /etc/sysctl.d/70-nova_conntrack.conf and what I also see in Grafana until the latest reboots.

sysctl --system does fix the issue:

root@cloudvirt1067:~# cat /proc/sys/net/nf_conntrack_max
524288

root@cloudvirt1067:~# sysctl --system

root@cloudvirt1067:~# cat /proc/sys/net/nf_conntrack_max
33554432

So it is probably some form of race condition, similar to T136094: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait, which I think was fixed only for hosts running ferm; cloudvirts don't run ferm.

I created a subtask T399212: nf_conntrack_max is not set at boot in cloudvirts to address the root cause of the alert described by this task.

fnegri closed this task as Resolved. Edited Jul 16 2025, 9:57 AM

The alert stopped firing after I updated the nf_conntrack_max value for cloudvirt1067 in T399212#10992507.

In T399212: nf_conntrack_max is not set at boot in cloudvirts I also fixed the issue where the setting was not applied correctly at boot, so the correct setting should persist after future reboots.