A lot of warnings in the past few days:
19:07 <icinga-wm> PROBLEM - Check size of conntrack table on kafka1014 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
19:07 <icinga-wm> PROBLEM - Check size of conntrack table on kafka1012 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
19:08 <icinga-wm> PROBLEM - Check size of conntrack table on kafka1013 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
19:08 <icinga-wm> PROBLEM - Check size of conntrack table on kafka1022 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
19:08 <icinga-wm> PROBLEM - Check size of conntrack table on kafka1018 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
19:09 <icinga-wm> PROBLEM - Check size of conntrack table on kafka1020 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
...
On one kafka host:
elukey@kafka1020:~$ sudo sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 262144
elukey@kafka1020:~$ sudo sysctl net.netfilter.nf_conntrack_count
net.netfilter.nf_conntrack_count = 233136
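The Icinga check is just count/max as a percentage, which can be reproduced by hand. A minimal sketch using the values observed above (on a live host the two numbers would come from sysctl -n instead of being hardcoded):

```shell
#!/bin/sh
# Compute conntrack table utilization. Values are hardcoded from the
# kafka1020 output above; on a live host, fetch them instead with:
#   max=$(sudo sysctl -n net.netfilter.nf_conntrack_max)
#   count=$(sudo sysctl -n net.netfilter.nf_conntrack_count)
max=262144
count=233136
pct=$(( count * 100 / max ))
echo "nf_conntrack is ${pct} % full"
```

With the numbers above this reports 88 % full, i.e. already past the warning threshold and close to the 90 % CRITICAL one.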
Output of less /proc/net/nf_conntrack:
...
ipv4     2 tcp      6 73 TIME_WAIT src=10.64.16.127 dst=10.64.53.12 sport=47835 dport=9092 src=10.64.53.12 dst=10.64.16.127 sport=9092 dport=47835 [ASSURED] mark=0 zone=0 use=2
ipv4     2 tcp      6 51 TIME_WAIT src=10.64.48.33 dst=10.64.53.12 sport=42706 dport=9092 src=10.64.53.12 dst=10.64.48.33 sport=9092 dport=42706 [ASSURED] mark=0 zone=0 use=2
ipv4     2 tcp      6 37 TIME_WAIT src=10.64.32.65 dst=10.64.53.12 sport=38930 dport=9092 src=10.64.53.12 dst=10.64.32.65 sport=9092 dport=38930 [ASSURED] mark=0 zone=0 use=2
ipv4     2 tcp      6 17 TIME_WAIT src=10.64.16.125 dst=10.64.53.12 sport=47311 dport=9092 src=10.64.53.12 dst=10.64.16.125 sport=9092 dport=47311 [ASSURED] mark=0 zone=0 use=2
ipv4     2 tcp      6 16 TIME_WAIT src=10.64.48.62 dst=10.64.53.12 sport=59539 dport=9092 src=10.64.53.12 dst=10.64.48.62 sport=9092 dport=59539 [ASSURED] mark=0 zone=0 use=2
ipv4     2 tcp      6 12 TIME_WAIT src=10.64.48.63 dst=10.64.53.12 sport=38423 dport=9092 src=10.64.53.12 dst=10.64.48.63 sport=9092 dport=38423 [ASSURED] mark=0 zone=0 use=2
ipv4     2 tcp      6 110 TIME_WAIT src=10.64.32.64 dst=10.64.53.12 sport=48972 dport=9092 src=10.64.53.12 dst=10.64.32.64 sport=9092 dport=48972 [ASSURED] mark=0 zone=0 use=2
ipv4     2 tcp      6 69 TIME_WAIT src=10.64.16.96 dst=10.64.53.12 sport=37687 dport=9092 src=10.64.53.12 dst=10.64.16.96 sport=9092 dport=37687 [ASSURED] mark=0 zone=0 use=2
...
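To see which states and destination ports dominate the table, the dump can be summarized with awk. A sketch (the heredoc holds two sample lines from the dump above and stands in for /proc/net/nf_conntrack, which needs root to read on a live host):

```shell
#!/bin/sh
# Summarize conntrack entries per (TCP state, dport) pair.
# On a live host: sudo awk '...' /proc/net/nf_conntrack | sort | uniq -c | sort -rn
# For the /proc/net/nf_conntrack format used here, $3 is the protocol
# and $6 is the TCP state; dport= appears as a key=value field.
awk '$3 == "tcp" {
    state = $6
    for (i = 1; i <= NF; i++)
        if ($i ~ /^dport=/) { split($i, kv, "="); dport = kv[2]; break }
    print state, dport
}' <<'EOF' | sort | uniq -c | sort -rn
ipv4     2 tcp      6 73 TIME_WAIT src=10.64.16.127 dst=10.64.53.12 sport=47835 dport=9092 src=10.64.53.12 dst=10.64.16.127 sport=9092 dport=47835 [ASSURED] mark=0 zone=0 use=2
ipv4     2 tcp      6 51 TIME_WAIT src=10.64.48.33 dst=10.64.53.12 sport=42706 dport=9092 src=10.64.53.12 dst=10.64.48.33 sport=9092 dport=42706 [ASSURED] mark=0 zone=0 use=2
EOF
# prints a count per (state, dport) pair, e.g. "2 TIME_WAIT 9092"
```

Run against the full table, this makes the "mostly TIME_WAIT towards port 9092" pattern immediately visible.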
Most of the sockets in TIME_WAIT are for connections between mwXXXX hosts and Kafka. netstat (-tuap) shows a different picture: a lot of ESTABLISHED connections between Kafka hosts and with cpXXXX (as expected).
This is only affecting the Analytics cluster, not the EventBus one. The risk is dropping packets when nf_conntrack reaches 100% utilization.
A quick solution would be to increase the nf_conntrack_max value with sysctl (like https://gerrit.wikimedia.org/r/#/c/278290/1), but it would also be great to figure out what changed.
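As a sketch, the bump could look like the fragment below (the file name and the 524288 value are illustrative, not necessarily what the linked patch does). Since TIME_WAIT entries dominate the table here, shortening their conntrack timeout is another knob worth considering alongside, or instead of, raising the maximum:

```
# /etc/sysctl.d/60-conntrack.conf  -- illustrative name and values
# Double the table size; each conntrack entry costs roughly a few
# hundred bytes of kernel memory, so this is cheap on these hosts.
net.netfilter.nf_conntrack_max = 524288
# Optionally shorten the TIME_WAIT tracking timeout (kernel default: 120s),
# so short-lived mwXXXX -> kafka connections leave the table sooner.
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 65
```

Applied with `sudo sysctl --system` (or per-key with `sysctl -w`), so no reboot is needed; in our setup this would presumably be managed via puppet rather than edited by hand.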