Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait
Closed, Resolved, Public

Description

We're setting the sysctl values "net.netfilter.nf_conntrack_tcp_timeout_time_wait" and "net.netfilter.nf_conntrack_max" in /etc/sysctl.d/70-ferm_conntrack.conf (configured via base::firewall). "net.netfilter.nf_conntrack_max" is reliably set, but on 400 out of approx. 1000 systems with that setting, net.netfilter.nf_conntrack_tcp_timeout_time_wait is at the kernel default value of 120. This affects systems using upstart as well as systems using systemd, and it seems to be caused by a race condition: the outcome depends on whether the nf_conntrack kernel module is loaded before or after the sysctl value is set.
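To see which hosts are affected, the live values under /proc/sys can be compared against the intended ones. The following is only a minimal sketch: the function names and the `proc_root` parameter are made up for illustration, and the expected values are the ones that appear elsewhere in this task (65 and 262144); adjust them per host.

```python
from pathlib import Path

# Example values as they appear in this task; not authoritative for every host.
EXPECTED = {
    "net.netfilter.nf_conntrack_max": "262144",
    "net.netfilter.nf_conntrack_tcp_timeout_time_wait": "65",
}


def sysctl_path(key, proc_root="/proc/sys"):
    """Map a dotted sysctl key to its file below proc_root.

    Assumes no path component contains a dot itself (true for the
    nf_conntrack keys discussed here).
    """
    return Path(proc_root, *key.split("."))


def read_sysctl(key, proc_root="/proc/sys"):
    """Return the current value as a string, or None if the key does not
    exist (for example because nf_conntrack is not loaded)."""
    try:
        return sysctl_path(key, proc_root).read_text().strip()
    except FileNotFoundError:
        return None


def find_drift(expected, proc_root="/proc/sys"):
    """Return {key: (expected, actual)} for every key whose value differs."""
    drift = {}
    for key, want in expected.items():
        have = read_sysctl(key, proc_root)
        if have != want:
            drift[key] = (want, have)
    return drift
```

On an affected host, `find_drift(EXPECTED)` would be expected to report the time_wait key with an actual value of "120", while an unaffected host returns an empty dict.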

Event Timeline

So the problem occurs whenever /etc/sysctl.d/70-ferm_conntrack.conf is processed before ferm has been started; starting ferm is what loads the nf_conntrack kernel module. Until the kernel module is loaded, the sysctl setting does not exist, so sysctl fails and only prints

sysctl: cannot stat /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_time_wait: No such file or directory

When the kernel module is loaded later, it comes up with the default of 120. On jessie, sysctl settings are applied by systemd-sysctl.service, which processes all sysctl settings from /etc/sysctl.d, so making it depend on ferm.service being up isn't practical.

So it's probably best to set the ferm-related sysctl settings in a script which is run after ferm is started, e.g. by creating a ferm-sysctl.service which depends on ferm.service.
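As a rough illustration of what such a post-ferm step could do, here is a hedged sketch that parses a sysctl.d-style file and writes the values under /proc/sys. It is not the unit or script from this task; `parse_sysctl_conf`, `apply_settings` and the `proc_root` parameter are hypothetical names. In practice, running `sysctl -p /etc/sysctl.d/70-ferm_conntrack.conf` from a unit ordered after ferm.service would achieve the same effect.

```python
from pathlib import Path


def parse_sysctl_conf(text):
    """Parse 'key = value' lines; skip blank lines and '#' or ';' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment, ignore
        settings[key.strip()] = value.strip()
    return settings


def apply_settings(settings, proc_root="/proc/sys"):
    """Write each value to its file below proc_root.

    Returns the list of keys whose file does not exist, which is what
    happens when the owning kernel module is not loaded yet.
    """
    missing = []
    for key, value in settings.items():
        # Keys may be written with dots or slashes as separators.
        path = Path(proc_root, *key.replace("/", ".").split("."))
        if not path.exists():
            missing.append(key)
            continue
        path.write_text(value + "\n")
    return missing
```

Returning the missing keys instead of raising lets the caller decide whether a missing key is fatal; a unit that runs after ferm would probably want to exit non-zero in that case so the failure is visible.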

I've written a separate systemd unit ferm-sysctl.service (which is started after ferm itself), which sets the correct values. After some tests on multatuli this works fine. This still needs to be puppetised. And I need to look into whether the problem also happens on trusty.

Also see T148986, which describes a different boot-time race with ferm.

This also affects trusty hosts. I'll also make the net.netfilter.nf_conntrack_max value configurable via Hiera.

Change 320197 had a related patch set uploaded (by Muehlenhoff):
Load connection tracking sysctl values via a separate systemd unit

https://gerrit.wikimedia.org/r/320197

Change 320590 had a related patch set uploaded (by Muehlenhoff):
Configure connection tracking sysctl settings in ferm

https://gerrit.wikimedia.org/r/320590

Change 320590 abandoned by Muehlenhoff:
Configure connection tracking sysctl settings in ferm

Reason:
That did not work out as expected

https://gerrit.wikimedia.org/r/320590

Change 349193 had a related patch set uploaded (by Muehlenhoff):
[operations/puppet@production] Load nf_conntrack via /etc/modules-load.d/

https://gerrit.wikimedia.org/r/349193

Change 349193 merged by Muehlenhoff:
[operations/puppet@production] Load nf_conntrack via /etc/modules-load.d/

https://gerrit.wikimedia.org/r/349193

Change 349392 had a related patch set uploaded (by Muehlenhoff):
[operations/puppet@production] Load nf_conntrack via /etc/modules-load.d/

https://gerrit.wikimedia.org/r/349392

Change 349392 merged by Muehlenhoff:
[operations/puppet@production] Load nf_conntrack via /etc/modules-load.d/

https://gerrit.wikimedia.org/r/349392

Change 320197 abandoned by Muehlenhoff:
Load connection tracking sysctl values via a separate systemd unit

Reason:
Abandon in favour of https://gerrit.wikimedia.org/r/#/c/319071/ which loads the nf_conntrack module via /etc/modules-load.d

https://gerrit.wikimedia.org/r/320197

That's now fixed by loading the nf_conntrack module via /etc/modules-load.d, which happens before systemd-sysctl.service runs and so removes the race.

Mentioned in SAL (#wikimedia-operations) [2017-04-29T10:50:18Z] <elukey> set sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 to kafka[1018,1020,1022].eqiad.wmnet (was 120 - maybe related to T136094 ?)

Despite what's documented in the sysctl.d(5) manpage, this does not fix the race; kafka1018 was rebooted two hours ago, has nf_conntrack loaded via /etc/modules-load.d/conntrack.conf, but the sysctl value still isn't correctly applied.

The modules-load.d approach mentioned in sysctl.d(5) isn't sufficiently race-free: while systemd-sysctl.service has an "After=systemd-modules-load.service" ordering, systemd-modules-load only initiates the loading of the kernel modules via kmod, but doesn't wait until the modules are loaded. For confirmation I've run both service units in debug mode:

May 29 15:05:54 multatuli systemd-modules-load[244]: apply: /etc/modules-load.d/conntrack.conf
May 29 15:05:54 multatuli systemd-modules-load[244]: load: nf_conntrack
May 29 15:05:54 multatuli systemd-modules-load[244]: Inserted module 'nf_conntrack'
May 29 15:05:54 multatuli systemd-modules-load[244]: apply: /etc/modules-load.d/modules.conf
May 29 15:05:54 multatuli systemd-sysctl[250]: parse: /etc/sysctl.d/10-ubuntu-defaults.conf
May 29 15:05:54 multatuli systemd-sysctl[250]: parse: /etc/sysctl.d/60-wikimedia-base.conf
May 29 15:05:54 multatuli systemd-sysctl[250]: parse: /etc/sysctl.d/70-core_dumps.conf
May 29 15:05:54 multatuli systemd-sysctl[250]: parse: /etc/sysctl.d/70-disable_unprivileged_bpf.conf
May 29 15:05:54 multatuli systemd-sysctl[250]: parse: /etc/sysctl.d/70-ferm_conntrack.conf
May 29 15:05:54 multatuli systemd-sysctl[250]: Setting 'fs/protected_hardlinks' to '1'
(..)
May 29 15:05:54 multatuli systemd-sysctl[250]: Setting 'net/netfilter/nf_conntrack_max' to '262144'
May 29 15:05:54 multatuli systemd-sysctl[250]: Setting 'net/netfilter/nf_conntrack_tcp_timeout_time_wait' to '65'
May 29 15:05:54 multatuli systemd-sysctl[250]: Failed to write '65' to '/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_time_wait': No such file or directory
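The log shows the write failing because the key's file is not there yet at the moment systemd-sysctl runs. One way to tolerate that is to poll for the file before writing. The sketch below is an assumption-laden illustration, not the fix adopted in this task: `wait_and_set`, its timeout values and the `proc_root` parameter are all hypothetical.

```python
import time
from pathlib import Path


def wait_and_set(key, value, proc_root="/proc/sys", timeout=10.0, interval=0.1):
    """Poll until the sysctl key's file exists, then write the value.

    Returns True once the value has been written, or False if the file
    did not appear within `timeout` seconds.
    """
    path = Path(proc_root, *key.split("."))
    deadline = time.monotonic() + timeout
    while not path.exists():
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    path.write_text(f"{value}\n")
    return True
```

A call such as `wait_and_set("net.netfilter.nf_conntrack_tcp_timeout_time_wait", 65)` would block for at most ten seconds. Polling is a workaround rather than a proper ordering guarantee, so it only makes sense where the module is known to be on its way.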

Mentioned in SAL (#wikimedia-operations) [2017-08-07T09:06:33Z] <elukey> set net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 (was 120) on all the analytics kafka brokers - T136094

Mentioned in SAL (#wikimedia-operations) [2017-10-24T15:39:16Z] <elukey> set net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 to mw[1308-1311] - T136094

@MoritzMuehlenhoff: All patches in Gerrit have been merged or abandoned. Is there more to do in this task? Asking as you are set as task assignee. (You can change the task status via Add Action...Change Status in the dropdown menu.)

I think this bug can be closed in fact.