Race condition in iptables rules during puppet runs on k8s nodes
Open, Low, Public

Description

As part of our work on T372878, kubernetes nodes have been changing IP addresses. Each change has triggered updates to the ferm rules on all nodes of our cluster.

While looking into T374025, it was noted that many memcached errors observed by mediawiki were occurring during puppet runs that included changes to ferm rules.

After discussing with @JMeybohm, we concluded that there is a brief window of time during which the following happens:

  • Ferm has recreated all iptables rules
  • Calico realises that the current iptables rules are not what it expected
  • Calico applies the missing rules
  • Errors stop
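
The window described above could be observed by comparing two iptables-save dumps, one taken before and one right after a ferm run. The following is only a minimal sketch: it assumes that Calico-managed chains carry the cali- prefix, and the function names are hypothetical and not part of any existing tooling.

```python
def calico_chains(iptables_save_output: str) -> set[str]:
    """Return the names of Calico-managed chains declared in an iptables-save dump."""
    chains = set()
    for line in iptables_save_output.splitlines():
        # Chain declarations look like ":cali-INPUT - [0:0]".
        # Assumption: Calico prefixes the chains it manages with "cali-".
        if line.startswith(":cali-"):
            chains.add(line[1:].split()[0])
    return chains


def missing_calico_chains(before: str, after: str) -> set[str]:
    """Return the chains that were present before a ferm run but are absent after it."""
    return calico_chains(before) - calico_chains(after)
```

A non-empty result for dumps taken around a puppet run would indicate that the node is inside the window where Calico has not yet re-applied its rules.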

Calico logs from such an occurrence can be found here: https://logstash.wikimedia.org/goto/95a27dfbc3a90a960c53a20d4ade76bf

I assume that similar "connectivity" errors may be observed from other applications running on k8s.

Part of the problem would probably go away with T365687, but not all of it.

Related Objects

Status     Subtype           Assigned           Task
Resolved                     jijiki
Open                         None
Resolved                     akosiaris
Resolved                     Jhancock.wm
Resolved                     None
Resolved                     Jhancock.wm
Duplicate                    None
Duplicate                    None
Resolved                     Jhancock.wm
Duplicate                    None
Duplicate                    None
Resolved                     MoritzMuehlenhoff
Resolved                     Jhancock.wm
Invalid                      None
Resolved   PRODUCTION ERROR  Clement_Goubert
Resolved                     JMeybohm
Resolved                     Jhancock.wm
Resolved                     Jhancock.wm
Resolved                     Jhancock.wm
Resolved                     Jhancock.wm
Resolved                     Jhancock.wm
Resolved                     Jhancock.wm
Resolved                     None
Open                         None

Event Timeline

@akosiaris and I discussed this, and completely replacing ferm with Calico HostEndpoint definitions seems pretty complex. Instead, we'll try to reduce ferm reloads as much as possible, which is probably sufficient (especially in the usual case, where we are not changing the IPs of 150 wikikube workers within a week).

  • Set profile::firewall::defs_from_etcd: false for kubernetes workers to disable requestctl defs (which are not required and have no effect on k8s workers anyway).
  • Don't restart ferm on changes to definitions (/etc/ferm/conf.d/00_defs) that are not used. There is code in place that should do this, but it looks like it is not working (correctly).
  • T365687: Improve calico-typha firewall rules

This is what I currently see when puppet is fixing a manual change to 00_defs (/usr/local/sbin/ferm-status returning 0):

2024-09-16T09:03:36.789068+00:00 kubestage2001 puppet-agent[2049128]: Enabling Puppet.
2024-09-16T09:03:38.357384+00:00 kubestage2001 puppet-agent[2049131]: Using environment 'production'
2024-09-16T09:03:38.449319+00:00 kubestage2001 puppet-agent[2049131]: Retrieving pluginfacts
2024-09-16T09:03:38.511768+00:00 kubestage2001 puppet-agent[2049131]: Retrieving plugin
2024-09-16T09:03:39.115746+00:00 kubestage2001 puppet-agent[2049131]: Loading facts
2024-09-16T09:04:10.592373+00:00 kubestage2001 puppet-agent[2049131]: Caching catalog for kubestage2001.codfw.wmnet
2024-09-16T09:04:11.369212+00:00 kubestage2001 puppet-agent[2049131]: Applying configuration version '(75e0f92a77) Volans - test-cookbook: read spicerack config with sudo'
2024-09-16T09:04:14.955591+00:00 kubestage2001 puppet-agent[2049131]: (/Stage[main]/Profile::Firewall/Exec[bump nf_conntrack hash table size]/returns) executed successfully (corrective)
2024-09-16T09:04:17.858695+00:00 kubestage2001 puppet-agent[2049131]: (/Stage[main]/Profile::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]/content)
2024-09-16T09:04:17.858876+00:00 kubestage2001 puppet-agent[2049131]: (/Stage[main]/Profile::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]/content) --- /etc/ferm/conf.d/00_defs#0112024-09-16 09:03:18.860670315 +0000
2024-09-16T09:04:17.858914+00:00 kubestage2001 puppet-agent[2049131]: (/Stage[main]/Profile::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]/content) +++ /tmp/puppet-file20240916-2049131-5jwrf9#0112024-09-16 09:04:17.849199136 +0000
2024-09-16T09:04:17.858945+00:00 kubestage2001 puppet-agent[2049131]: (/Stage[main]/Profile::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]/content) @@ -1,4 +1,3 @@
2024-09-16T09:04:17.858972+00:00 kubestage2001 puppet-agent[2049131]: (/Stage[main]/Profile::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]/content) -@def $FOOOONTERNAL = (10.0.0.0/8 2620:0:860:100::/56 2620:0:861:100::/56 2620:0:863:100::/56 2001:df2:e500:100::/56 2a02:ec80:300:100::/56 2a02:ec80:600:100::/56 2a02:ec80:700:100::/56 2a02:ec80:ff00:100::/56);
2024-09-16T09:04:17.859114+00:00 kubestage2001 puppet-agent[2049131]: (/Stage[main]/Profile::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]/content)
2024-09-16T09:04:17.859158+00:00 kubestage2001 puppet-agent[2049131]: (/Stage[main]/Profile::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]/content)  @def $INTERNAL = (10.0.0.0/8 2620:0:860:100::/56 2620:0:861:100::/56 2620:0:863:100::/56 2001:df2:e500:100::/56 2a02:ec80:300:100::/56 2a02:ec80:600:100::/56 2a02:ec80:700:100::/56 2a02:ec80:ff00:100::/56);
2024-09-16T09:04:17.859190+00:00 kubestage2001 puppet-agent[2049131]: (/Stage[main]/Profile::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]/content)  # $DOMAIN_NETWORKS is a set of all networks belonging to a domain.
2024-09-16T09:04:17.868950+00:00 kubestage2001 puppet-agent[2049131]: (/Stage[main]/Profile::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]/content) content changed '{sha256}8315a90b805d9eb0ef1ddaa0488f5892f2622c482c3cb0a2df85c9328fd70f19' to '{sha256}9a6ce0c530f910512052853c1b75d03488b4343421b69b5960f701f4ac67ec57' (corrective)
2024-09-16T09:04:17.869243+00:00 kubestage2001 puppet-agent[2049131]: (/Stage[main]/Profile::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]) Scheduling refresh of Service[ferm]
2024-09-16T09:04:25.974716+00:00 kubestage2001 systemd[1]: Stopping ferm.service - ferm firewall configuration...
2024-09-16T09:04:26.128912+00:00 kubestage2001 ferm[2051373]: Stopping Firewall: ferm.
2024-09-16T09:04:26.129176+00:00 kubestage2001 systemd[1]: ferm.service: Deactivated successfully.
2024-09-16T09:04:26.129495+00:00 kubestage2001 systemd[1]: Stopped ferm.service - ferm firewall configuration.
2024-09-16T09:04:26.154240+00:00 kubestage2001 systemd[1]: Starting ferm.service - ferm firewall configuration...
2024-09-16T09:04:26.303735+00:00 kubestage2001 ferm[2051383]: Starting Firewall: ferm.
2024-09-16T09:04:26.303958+00:00 kubestage2001 systemd[1]: Finished ferm.service - ferm firewall configuration.
2024-09-16T09:04:26.305640+00:00 kubestage2001 puppet-agent[2049131]: (/Stage[main]/Ferm/Service[ferm]) Triggered 'refresh' from 1 event
2024-09-16T09:04:27.785289+00:00 kubestage2001 puppet-agent[2049131]: Applied catalog in 16.87 seconds

I think the problem is that, if ferm-status returns 0, puppet calls systemctl restart ferm:

2024-09-16T09:41:23.651078+00:00 kubestage2001 puppet-agent[2086315]: Executing: '/usr/local/sbin/ferm-status'
2024-09-16T09:41:24.002711+00:00 kubestage2001 puppet-agent[2086315]: Executing: '/usr/local/sbin/ferm-status'
2024-09-16T09:41:24.352844+00:00 kubestage2001 puppet-agent[2086315]: Executing: '/usr/bin/systemctl show --property=NeedDaemonReload -- ferm'
2024-09-16T09:41:24.367174+00:00 kubestage2001 puppet-agent[2086315]: Executing: '/usr/bin/systemctl restart -- ferm'
2024-09-16T09:41:24.414488+00:00 kubestage2001 systemd[1]: Stopping ferm.service - ferm firewall configuration...
2024-09-16T09:41:24.570567+00:00 kubestage2001 ferm[2088381]: Stopping Firewall: ferm.
2024-09-16T09:41:24.570772+00:00 kubestage2001 systemd[1]: ferm.service: Deactivated successfully.
2024-09-16T09:41:24.571013+00:00 kubestage2001 systemd[1]: Stopped ferm.service - ferm firewall configuration.
2024-09-16T09:41:24.594683+00:00 kubestage2001 systemd[1]: Starting ferm.service - ferm firewall configuration...
2024-09-16T09:41:24.747817+00:00 kubestage2001 ferm[2088392]: Starting Firewall: ferm.
2024-09-16T09:41:24.748051+00:00 kubestage2001 systemd[1]: Finished ferm.service - ferm firewall configuration.
2024-09-16T09:41:24.749237+00:00 kubestage2001 puppet-agent[2086315]: (/Service[ferm]) Triggered 'refresh' from 1 event

Change #1073233 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Don't restart(stop,start) ferm on puppet notify, use reload instead

https://gerrit.wikimedia.org/r/1073233

I think this is one piece of the puzzle (reloading ferm instead of restarting it on notify), but I also think that ferm-status does not work as expected, as stated in T374366#10147899:
When the ferm config changes on disk, ferm-status does not (always) detect this and returns 0, even though the change is visible in the output of ferm -nl --domain ip /etc/ferm/ferm.conf (which ferm-status calls to compute the expected state). I tested this by changing the DOMAIN_NETWORKS definition in /etc/ferm/conf.d/00_defs and by changing an accepted port (in /etc/ferm/conf.d/10_ssh_from_cumin_masters, to be precise).

root@kubestage2001:~# ferm -nl --domain ip /etc/ferm/ferm.conf > /tmp/ferm.before
root@kubestage2001:~# vim /etc/ferm/conf.d/00_defs
root@kubestage2001:~# vim /etc/ferm/conf.d/10_ssh_from_cumin_masters 
root@kubestage2001:~# ferm -nl --domain ip /etc/ferm/ferm.conf > /tmp/ferm.after
root@kubestage2001:~# diff -Nur /tmp/ferm.before /tmp/ferm.after
--- /tmp/ferm.before    2024-09-16 14:45:41.108598188 +0000
+++ /tmp/ferm.after     2024-09-16 14:47:34.374513867 +0000
@@ -1,4 +1,4 @@
-# Generated by ferm 2.5.1 (iptables-legacy-save) on Mon Sep 16 14:45:41 2024
+# Generated by ferm 2.5.1 (iptables-legacy-save) on Mon Sep 16 14:47:34 2024
 *filter
 :FORWARD ACCEPT [0:0]
 :INPUT DROP [0:0]
@@ -44,7 +44,7 @@
 -A INPUT --protocol tcp --dport 5473 --source 10.192.32.101 --jump ACCEPT
 -A INPUT --protocol tcp --dport 5473 --source 10.192.48.32 --jump ACCEPT
 -A INPUT --protocol tcp --dport 5473 --source 10.192.8.15 --jump ACCEPT
--A INPUT --protocol tcp --dport 15001 --source 10.128.0.0/24 --jump ACCEPT
+-A INPUT --protocol tcp --dport 15001 --source 111.128.0.0/24 --jump ACCEPT
 -A INPUT --protocol tcp --dport 15001 --source 10.132.0.0/24 --jump ACCEPT
 -A INPUT --protocol tcp --dport 15001 --source 10.136.0.0/24 --jump ACCEPT
 -A INPUT --protocol tcp --dport 15001 --source 10.136.1.0/24 --jump ACCEPT
@@ -200,8 +200,8 @@
 -A INPUT --protocol tcp --dport 22 --source 198.35.26.12 --jump ACCEPT
 -A INPUT --protocol tcp --dport 22 --source 208.80.153.110 --jump ACCEPT
 -A INPUT --protocol tcp --dport 22 --source 208.80.155.110 --jump ACCEPT
--A INPUT --protocol tcp --dport 22 --source 10.64.48.98 --jump ACCEPT
--A INPUT --protocol tcp --dport 22 --source 10.192.32.49 --jump ACCEPT
+-A INPUT --protocol tcp --dport 922 --source 10.64.48.98 --jump ACCEPT
+-A INPUT --protocol tcp --dport 922 --source 10.192.32.49 --jump ACCEPT
 -A INPUT --protocol udp --destination 255.255.255.255 --sport 67 --dport 68 --jump DROP
 -A INPUT --jump NFLOG --match limit --limit 1/second --limit-burst 5 --nflog-prefix [fw-in-drop]
 COMMIT
root@kubestage2001:~# /usr/local/sbin/ferm-status; echo $?
0
root@kubestage2001:~#
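
The comparison that ferm-status is expected to perform can be sketched as follows. This is not the actual ferm_status.py code; it is a minimal illustration that assumes both inputs are dumps in the same format as the ferm -nl output above. The function names normalize and rules_differ are hypothetical.

```python
def normalize(dump: str) -> list[str]:
    """Reduce a rule dump to the lines that describe firewall state."""
    lines = []
    for line in dump.splitlines():
        line = line.strip()
        # Skip blanks and comments such as the "# Generated by ferm ..." header,
        # whose timestamp changes on every run.
        if not line or line.startswith("#"):
            continue
        # Chain declarations such as ":INPUT DROP [0:0]" carry packet/byte
        # counters that are irrelevant for the comparison.
        if line.startswith(":"):
            line = line.split("[")[0].rstrip()
        lines.append(line)
    return lines


def rules_differ(expected: str, active: str) -> bool:
    """Return True when the two dumps describe different rule sets."""
    return normalize(expected) != normalize(active)
```

With the two dumps from the session above, rules_differ would return True because of the changed source network and port, so a check built on it should not exit with 0. Comparing ferm output against a live iptables-save dump would need further normalization (for example, long versus short option names), which this sketch leaves out.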

Change #1073233 merged by JMeybohm:

[operations/puppet@production] Don't restart(stop,start) ferm on puppet notify, use reload instead

https://gerrit.wikimedia.org/r/1073233

With 1073233 merged, ferm is correctly reloaded (not stopped/started) on notify:

2024-09-17T09:23:29.290947+00:00 kubestage2001 puppet-agent[3410499]: (/Stage[main]/Profile::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]) Scheduling refresh of Service[ferm]
2024-09-17T09:23:37.006826+00:00 kubestage2001 systemd[1]: Reloading ferm.service - ferm firewall configuration...
2024-09-17T09:23:37.167555+00:00 kubestage2001 ferm[3412549]: Reloading Firewall configuration....
2024-09-17T09:23:37.167781+00:00 kubestage2001 systemd[1]: Reloaded ferm.service - ferm firewall configuration.
2024-09-17T09:23:37.169263+00:00 kubestage2001 puppet-agent[3410499]: (/Stage[main]/Ferm/Service[ferm]) Triggered 'refresh' from 1 event
2024-09-17T09:23:38.380945+00:00 kubestage2001 puppet-agent[3410499]: Applied catalog in 15.79 seconds

Change #1073760 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Fix ferm_status to actually compare rules

https://gerrit.wikimedia.org/r/1073760

Change #1073859 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] wikikube: Disable requestctl ferm rules and definitions

https://gerrit.wikimedia.org/r/1073859

Change #1073760 merged by JMeybohm:

[operations/puppet@production] Fix ferm_status to actually compare rules

https://gerrit.wikimedia.org/r/1073760

Change #1074113 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] ferm: Allow to specify a different ferm-status command to use

https://gerrit.wikimedia.org/r/1074113

Change #1073859 merged by JMeybohm:

[operations/puppet@production] wikikube: Disable requestctl ferm rules and definitions

https://gerrit.wikimedia.org/r/1073859

Change #1074113 merged by JMeybohm:

[operations/puppet@production] ferm: Allow to specify a different ferm-status command to use

https://gerrit.wikimedia.org/r/1074113

Change #1074155 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] profile::firewall: Absent confd config when it is disabled

https://gerrit.wikimedia.org/r/1074155

Change #1074155 merged by JMeybohm:

[operations/puppet@production] profile::firewall: Absent confd config when it is disabled

https://gerrit.wikimedia.org/r/1074155

Change #1074185 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] ferm: Use ferm-status to start ferm on diffs

https://gerrit.wikimedia.org/r/1074185

Fixing ferm_status.py is still not enough. When puppet reverts an on-disk ferm config change (one that has not been applied to iptables) back to the previous state, it still reloads ferm even though ferm-status returns 0. In addition, the confd-related code (requestctl rules) was restarting ferm via systemctl directly, bypassing the puppet service hack completely.

I've now added code to ferm_status.py that enables it to start ferm on its own when a diff is detected. It can now be used as the ExecReload script for the systemd service, so the puppet service hack is no longer required and manual reloads of ferm behave the same as puppet reloads.
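
The control flow described above (touch iptables only when a diff is detected) can be sketched like this. It is an illustration under assumptions, not the merged implementation: reload_if_changed is a hypothetical name, both rule sets are assumed to be normalized strings already, and the action that applies the rules is injected as a callable so the sketch stays independent of the real ferm invocation.

```python
from typing import Callable


def reload_if_changed(expected: str, active: str,
                      apply_rules: Callable[[], None]) -> bool:
    """Apply the firewall rules only when the expected and active rule sets differ.

    Returns True if apply_rules was called, False if the reload was a no-op.
    """
    if expected == active:
        # Nothing changed: leave iptables alone so that rules managed by
        # other components (such as Calico) are not flushed.
        return False
    apply_rules()
    return True
```

Putting this decision into the reload entry point means every caller (puppet, confd, or a manual systemctl reload) goes through the same check.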

Change #1074185 merged by JMeybohm:

[operations/puppet@production] ferm: Use ferm-status to start ferm on diffs

https://gerrit.wikimedia.org/r/1074185

Change #1074371 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] ferm: Fix systemd override to not append ExecReload

https://gerrit.wikimedia.org/r/1074371

Change #1074371 merged by JMeybohm:

[operations/puppet@production] ferm: Fix systemd override to not append ExecReload

https://gerrit.wikimedia.org/r/1074371

Change #1074404 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] ferm: Use ferm-status to restart ferm on wikikube-staging

https://gerrit.wikimedia.org/r/1074404

Change #1074405 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] ferm: Make reload via ferm-status the default

https://gerrit.wikimedia.org/r/1074405

Change #1074404 merged by JMeybohm:

[operations/puppet@production] ferm: Use ferm-status to restart ferm on wikikube-staging

https://gerrit.wikimedia.org/r/1074404

Change #1074405 merged by JMeybohm:

[operations/puppet@production] ferm: Make reload via ferm-status the default

https://gerrit.wikimedia.org/r/1074405

Change #1075841 had a related patch set uploaded (by JMeybohm; author: Muehlenhoff):

[operations/puppet@production] Don't pass ferm_status_restart in firewall class

https://gerrit.wikimedia.org/r/1075841

Change #1075841 merged by Muehlenhoff:

[operations/puppet@production] Don't pass ferm_status_restart in firewall class

https://gerrit.wikimedia.org/r/1075841

I haven't seen any unexpected out-of-sync events from calico since merging this. While the result is still not ideal, I will resolve this task, hoping for a proper solution with nftables support in kube-proxy in the future.

Re-opening, as this is partly the cause of T371881.

JMeybohm lowered the priority of this task from High to Low. (Sep 8 2025, 2:59 PM)
JMeybohm raised the priority of this task from Low to High. (Mar 27 2026, 2:00 PM)

Bumping the priority of this again, since we found another corner case where this bites us in subtle ways.

Change #1272448 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] wikikube: Allow access to typha from DOMAIN_NETWORKS

https://gerrit.wikimedia.org/r/1272448

Change #1272448 merged by JMeybohm:

[operations/puppet@production] wikikube: Allow access to typha from DOMAIN_NETWORKS

https://gerrit.wikimedia.org/r/1272448

JMeybohm lowered the priority of this task from High to Low. (Apr 16 2026, 8:48 AM)
JMeybohm changed the status of subtask T365687: Improve calico-typha firewall rules from Stalled to Open.

We've deployed yet another "improvement" hoping to resolve this. Lowering the priority in good faith.