ferm sometimes fails to restart on Kubernetes workers because kube-proxy holds the xtables lock
Closed, Resolved, Public

Description

On Kubernetes workers, ferm sometimes fails to restart. These restarts are triggered, for example, by Puppet when a central Ferm macro is updated. One example:

Jan 03 19:30:08 mw1465 systemd[1]: Stopped ferm firewall configuration.
Jan 03 19:30:08 mw1465 systemd[1]: Starting ferm firewall configuration...
Jan 03 19:30:08 mw1465 ferm[1868235]: Starting Firewall: ferm
Jan 03 19:30:08 mw1465 ferm[1868274]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Jan 03 19:30:08 mw1465 ferm[1868238]: Failed to run /usr/sbin/iptables-legacy-restore
Jan 03 19:30:08 mw1465 ferm[1868238]: Firewall rules rolled back.
Jan 03 19:30:08 mw1465 ferm[1868281]:  failed!
Jan 03 19:30:08 mw1465 systemd[1]: ferm.service: Main process exited, code=exited, status=1/FAILURE
Jan 03 19:30:08 mw1465 systemd[1]: ferm.service: Failed with result 'exit-code'.
Jan 03 19:30:08 mw1465 systemd[1]: Failed to start ferm firewall configuration.
Jan 08 15:18:08 mw1465 systemd[1]: Starting ferm firewall configuration...

These failures do not recover automatically on the subsequent Puppet run, apparently because ferm-status does not detect this error condition.

We could explore whether there is a way to pass -w to iptables-save/iptables-restore via Ferm. From a quick look, no such option exists, but this needs a closer look at the sources.
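
To illustrate the -w idea outside of Ferm, here is a minimal sketch of calling the restore binary with a lock wait. The helper names (build_restore_command, restore_rules) and the 5-second default are hypothetical, and the sketch assumes the installed iptables is recent enough for iptables-restore to accept -w with a number of seconds. It is not how Ferm invokes the binary today.

```python
import subprocess
from typing import List, Optional

# Path taken from the log excerpt in this task.
RESTORE_BINARY = "/usr/sbin/iptables-legacy-restore"


def build_restore_command(
    binary: str = RESTORE_BINARY, wait_seconds: Optional[int] = 5
) -> List[str]:
    """Build the restore command, optionally waiting for the xtables lock.

    With wait_seconds set, "-w <seconds>" asks the binary to wait for the
    lock instead of failing immediately when another process (for example
    kube-proxy) holds it.
    """
    cmd = [binary]
    if wait_seconds is not None:
        cmd += ["-w", str(wait_seconds)]
    return cmd


def restore_rules(rules: str, wait_seconds: Optional[int] = 5) -> bool:
    """Feed a ruleset to the restore binary; return True on success.

    Hypothetical helper: requires root and a real ruleset to be useful.
    """
    result = subprocess.run(
        build_restore_command(wait_seconds=wait_seconds),
        input=rules,
        text=True,
        capture_output=True,
        check=False,
    )
    return result.returncode == 0
```

Whether Ferm can be made to pass such an option through is exactly the open question above; the sketch only shows what the resulting invocation would look like.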

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-01-23T12:17:11Z] <claime> Restarting ferm.service on k8s node mw1495.eqiad.wmnet - T354855

Mentioned in SAL (#wikimedia-operations) [2024-01-25T11:26:43Z] <claime> Restarting ferm.service on k8s node kubernetes2036.codfw.wmnet - T354855

Mentioned in SAL (#wikimedia-operations) [2024-01-29T13:26:33Z] <claime> Restarting ferm.service on k8s node kubernetes2055 - T354855

Mentioned in SAL (#wikimedia-operations) [2024-02-02T12:16:00Z] <claime> Restarting ferm.service on k8s node mw1424 - T354855

Mentioned in SAL (#wikimedia-operations) [2024-02-21T14:08:25Z] <claime> restarted ferm.service on kubernetes2055.codfw.wmnet mw2440.codfw.wmnet mw2297.codfw.wmnet kubernetes2016.codfw.wmnet - T354855

Change 1005978 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] ferm: Check ferm.service status in ferm_status.py

https://gerrit.wikimedia.org/r/1005978

Mentioned in SAL (#wikimedia-operations) [2024-03-04T11:47:54Z] <claime> Disabling puppet on C:profile::firewall::log::ferm to deploy 1005978 - T354855

Mentioned in SAL (#wikimedia-operations) [2024-03-04T11:48:56Z] <claime> Disregard previous puppet disable message, waiting a bit T354855

Mentioned in SAL (#wikimedia-operations) [2024-03-04T12:22:51Z] <claime> Disabling puppet on C:profile::firewall::log::ferm to deploy new ferm_status.py - T354855

Change 1005978 merged by Clément Goubert:

[operations/puppet@production] ferm: Check ferm.service status in ferm_status.py

https://gerrit.wikimedia.org/r/1005978

Mentioned in SAL (#wikimedia-operations) [2024-03-04T12:28:32Z] <claime> Enabling puppet on kubernetes2019 to test new ferm_status.py - T354855

Mentioned in SAL (#wikimedia-operations) [2024-03-04T12:30:53Z] <claime> Enabling puppet on mw2322 to test new ferm_status.py - T354855

Mentioned in SAL (#wikimedia-operations) [2024-03-04T12:33:11Z] <claime> Enabling puppet on puppetboard2003 to test new ferm_status.py - T354855

Mentioned in SAL (#wikimedia-operations) [2024-03-04T12:38:06Z] <claime> Re-enabling puppet on C:profile::firewall::log::ferm to deploy new ferm_status.py - T354855

Clement_Goubert claimed this task.
Clement_Goubert subscribed.

Deployed. Puppet now restarts ferm.service if the systemd unit's status is failed.
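
For readers without access to the patch, the kind of check described here can be sketched roughly as follows. This is not the contents of ferm_status.py: the function name unit_is_failed and the injectable run parameter are assumptions made for illustration. It relies only on `systemctl is-failed`, which exits with status 0 when the unit is in the failed state.

```python
import subprocess
from typing import Callable


def unit_is_failed(unit: str, run: Callable = subprocess.run) -> bool:
    """Return True if systemd reports the unit as failed.

    `systemctl is-failed --quiet UNIT` exits 0 when the unit is failed
    and non-zero otherwise. The `run` parameter is hypothetical and only
    exists so the check can be exercised without systemd.
    """
    result = run(["systemctl", "is-failed", "--quiet", unit], check=False)
    return result.returncode == 0


if __name__ == "__main__":
    # Example: report whether ferm.service needs a restart.
    if unit_is_failed("ferm.service"):
        print("ferm.service is failed, restart needed")
    else:
        print("ferm.service is not failed")
```

A check like this catches the case from the description, where the rules were rolled back and the unit ended in the failed state.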