Puppet doesn't restart ferm on failure
Closed, ResolvedPublic

Description

After the reboot of some eqsin hosts as described in the parent task, ferm failed to start on bast5001 and dns5001 due to a DNS lookup failure. Puppet didn't restart it in any of the subsequent runs, so I had to run sudo systemctl start ferm manually. As a result, the two hosts went ~1h without ferm rules applied.

We should improve this behaviour to minimize the potential exposure of a host left without ferm rules after a reboot, if ferm fails to start for any transient reason.
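To make the desired behaviour concrete, here is a minimal Python sketch of a "start it if it is not running" check that could be run periodically. It is only an illustration and not the fix that was eventually merged; the function names `run` and `ensure_running` are hypothetical, and it assumes a systemd host where `systemctl is-active --quiet` returns a non-zero exit code for a unit that is not active.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command and returns its exit code. It is injectable so the
# logic can be exercised without touching systemd.
Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (never raises on failure)."""
    return subprocess.run(list(cmd), check=False).returncode


def ensure_running(service: str, runner: Runner = run) -> str:
    """Start `service` if systemd does not report it as active.

    Returns "active" if nothing had to be done, "started" if a start
    attempt succeeded, and "failed" if the start attempt failed.
    """
    # `systemctl is-active --quiet` exits 0 only when the unit is active.
    if runner(["systemctl", "is-active", "--quiet", service]) == 0:
        return "active"
    # The unit is not active (for example it failed at boot): try to start it.
    if runner(["systemctl", "start", service]) == 0:
        return "started"
    return "failed"
```

Calling `ensure_running("ferm")` as root on an affected host would attempt the same manual step described above. A transient failure such as the DNS lookup error would show up as "failed" on one run and "started" on a later one.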

Event Timeline

Volans triaged this task as Medium priority. Oct 14 2018, 5:44 PM
Volans created this task.

I don't know if this is related, but today I noticed that if iptables rules are cleared (iptables -F), subsequent Puppet runs will not re-apply them. I also had to run systemctl restart ferm to get them re-applied.
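The flushed-rules case is different from the failed-start case because the ferm unit can still be reported as active while the kernel holds no rules. As a rough sketch only (not what ferm-status.py does), one could inspect the output of `iptables -S` and flag a ruleset that contains no appended rules. The function names below are hypothetical, and the check assumes a host that is always expected to have at least one rule.

```python
from typing import List


def appended_rules(iptables_s_output: str) -> List[str]:
    """Return the `-A ...` rule lines from `iptables -S` output.

    Policy lines (`-P`) and chain declarations (`-N`) remain after
    `iptables -F`, so only `-A` lines count as applied rules here.
    """
    return [
        line
        for line in iptables_s_output.splitlines()
        if line.startswith("-A ")
    ]


def looks_flushed(iptables_s_output: str) -> bool:
    """True when the ruleset contains no appended rules at all."""
    return not appended_rules(iptables_s_output)
```

The input would be the text printed by `iptables -S`, run as root. A host that legitimately has an empty ruleset would be flagged too, so a real check would need to compare against the rules ferm is expected to generate.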

Change 573335 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] ferm: add a very basic status check

https://gerrit.wikimedia.org/r/573335

Change 576101 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] ferm: Add status check

https://gerrit.wikimedia.org/r/576101

Change 576102 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] ferm: enable ferm status script

https://gerrit.wikimedia.org/r/576102

Change 589289 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/cookbooks@master] sre.wdqs.data-transfer: manage ferm rules required for transfer

https://gerrit.wikimedia.org/r/589289

Change 576101 merged by Jbond:
[operations/puppet@production] ferm: Add status check

https://gerrit.wikimedia.org/r/576101

Change 591036 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] ferm_status: store each chain in its own hash

https://gerrit.wikimedia.org/r/591036

Change 591036 merged by Jbond:
[operations/puppet@production] ferm_status: store each chain in its own hash

https://gerrit.wikimedia.org/r/591036

Change 591049 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] ferm_status: use ip_network to normalise src and dst addresses

https://gerrit.wikimedia.org/r/591049

Change 591049 merged by Jbond:
[operations/puppet@production] ferm_status: use ip_network to normalise src and dst addresses

https://gerrit.wikimedia.org/r/591049

Change 573335 abandoned by Jbond:
ferm: add a very basic status check

Reason:
we now have ferm-status.py

https://gerrit.wikimedia.org/r/573335

Change 576102 merged by Jbond:
[operations/puppet@production] ferm: enable ferm status script

https://gerrit.wikimedia.org/r/576102

Change 589289 merged by Ryan Kemper:
[operations/cookbooks@master] sre.wdqs.data-transfer: manage ferm rules required for transfer

https://gerrit.wikimedia.org/r/589289

Change 595061 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/cookbooks@master] sre.wdqs.data-transfer: fix missing commas

https://gerrit.wikimedia.org/r/595061

Change 595649 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/cookbooks@master] sre.wdqs.data-transfer: fix syntax, simplify rule

https://gerrit.wikimedia.org/r/595649

Change 595061 merged by Gehel:
[operations/cookbooks@master] sre.wdqs.data-transfer: fix syntax, simplify rule

https://gerrit.wikimedia.org/r/595061

Change 596073 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/cookbooks@master] sre.wdqs.data-transfer: use proper systemctl path

https://gerrit.wikimedia.org/r/596073

Change 596073 merged by Ryan Kemper:
[operations/cookbooks@master] sre.wdqs.data-transfer: use proper systemctl path

https://gerrit.wikimedia.org/r/596073

Change 595649 abandoned by Ryan Kemper:
sre.wdqs.data-transfer: fix syntax, simplify rule

Reason:
this is a "duplicate", a different gerrit patch of mine fixed this

https://gerrit.wikimedia.org/r/595649

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

akosiaris subscribed.

Removing SRE, this has already been triaged to a specific subteam.

Vgutierrez assigned this task to jbond.
Vgutierrez subscribed.

This is actually already fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/596157 (commit missed the Bug tag)