
Initial ferm setup is disruptive
Closed, Declined · Public

Description

Yesterday, when we were enabling ferm on an initial set of elasticsearch nodes, we ran into the problem that one of the nodes was dropped from the cluster. We tracked that down to how ferm is enabled at startup via puppet:

When the "package {'ferm} ensure => present" declaration kicks in, the Debian package get's installed with the default policy of DROP in the input chain. Only after that the actual ferm configuration files are processed which configure the permitted ports. Such a puppet run is captured in https://phabricator.wikimedia.org/P1930

I see two ways to fix that:

  1. Rebuild the ferm package with a default policy of ACCEPT (which we implicitly have for all systems that are not ferm-enabled anyway); after puppet has configured the ferm services/rules, the default configuration is gone anyway.
  2. Modify the puppet configuration for /etc/ferm/ferm.conf, /etc/ferm/functions.conf, /etc/ferm/conf.d and /etc/default/ferm in a way that the files are generated before the ferm package is installed (and we also need to ensure that the configuration files provided during package installation are correctly overwritten by the puppetised files).

I'm leaning towards option 1, since it's a straightforward fix, but I'd like to hear further opinions. (Ideally the ferm package would allow a custom default config via debconf.)

Event Timeline

MoritzMuehlenhoff claimed this task.
MoritzMuehlenhoff raised the priority of this task to Needs Triage.
MoritzMuehlenhoff updated the task description.

Fixing the default config will only limit the window; there's still a window between loading /etc/ferm/conf.d/00_main (which sets up the DROP policy) and the later rules which enable the permitted services.

Maybe this is another option?

  3. On a test host, the puppet compiler or similar, run the puppet class that generates our ferm rules before applying it on the actual production host. Then go to the actual production host, manually create /etc/ferm/conf.d/ and drop in the generated config files (roughly as sketched below). Only after that, apply the change that adds base::firewall.
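A hedged sketch of that pre-seeding idea; the hostname variable and local file locations are hypothetical, and the compile/extract step on the puppet compiler is manual and not shown:

```
# Pre-seeding sketch (hypothetical names).
PROD_HOST=someprodhost.example.net

# Pre-create the directory and drop in the pre-generated ferm snippets:
ssh "$PROD_HOST" 'mkdir -p /etc/ferm/conf.d'
scp compiled-conf.d/* "$PROD_HOST":/etc/ferm/conf.d/

# Only then merge the change that adds base::firewall and run the agent.
```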

But talking on IRC, Moritz already pointed out that the stock /etc/ferm/ferm.conf is read first, and only then does puppet replace it with a version that reads /etc/ferm/conf.d/.

Another observed effect, not disruptive per se but potentially unexpected: after enabling ferm (and thus conntrack), already established TCP sessions will get a broken pipe upon receiving the next packet that can't be related by conntrack. This affects e.g. memcache and salt, at which point they will reconnect.
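To illustrate why (a hedged sketch in plain iptables terms; the real ruleset is generated by ferm):

```
# A typical conntrack-based INPUT chain accepts return traffic only via the
# connection-tracking state match:
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT      # explicitly permitted services
iptables -P INPUT DROP
# Return packets of a TCP session opened before conntrack was loaded may not
# be classified as ESTABLISHED; they fall through to the default DROP/REJECT,
# the local application's next write fails with a broken pipe, and the client
# (e.g. memcache or salt) reconnects.
```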

Worth noting this was so noticeable in the elasticsearch case because we have a low tolerance for node loss at the moment. There are some issues open to look into it :)

> Fixing the default config will only limit the window; there's still a window between loading /etc/ferm/conf.d/00_main (which sets up the DROP policy) and the later rules which enable the permitted services.

IIRC, this window does not exist. ferm uses iptables-save/iptables-restore in the background: it will first apply the rules and then apply the policy. This also allows it to handle the entire process as a transaction and roll back in case of errors.
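For illustration, a hedged sketch in plain iptables of why applying rules before the policy avoids a lockout window (ferm drives iptables itself; the exact mechanism may differ):

```
# While the policy is still ACCEPT, appending rules changes nothing visible:
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT    # permitted services first
# Flipping the policy last means there is no moment at which traffic allowed
# by the final ruleset is dropped:
iptables -P INPUT DROP
```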

My take is option 1. I fear that option 2 will end up being messy and unwieldy and will confuse people.

I have a ferm test setup in labs; I'll give the patched ferm a try in there.

Honestly, I'd prefer option (2). Reversing the Package->File relationships shouldn't be very hard (we probably need an additional File['/etc/ferm'] resource to make sure the directory exists) and is a one-time fix, in contrast with (1), which means that we're stuck patching the ferm package for every new Debian release (and possibly every stable update).

I thought about the ferm package updates as well. It's mostly a once-per-Debian-release thing though, so once every two years; that's why I preferred it. But if we can do it nicely in the way of option 2, I'm fine with that as well.

We wouldn't need to rebuild ferm in an ongoing manner; this effect only applies to setups where ferm is retroactively applied to running services. All newly rolled-out services would have ferm rules set up along with the service in the role definitions.

Once we have complete coverage of systems (without the ones which intentionally don't use ferm, of course), we won't need that any longer, so we'll most probably have this resolved before Debian stretch is released. Also, ferm is fairly static; until now there haven't been any security updates or point-release bugfixes for it.

In the end, neither of the two options was used (since this is effectively a one-time transition). One very usable workaround, used on the Kafka brokers, was to (see the sketch after this list):

  • run "systemctl mask kafka-server.service"
  • make a puppet run which sets up the ferm rules
  • unmask again
  • let puppet restart kafka in a second puppet run
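In shell terms, the sequence was roughly the following (a hedged sketch; the unit name kafka-server.service is from the list above, the agent invocation is an assumption):

```
# Kafka broker workaround, sketched under the assumptions above.
systemctl mask kafka-server.service     # kafka cannot be (re)started while
                                        # ferm and conntrack are switched on
puppet agent --test                     # first run: installs ferm, writes rules
systemctl unmask kafka-server.service
puppet agent --test                     # second run: puppet restarts kafka,
                                        # now with the firewall in place
```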