Firewall sets not being loaded post-reboot due to a @resolve race
Closed, Resolved, Public

Description

I found three different, recently-rebooted hosts today (eeden, radon, es2015) that had no firewall loaded and only a WARNING alert for conntrack/sysctl.

This looks to me like a boot-time race between networking and the ferm init script: ferm appears to have failed to load because @resolve failed.

At minimum, we need a CRITICAL Icinga alert that is included in base::firewall and complains very loudly if there is no firewall set loaded.

As for potential solutions:

  • Fixing the race, somehow.
  • Deprecating @resolve and falling back to e.g. subnets/static mappings or Hiera or something.
  • Replacing @resolve with a puppet-time resolving of hostnames, via a custom parser function.
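To illustrate the third option, here is a minimal sketch of resolving hostnames when the configuration is generated rather than when ferm starts. It is written in Python for illustration only and is not an existing parser function; the names resolve_host and resolve_all are hypothetical. The point is that a failed lookup aborts configuration generation instead of leaving the host without a firewall at boot.

```python
import socket
from typing import Callable, Dict, Iterable, List


def resolve_host(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated IP addresses for hostname.

    socket.getaddrinfo raises socket.gaierror if the name cannot be
    resolved, so a failure surfaces at generation time.
    """
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def resolve_all(
    hostnames: Iterable[str],
    resolver: Callable[[str], List[str]] = resolve_host,
) -> Dict[str, List[str]]:
    """Resolve every hostname up front (hypothetical helper).

    The resulting mapping can be written into the generated ruleset as
    literal addresses, so nothing needs DNS when the firewall loads.
    """
    resolved: Dict[str, List[str]] = {}
    for name in hostnames:
        addresses = resolver(name)
        if not addresses:
            raise LookupError(f"no addresses found for {name}")
        resolved[name] = addresses
    return resolved
```

The trade-off of this approach is that the generated rules go stale if a host's address changes, until the configuration is regenerated.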

I'm setting this to UBN for security reasons until we have the aforementioned Icinga alert. After we have that, we can probably lower this to High.

Event Timeline

I'll create an Icinga check. As a secondary step it's worth looking into backporting the systemd unit from stretch.

FTR, we already have a parser function called ipresolve() that mostly does what we need here.

Change 318527 had a related patch set uploaded (by Muehlenhoff):
Check whether ferm has been correctly started

https://gerrit.wikimedia.org/r/318527

Obligatory UBN! priority check-in after 2.5 weeks. Is that prio still valid? Should this be prioritized within some team more highly? There's a related patch here and a couple more on the other task this was mentioned in that are not yet merged. (I haven't looked any deeper)

We already have monitoring for this (implicitly via the connection tracking Icinga check), but more explicit monitoring is under way via https://gerrit.wikimedia.org/r/#/c/318527/

Change 318527 merged by Muehlenhoff:
Check whether ferm has been correctly started

https://gerrit.wikimedia.org/r/318527

MoritzMuehlenhoff lowered the priority of this task from Unbreak Now! to High. Nov 14 2016, 3:42 PM

We now have an Icinga check which tests whether ferm is loaded.
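For readers unfamiliar with this kind of check, the following is a rough sketch of how an Icinga/Nagios-style plugin could detect an empty ruleset. It is not the check merged in change 318527; the function names and the "at least one -A rule" heuristic are assumptions for illustration. It relies on `iptables -S` printing one `-A ...` line per rule and on the conventional plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess
from typing import Tuple

# Conventional Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def count_rules(rules_output: str) -> int:
    """Count rule lines ("-A ...") in `iptables -S` style output.

    Policy lines ("-P ...") and chain definitions ("-N ...") are
    ignored, since they are present even when no rules are loaded.
    """
    return sum(1 for line in rules_output.splitlines() if line.startswith("-A "))


def evaluate(rules_output: str) -> Tuple[int, str]:
    """Map the rule listing to a plugin status and message."""
    rules = count_rules(rules_output)
    if rules == 0:
        return CRITICAL, "CRITICAL: no firewall rules loaded"
    return OK, f"OK: {rules} firewall rules loaded"


def main() -> int:
    """Run `iptables -S` (needs root) and report the plugin status."""
    try:
        result = subprocess.run(
            ["iptables", "-S"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"UNKNOWN: could not list rules: {exc}")
        return UNKNOWN
    status, message = evaluate(result.stdout)
    print(message)
    return status
```

A plugin wrapper would call `sys.exit(main())`. A stricter check could compare the loaded rules against what ferm is expected to generate, rather than only testing for an empty ruleset.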

As for fixing the actual race: systemd.special(7) lists an nss-lookup.target, which looks promising. I'll test this.

The @resolve "race" is also present in stretch, where it is more severe: ferm reliably fails when @resolve is used. See also https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=863802

The stretch issue is tracked separately via T166653.

fgiunchedi renamed this task from Firewall sets not being loaded post-reboot due to a @resolve race to Firewall sets not being loaded post-reboot due to a @resolve race on jessie. May 31 2017, 5:19 PM

We believe the workaround for T166653, which results in ferm loading late, also affects network state negatively: an automatic haproxy restart triggers a failover because the firewall is in a bad state. We have to introduce a 10-second delay in the haproxy systemd unit to work around the network issues. The proxy does not go back to the previous UP state, in order to prevent flapping in a real scenario.

Change 399377 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] haproxy: Add workaround for ferm starting too late

https://gerrit.wikimedia.org/r/399377

Change 399377 merged by Jcrespo:
[operations/puppet@production] haproxy: Add workaround for ferm starting too late

https://gerrit.wikimedia.org/r/399377

Change 399388 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] haproxy: Increase workaround for ferm starting too late

https://gerrit.wikimedia.org/r/399388

Change 399388 merged by Jcrespo:
[operations/puppet@production] haproxy: Increase workaround for ferm starting too late

https://gerrit.wikimedia.org/r/399388

Boldly resolving since Jessie is deprecated

I think we've still seen these on Stretch, let's keep this open until we're sure that this got fixed in Buster.

MoritzMuehlenhoff renamed this task from Firewall sets not being loaded post-reboot due to a @resolve race on jessie to Firewall sets not being loaded post-reboot due to a @resolve race. Apr 14 2020, 4:11 PM

Also this would need to be reverted: https://phabricator.wikimedia.org/T148986#3850836 before closing the ticket, but I don't think we have a proxy with buster yet to test it.

We haven't seen these as a general problem for a while. Also, there's monitoring in place, so if it happens again we can revisit specific cases. Closing.