
Ferm rules for backup roles
Closed, ResolvedPublic

Description

The basic approach is that including base::firewall on a host in site.pp enables a set of basic firewall rules which drop incoming connections by default. The puppet classes of the services running on the host then need to whitelist their own traffic.

Many services can be allowed using the ferm::service class:
https://doc.wikimedia.org/puppet/classes/ferm.html#M000641
More complex rules can be implemented using the ferm::rule class.

First the traffic patterns/ports used by these classes need to be identified and ferm rules added to them:
role::backup::director
role::backup::storage
role::backup::host

Once the ferm rules have been added, base::firewall can be included on the hosts which have ferm rules for all their services.

Event Timeline

MoritzMuehlenhoff raised the priority of this task from to Needs Triage.
MoritzMuehlenhoff updated the task description. (Show Details)

a backup::host uses:

9102/tcp for bacula-fd

a backup::director uses: (director role is combined with storage role, see helium)

9101/tcp for bacula-dir
9102/tcp for bacula-fd
9103/tcp for bacula-sd

a backup::storage uses: (when only storage, see heze)

9103/tcp for bacula-sd

backup::host nodes are already covered when base::firewall is included; see line 44 of role/backup.pp
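The port assignments above can be captured in a small lookup, for instance to probe a host once its rules are in place. This is only a sketch: the function name `bacula_ports` and the short role labels are illustrative and not part of the puppet repository; the port numbers are the ones listed in this comment.

```shell
#!/bin/sh
# Sketch only: bacula_ports and its role labels are hypothetical names.
# The port numbers are the ones listed in this task.
bacula_ports() {
    case "$1" in
        host)     echo "9102" ;;            # bacula-fd
        director) echo "9101 9102 9103" ;;  # bacula-dir, bacula-fd, bacula-sd (combined with storage)
        storage)  echo "9103" ;;            # bacula-sd
        *)        echo "unknown role: $1" >&2; return 1 ;;
    esac
}
```

Something like `for p in $(bacula_ports storage); do nc -z -w 2 heze "$p" || echo "port $p closed"; done` could then check reachability from another host, assuming a netcat variant that supports `-z`.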

Change 223849 had a related patch set uploaded (by Dzahn):
ferm rules for bacula director

https://gerrit.wikimedia.org/r/223849

Change 223851 had a related patch set uploaded (by Dzahn):
ferm rules for bacula storage

https://gerrit.wikimedia.org/r/223851

Dzahn added a subscriber: akosiaris.

Change 223851 abandoned by Dzahn:
ferm rules for bacula storage

Reason:
included in https://gerrit.wikimedia.org/r/#/c/223849/

https://gerrit.wikimedia.org/r/223851

Change 223849 merged by Alexandros Kosiaris:
ferm rules for bacula

https://gerrit.wikimedia.org/r/223849

summarizing what is done and what is not:

heze (storage): done
helium (director, storage): NOT done

lithium (host, syslog): done
terbium (host,maintenance): not done

other hosts are added via roles

Change 229054 had a related patch set uploaded (by Dzahn):
bacula: enable firewall on helium

https://gerrit.wikimedia.org/r/229054

Change 229054 merged by Dzahn:
bacula: enable firewall on helium

https://gerrit.wikimedia.org/r/229054

Dzahn triaged this task as Medium priority. Aug 6 2015, 8:26 PM

root@helium:~# /etc/init.d/ferm reload
 * Reloading Firewall configuration... iptables-restore: line 27 failed
Failed to run /sbin/iptables-restore
ip6tables-restore: line 28 failed
Failed to run /sbin/ip6tables-restore


Aug 6 20:29:51 helium kernel: [2422180.176691] ip6_tables: (C) 2000-2006 Netfilter Core Team
Aug 6 20:29:52 helium kernel: [2422181.102840] x_tables: ip_tables: NOTRACK target: only valid in raw table, not filter
Aug 6 20:29:52 helium kernel: [2422181.123525] x_tables: ip6_tables: NOTRACK target: only valid in raw table, not filter
Aug 6 20:29:52 helium puppet-agent[10048]: (/Stage[main]/Ferm/Service[ferm]) Failed to call refresh: Could not start Service[ferm]: Execution of '/etc/init.d/ferm start' returned 1:
Aug 6 20:29:52 helium puppet-agent[10048]: (/Stage[main]/Ferm/Service[ferm]) Could not start Service[ferm]: Execution of '/etc/init.d/ferm start' returned 1:


After that failure I ran:

#!/bin/sh
# removes all iptables rules
# https://en.wikipedia.org/wiki/Tear_down_this_wall!
echo "flushing all iptables rules.."
iptables -F
iptables -X
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
echo "done"

20150806-poolcounter

Summary

A firewall change was merged on the server helium, which serves as Bacula director and also as a poolcounter.
The ferm service failed to start. As soon as iptables rules were loaded, the conntrack table filled up.
Packets and connections to the poolcounter were dropped.
This caused an API outage (cached content was unaffected).

Timeline

~ 20:29 UTC - gerrit:229054 gets merged to complete T104996 and apply ferm rules on helium, ferm rules for both bacula and poolcounter exist
~ 20:31 UTC - Daniel runs puppet on helium and sees ferm fails to start with error [1]
~ 20:32 UTC - Icinga starts to report Socket timeouts for Apache HTTP and HHVM rendering
~ 20:33 UTC - Daniel runs the manual script that was kept around in case ferm fails, flushing all iptables rules [2], and disables the puppet agent
~ 20:40 UTC - Brandon starts looking at helium, connects via mgmt and finds [3] shortly after
~ 20:42 UTC - Daniel reverts gerrit:229054
~ 20:43 UTC - Daniel re-enables puppet-agent and attempts a run, which fails because helium cannot complete DNS lookups; packets are still being dropped
~ 20:43 UTC - Brandon manually rmmod's all the iptables kernel modules
~ 20:45 UTC - Icinga RECOVERies start showing up

Conclusions

  • avoid the issue with the NOTRACK target in ferm
  • if ferm fails and the conntrack table fills up, rmmod the iptables kernel modules; flushing all tables is not enough
  • poolcounter is a SPOF (T105378)
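To judge whether the conntrack table is the problem before reaching for rmmod, its fill level can be read from the kernel counters. This is a minimal sketch: the helper names `conntrack_pct` and `conntrack_usage` are made up for illustration, and the `/proc` files are only present while the nf_conntrack module is loaded.

```shell
#!/bin/sh
# Sketch only: conntrack_pct and conntrack_usage are hypothetical helper names.

# Prints how full the conntrack table is, as an integer percentage.
conntrack_pct() {
    count="$1"
    max="$2"
    [ "$max" -gt 0 ] || return 1
    echo $(( count * 100 / max ))
}

# Reads the live counters; these files exist only while nf_conntrack is loaded.
conntrack_usage() {
    dir=/proc/sys/net/netfilter
    [ -r "$dir/nf_conntrack_count" ] || { echo "nf_conntrack not loaded"; return 0; }
    conntrack_pct "$(cat "$dir/nf_conntrack_count")" "$(cat "$dir/nf_conntrack_max")"
}
```

Running `conntrack_usage` on the affected host prints either the percentage or a note that the module is gone. After a successful rmmod of the conntrack modules the files disappear, which matches the state in which recoveries started to show up.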

Actionables

[1]

Aug 6 20:29:51 helium kernel: [2422180.176691] ip6_tables: (C) 2000-2006 Netfilter Core Team
Aug 6 20:29:52 helium kernel: [2422181.102840] x_tables: ip_tables: NOTRACK target: only valid in raw table, not filter
Aug 6 20:29:52 helium kernel: [2422181.123525] x_tables: ip6_tables: NOTRACK target: only valid in raw table, not filter
Aug 6 20:29:52 helium puppet-agent[10048]: (/Stage[main]/Ferm/Service[ferm]) Failed to call refresh: Could not start Service[ferm]: Execution of '/etc/init.d/ferm start' returned 1:
Aug 6 20:29:52 helium puppet-agent[10048]: (/Stage[main]/Ferm/Service[ferm]) Could not start Service[ferm]: Execution of '/etc/init.d/ferm start' returned 1:

[2]

#!/bin/sh
# removes all iptables rules
# https://en.wikipedia.org/wiki/Tear_down_this_wall!
echo "flushing all iptables rules.."
iptables -F
iptables -X
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
echo "done"

[3]

Aug  6 20:30:32 helium kernel: [2422220.736092] nf_conntrack: table full, dropping packet.

[[Category:Incident documentation]]

#!/bin/sh
# removes all iptables rules
# https://en.wikipedia.org/wiki/Tear_down_this_wall!
echo "flushing all iptables rules.."
iptables -F
iptables -X
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
echo "done"

Actually this is wrong. You first need to set the policies to ACCEPT and only then flush all rules from the tables (-F). Otherwise you are going to get locked out of the box (as happened here).
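A reordered version of the script, with the policies opened before anything is flushed, might look like the sketch below. The `run` wrapper and the `APPLY` variable are additions for illustration only: by default the commands are just printed so the order can be reviewed, and nothing touches the firewall unless `APPLY=1` is set.

```shell
#!/bin/sh
# Sketch only: run() and APPLY are hypothetical; they are not part of the original script.
# Without APPLY=1 the commands are printed instead of executed.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

flush_firewall() {
    # 1. Open the default policies first, so that an empty chain cannot drop traffic.
    for chain in INPUT FORWARD OUTPUT; do
        run iptables -P "$chain" ACCEPT
    done
    # 2. Only then flush the rules and delete user-defined chains.
    for table in filter nat mangle; do
        run iptables -t "$table" -F
        run iptables -t "$table" -X
    done
}
```

Calling `flush_firewall` prints the nine commands in order; `APPLY=1 flush_firewall` as root would execute them. As the incident shows, this still does not empty a full conntrack table.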

Sigh, so NOTRACK failed due to being applied on the wrong table and then disaster ensued.
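For reference, the kernel message means that a NOTRACK rule has to live in the raw table, whose built-in chains are PREROUTING and OUTPUT. The sketch below prints plain iptables equivalents rather than ferm syntax; `notrack_rules` and its port argument are hypothetical, and the actual ferm change is the one in https://gerrit.wikimedia.org/r/230052.

```shell
#!/bin/sh
# Sketch only: notrack_rules and the port argument are illustrative.
# Prints iptables commands that skip connection tracking for one TCP service port.
# NOTRACK is only accepted in the raw table, so every rule names "-t raw".
notrack_rules() {
    port="$1"
    # Incoming packets addressed to the service.
    echo "iptables -t raw -A PREROUTING -p tcp --dport $port -j NOTRACK"
    # Replies leaving the service.
    echo "iptables -t raw -A OUTPUT -p tcp --sport $port -j NOTRACK"
}
```

The same rules placed in the filter table are what triggered "NOTRACK target: only valid in raw table, not filter".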

Is it just me, or is this incident documentation missing from https://wikitech.wikimedia.org/wiki/Incident_documentation ? I seem to be unable to find it ...

Change 230049 had a related patch set uploaded (by Alexandros Kosiaris):
Temporarily disable helium's poolcounter

https://gerrit.wikimedia.org/r/230049

Change 230049 merged by jenkins-bot:
Temporarily disable helium's poolcounter

https://gerrit.wikimedia.org/r/230049

Change 230052 had a related patch set uploaded (by Alexandros Kosiaris):
ferm: NOTRACK needs to be applied on raw table

https://gerrit.wikimedia.org/r/230052

Change 230053 had a related patch set uploaded (by Alexandros Kosiaris):
Revert "Revert "bacula: enable firewall on helium""

https://gerrit.wikimedia.org/r/230053

Change 230052 merged by Alexandros Kosiaris:
ferm: NOTRACK needs to be applied on raw table

https://gerrit.wikimedia.org/r/230052

Change 230053 merged by Alexandros Kosiaris:
Revert "Revert "bacula: enable firewall on helium""

https://gerrit.wikimedia.org/r/230053

After fixing the NOTRACK bug with https://gerrit.wikimedia.org/r/230052 and re-enabling the firewall on helium with https://gerrit.wikimedia.org/r/230053, things seem to be OK. I'll repool helium as a poolcounter with https://gerrit.wikimedia.org/r/#/c/230054/ and hopefully that will be fine too.

After repooling helium as a poolcounter backend, things are working just fine. Resolving this.

Actionables

I've created https://phabricator.wikimedia.org/T108303 to detect failing ferm restarts
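One possible shape for such a check is sketched below. It is not the check from T108303; `check_input_policy` is a hypothetical name, and it assumes that a host with ferm rules loaded has an INPUT policy of DROP, which would need to be confirmed against the base::firewall rules. It reads `iptables -S INPUT` output on stdin and uses Nagios-style exit codes.

```shell
#!/bin/sh
# Sketch only: check_input_policy is a hypothetical check, not the one from T108303.
# Assumption: a host with ferm rules loaded has "-P INPUT DROP" as its INPUT policy.
check_input_policy() {
    if grep -q '^-P INPUT DROP$'; then
        echo "OK: INPUT policy is DROP"
        return 0
    fi
    echo "CRITICAL: INPUT policy is not DROP, ferm rules may not be loaded"
    return 2
}
```

On a host it would be fed as `iptables -S INPUT | check_input_policy`, which needs root.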