
Ferm rules for elasticsearch
Closed, Resolved (Public)

Description

The basic approach is that including base::firewall on a host in site.pp enables a set of basic firewall rules which drop incoming connections by default. In addition, the puppet classes of the services running on the host need to whitelist their traffic.

Many services can be allowed using the ferm::service class:
https://doc.wikimedia.org/puppet/classes/ferm.html#M000641
More complex rules can be implemented using the ferm::rule class.
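As an illustration only (the rule names, port numbers and source range below are assumptions, not the rule that was eventually merged), a ferm::service declaration in a service's puppet class might look roughly like this, with ferm::rule as the fallback for anything the simple define cannot express:

# Hypothetical sketch: whitelist the elasticsearch HTTP port for internal traffic.
# Parameter names follow the ferm::service interface linked above; values are illustrative.
ferm::service { 'elastic-http':
    proto  => 'tcp',
    port   => '9200',
    srange => '$INTERNAL',
}

# For anything more complex, ferm::rule takes a raw ferm snippet (again illustrative):
ferm::rule { 'elastic-transport':
    rule => 'saddr $INTERNAL proto tcp dport 9300 ACCEPT;',
}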

First the traffic patterns/ports used by these classes need to be identified and ferm rules added to them:
elasticsearch::server

Once the ferm rules have been added, base::firewall can be included on the hosts which have ferm rules for all their services.
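For example (the node pattern here is illustrative and not taken from site.pp; role::elasticsearch::server is the role visible in the puppet run further down), the site.pp side would then be along the lines of:

node /^elastic10\d\d\.eqiad\.wmnet$/ {
    include base::firewall
    include role::elasticsearch::server
}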

Revisions and Commits

Event Timeline

MoritzMuehlenhoff raised the priority of this task from to Needs Triage.
MoritzMuehlenhoff updated the task description. (Show Details)

Change 224095 had a related patch set uploaded (by Muehlenhoff):
WIP: ferm rules for elasticsearch

https://gerrit.wikimedia.org/r/224095

Change 230955 had a related patch set uploaded (by Dzahn):
elasticsearch: add cluster hosts to hiera

https://gerrit.wikimedia.org/r/230955

Change 230955 merged by Dzahn:
elasticsearch: add cluster hosts to hiera

https://gerrit.wikimedia.org/r/230955
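Presumably the point of putting the cluster hosts into hiera is so that the ferm rule for the inter-node transport port can be restricted to cluster members. A rough, hedged sketch of how such a list could be consumed (the hiera key, variable names and port 9300 are assumptions, not the content of the merged change):

# Hypothetical hiera data:
#   elasticsearch::cluster_hosts:
#     - elastic1001.eqiad.wmnet
#     - elastic1002.eqiad.wmnet

$cluster_hosts      = hiera('elasticsearch::cluster_hosts')
$cluster_hosts_ferm = join($cluster_hosts, ' ')

ferm::service { 'elastic-inter-node':
    proto  => 'tcp',
    port   => '9300',
    srange => "@resolve((${cluster_hosts_ferm}))",
}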

Change 224095 merged by Muehlenhoff:
ferm rules for elasticsearch

https://gerrit.wikimedia.org/r/224095

Regarding the elastic1001 problem (where we have concerns about instability on the master): we discussed "preseeding" the ferm configuration to ensure a full config is in place before the service is enabled.

To do that, I ran:

aptitude install ferm

aptitude install conntrack libnet-dns-perl

mkdir /etc/ferm/conf.d
cp /root/ferm/functions.conf /etc/ferm/functions.conf
cp /root/ferm/ferm.conf /etc/ferm/ferm.conf
cp /root/ferm/conf.d/* /etc/ferm/conf.d/

chmod 400 /etc/ferm/ferm.conf
chmod 400 /etc/ferm/functions.conf
chmod 500 /etc/ferm/conf.d
chmod 400 /etc/ferm/conf.d/*

chown root:root /etc/ferm/ferm.conf
chown root:root /etc/ferm/functions.conf
chown -R root:root /etc/ferm/conf.d
chown root:root /etc/ferm/conf.d/*

diff -u /root/ferm/ferm.conf /etc/ferm/ferm.conf

By applying the above (with a /etc/ferm/ dir from an existing host) I was able to watch ICMP traffic to look for any loss with:

ping elastic1001 -D -i 0.2

The result with the above preseeding was:

344 packets transmitted, 344 received, 0% packet loss, time 68703ms
rtt min/avg/max/mdev = 0.070/0.601/41.288/3.631 ms

I saw no instability in the cluster, although it has been racy and not at all consistent, to be fair.

I'm leaving 1-3 for tomorrow; I think we should do all three in this manner and it will be alright.

Note, I am rolling out https://gerrit.wikimedia.org/r/#/c/235048/ which should fix wikitech search now

We attempted to enable ferm on the active master node today (and the only remaining node). I 'preseeded' the ferm configuration as described above; here is the puppet output:

root@elastic1001:~# puppet agent --enable && puppet agent --test
Info: Retrieving plugin
Notice: /File[/var/lib/puppet/lib]/mode: mode changed '0755' to '0775'
Notice: /File[/var/lib/puppet/lib/puppet]/mode: mode changed '0755' to '0775'
Notice: /File[/var/lib/puppet/lib/puppet/provider]/mode: mode changed '0775' to '0755'
Notice: /File[/var/lib/puppet/lib/facter]/mode: mode changed '0755' to '0775'
Info: Loading facts in /var/lib/puppet/lib/facter/root_home.rb
Info: Loading facts in /var/lib/puppet/lib/facter/labsproject.rb
Info: Loading facts in /var/lib/puppet/lib/facter/lldp.rb
Info: Loading facts in /var/lib/puppet/lib/facter/physicalcorecount.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_vardir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/pe_version.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_config_dir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/ganeti.rb
Info: Loading facts in /var/lib/puppet/lib/facter/initsystem.rb
Info: Loading facts in /var/lib/puppet/lib/facter/apt.rb
Info: Caching catalog for elastic1001.eqiad.wmnet
Info: Applying configuration version '1441129515'
Notice: /Stage[main]/Ferm/File[/etc/ferm/conf.d]/group: group changed 'root' to 'adm'
Notice: /Stage[main]/Ferm/File[/etc/ferm/conf.d]/mode: mode changed '2500' to '0500'
Info: /Stage[main]/Ferm/File[/etc/ferm/conf.d]: Scheduling refresh of Service[ferm]
Info: /Stage[main]/Ferm/File[/etc/ferm/conf.d]: Scheduling refresh of Service[ferm]
Notice: /Stage[main]/Ferm/File[/etc/default/ferm]/content:
--- /etc/default/ferm	2015-09-01 17:44:42.047688752 +0000
+++ /tmp/puppet-file20150901-1720-1hocm7w	2015-09-01 17:45:42.080611264 +0000
@@ -6,8 +6,9 @@
 # cache the output of ferm --lines in /var/cache/ferm?
 CACHE=yes

-# additional paramaters for ferm (like --def '=bar')
+# additional paramaters for ferm (like --def '$foo=bar')
 OPTIONS=

-# Enable the ferm init script? (i.e. run on bootup)
-ENABLED="no"
+# Enable ferm on bootup?
+ENABLED=yes
+

Info: /Stage[main]/Ferm/File[/etc/default/ferm]: Filebucketed /etc/default/ferm to puppet with sum 46b9229b866048679a9e8a771df022d9
Notice: /Stage[main]/Ferm/File[/etc/default/ferm]/content: content changed '{md5}46b9229b866048679a9e8a771df022d9' to '{md5}0efc32eb1a5bdd252d8e48b40912db79'
Notice: /Stage[main]/Ferm/File[/etc/default/ferm]/mode: mode changed '0644' to '0400'
Info: /Stage[main]/Ferm/File[/etc/default/ferm]: Scheduling refresh of Service[ferm]
Info: /Stage[main]/Ferm/File[/etc/default/ferm]: Scheduling refresh of Service[ferm]
Notice: /Stage[main]/Role::Elasticsearch::Server/Ferm::Service[elastic-http]/File[/etc/ferm/conf.d/10_elastic-http]/content:
--- /etc/ferm/conf.d/10_elastic-http	2015-09-01 17:44:52.331846789 +0000
+++ /tmp/puppet-file20150901-1720-1ohgggw	2015-09-01 17:45:43.388631364 +0000
@@ -1,6 +1,6 @@
 # Autogenerated by puppet. DO NOT EDIT BY HAND!
 #
-#
+#
 &R_SERVICE(tcp, 9200, (($INTERNAL @resolve(silver.wikimedia.org))));



Info: /Stage[main]/Role::Elasticsearch::Server/Ferm::Service[elastic-http]/File[/etc/ferm/conf.d/10_elastic-http]: Filebucketed /etc/ferm/conf.d/10_elastic-http to puppet with sum 3da75f2925511c8852f67b368ac065ec
Notice: /Stage[main]/Role::Elasticsearch::Server/Ferm::Service[elastic-http]/File[/etc/ferm/conf.d/10_elastic-http]/content: content changed '{md5}3da75f2925511c8852f67b368ac065ec' to '{md5}8754be5fd584e8acb03df8ebc8ccc482'
Info: /Stage[main]/Role::Elasticsearch::Server/Ferm::Service[elastic-http]/File[/etc/ferm/conf.d/10_elastic-http]: Scheduling refresh of Service[ferm]
Notice: /Stage[main]/Base::Firewall/File[/usr/lib/nagios/plugins/check_conntrack]/ensure: defined content as '{md5}a8aff5f009ffc6d68e0a2397ff67f0bb'
Notice: /Stage[main]/Ferm/Service[ferm]: Triggered 'refresh' from 5 events
Notice: /Stage[main]/Base::Firewall/Nrpe::Monitor_service[conntrack_table_size]/Nrpe::Check[check_conntrack_table_size]/File[/etc/nagios/nrpe.d/check_conntrack_table_size.cfg]/ensure: created
Info: /Stage[main]/Base::Firewall/Nrpe::Monitor_service[conntrack_table_size]/Nrpe::Check[check_conntrack_table_size]/File[/etc/nagios/nrpe.d/check_conntrack_table_size.cfg]: Scheduling refresh of Service[nagios-nrpe-server]
Notice: /Stage[main]/Nrpe/Service[nagios-nrpe-server]: Triggered 'refresh' from 1 events
Notice: Finished catalog run in 32.40 seconds
root@elastic1001:~#

The wikitech search allow was new, but that shouldn't have caused any issues with 9200/9300 traffic. Shortly after we applied this, the master began losing track of nodes and the cluster went red. Red generally means primary shards are missing, so indexing/writes return an exception and reads return incomplete results. I stopped ferm on the host and verified the iptables rules were flushed. The cluster slowly began recovering, but it took several hours. There was scattered user impact, but real failure to serve requests lasted probably 10 minutes or so, we think; beyond that it was mostly just cluster recovery in the yellow state for the long haul.

Questions (to be discussed on a call tomorrow):

  1. Was this related to the brief conntrack blip for connectivity on start?
  2. If this is a general FW rule issue with our master setup why would logstash not have the same problem? (I don't think it is)
  3. How can we gracefully migrate the master role to another node already running Ferm? (I'm not sure we can as this cluster is really unstable)
  4. Best acceptable options for moving forward?

For now I have excluded 1001 from the list of ferm-enforced nodes: https://phabricator.wikimedia.org/rOPUP824b214e76a845b22e7fe12065df42c756712ae5

We had a meeting with the following outcomes:

  1. We moved all shards off of the master
  2. We picked a relative low traffic time
  3. We removed the master from LVS (no user searches)
  4. I issued service elasticsearch stop && service ferm start && service elasticsearch start
  5. Saw both elastic and ferm come up fine
  6. Saw master role migration to 1008
  7. Verified cluster health
  8. Reverted the LVS change and the Icinga maintenance

The master has about double the connections of a standard node but that seems to stay below 900. No real worries at this moment.