
Firewall configurations for database hosts
Closed, ResolvedPublic

Description

One of the quarterly goals is more complete firewall coverage. Our firewall configuration is based on ferm (http://ferm.foo-projects.org/download/2.2/ferm.htm) and can be configured through the ferm puppet module.

The basic approach is that including base::firewall on a host in site.pp enables a set of basic firewall rules which drop incoming connections by default. In addition, the puppet classes of the services running on the host need to whitelist their traffic.
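The wiring can be sketched roughly as follows (the node name is hypothetical, not taken from the actual site.pp):

```puppet
# site.pp sketch: base::firewall installs the default-drop ruleset,
# the role class is then responsible for whitelisting its own ports
node 'db1099.eqiad.wmnet' {
    include base::firewall
    include role::mariadb::core
}
```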

Many services can be allowed using the ferm::service class:
https://doc.wikimedia.org/puppet/classes/ferm.html#M000641
More complex rules can be implemented using the ferm::rule class.
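For illustration, a simple service whitelist and a more complex raw rule might look like this (the resource titles and the srange value are made up; see the linked class documentation for the real parameters):

```puppet
# Simple case: open the MariaDB port to a source range
ferm::service { 'mariadb':
    proto  => 'tcp',
    port   => 3306,
    srange => '$INTERNAL',
}

# Complex case: pass raw ferm syntax through ferm::rule
ferm::rule { 'mariadb-replication':
    rule => 'proto tcp dport 3306 { saddr 10.0.0.0/8 ACCEPT; }',
}
```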

First the traffic patterns/ports used by these classes need to be identified and ferm rules added to them:
role::coredb::*
role::mariadb::core
role::mariadb::dbstore
role::mariadb::proxy

Once the ferm rules have been added, base::firewall can be included on the hosts which have ferm rules for all their services.

Event Timeline

MoritzMuehlenhoff raised the priority of this task from to Needs Triage.
MoritzMuehlenhoff updated the task description. (Show Details)
Restricted Application added subscribers: Matanya, Aklapper.Jul 3 2015, 12:30 PM
jcrespo added a subscriber: jcrespo.Jul 3 2015, 1:11 PM

Question:

  • has iptables impact been measured? This may sound stupid, and I have never heard of it being an issue (after all, it is kernel code), but please note that some servers have uncommon patterns: 20000 connections open at the same time and thousands created per second, to the point that we had to increase the TCP connection timeout from 1 second to 3. We should run a test to profile the network.

Suggestion:

  • Can we enable this in "LOG" mode before the "DROP" mode? This will help us detect 99% of issues beforehand. Is this something that will eventually be sent to kibana?

You have 100% my support, although this should be split on smaller tasks and keep more people on the loop.

  • has iptables impact been measured? This may sound stupid, and I have never heard of it being an issue (after all, it is kernel code), but please note that some servers have uncommon patterns: 20000 connections open at the same time and thousands created per second, to the point that we had to increase the TCP connection timeout from 1 second to 3. We should run a test to profile the network.

Sure, this should be tested on a few test hosts beforehand. netfilter in general is fairly performant, but it's difficult to quantify the overhead in the abstract; it very much depends on the final rules etc. So we should test this to get some numbers.

  • Can we enable this in "LOG" mode before the "DROP" mode? This will help us detect 99% of issues beforehand.

Certainly for the debugging/implementation phase.
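In ferm syntax, a log-then-drop tail for the INPUT chain could be sketched like this (illustrative only, not the production ruleset):

```
domain ip table filter chain INPUT {
    policy DROP;
    # ... service whitelists go here ...

    # During the debugging/implementation phase, log what would be dropped
    LOG log-prefix "ferm-drop: " log-level warning;
}
```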

Is this something that will eventually be sent to kibana?

That needs some thought at a later point; let's make these changes one at a time.

You have 100% my support, although this should be split on smaller tasks and keep more people on the loop.

Sure, there are a lot of different db roles and shards, so please split this into sub-tickets at your discretion.

BBlack added a subscriber: BBlack.Jul 3 2015, 1:33 PM

Question:

  • has iptables impact been measured?

I don't know that it will matter for most hosts, but for our high-volume traffic hosts (e.g. LVS and cache machines), we currently avoid iptables for performance/limitation reasons. Even if we were to enable some iptables rules there, we'd want to be very sure they were stateless (not invoking connection-tracking mode for automatic TCP in the reverse direction, etc.).
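As a sketch, ferm can express such stateless rules by disabling connection tracking in the raw table and then matching packets without any state module (port 80 here is just an example):

```
# Exempt high-volume traffic from conntrack entirely
domain ip table raw {
    chain PREROUTING proto tcp dport 80 NOTRACK;
    chain OUTPUT     proto tcp sport 80 NOTRACK;
}

# Accept it statelessly in the filter table (no mod state match)
domain ip table filter {
    chain INPUT  proto tcp dport 80 ACCEPT;
    chain OUTPUT proto tcp sport 80 ACCEPT;
}
```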

Probably all LVS-destination hosts (caches, appservers, etc.) need special consideration as well, as they have alternate non-primary IP addresses configured on the loopback interface, which they also use for traffic over the primary ethernet interface.

Question:

  • has iptables impact been measured?

I don't know that it will matter for most hosts, but for our high-volume traffic hosts (e.g. LVS and cache machines), we avoid iptables currently for performance/limitation reasons.

The quarterly goal accounts for that with the "explicitly assessed as unwanted and not needed" clause :-)
I'll create a separate Phab task to collect those services which have been decided as such (to properly keep them on record).

Doing this for most production DBs seems straightforward. Pain points, due purely to complexity, will be on M[1-4].

My expectation is that iptables won't be a performance bottleneck, but +1 to doing some testing.

Another slightly tangential thing to consider: we (or I) often pipe data around on non-standard ports using netcat and xtrabackup when cloning servers. We should formalize how that is done.

jcrespo added a comment (edited).Jul 4 2015, 3:27 PM

M[1-5] :-)

And it is definitely a We!

We could standardize on port 4444, which is unassigned according to /etc/services and is the one Galera uses for xtrabackup SSTs: http://galeracluster.com/documentation-webpages/firewallsettings.html

(... among db* nodes).

Wasn't sure if you wanted to change that process :)

4444 would be fine. All sounds good, then.
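That convention could then be whitelisted in puppet along these lines ($DB_HOSTS is a hypothetical ferm variable standing in for the db* nodes):

```puppet
# Allow ad-hoc xtrabackup/netcat transfers on the agreed port,
# restricted to the database hosts themselves
ferm::service { 'xtrabackup-clone':
    proto  => 'tcp',
    port   => 4444,
    srange => '$DB_HOSTS',
}
```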

Dzahn added a parent task: Restricted Task.Jul 7 2015, 2:35 AM

Change 225851 had a related patch set uploaded (by Muehlenhoff):
Add ferm rules for dbproxy hosts

https://gerrit.wikimedia.org/r/225851

MoritzMuehlenhoff set Security to None.

Change 226068 had a related patch set uploaded (by Muehlenhoff):
Add ferm rules for mariadb labsdb

https://gerrit.wikimedia.org/r/226068

Change 226267 had a related patch set uploaded (by Muehlenhoff):
Add ferm rules for dbstore systems

https://gerrit.wikimedia.org/r/226267

Change 226068 merged by Dzahn:
Add ferm rules for mariadb labsdb

https://gerrit.wikimedia.org/r/226068

Change 228228 had a related patch set uploaded (by Muehlenhoff):
Create a common ferm base class for the database hosts and move the existing labsdb slave definition over to it

https://gerrit.wikimedia.org/r/228228

Change 228237 had a related patch set uploaded (by Muehlenhoff):
Enable ferm rules for role::mariadb::dbstore

https://gerrit.wikimedia.org/r/228237

Change 226267 abandoned by Muehlenhoff:
Add ferm rules for dbstore systems

Reason:
Continued in 228237

https://gerrit.wikimedia.org/r/226267

Change 228239 had a related patch set uploaded (by Muehlenhoff):
Add ferm rules for role::mariadb::proxy

https://gerrit.wikimedia.org/r/228239

Change 225851 abandoned by Muehlenhoff:
Add ferm rules for dbproxy hosts

Reason:
Abandon in favour of new 228239 change using role::mariadb::ferm

https://gerrit.wikimedia.org/r/225851

Change 228782 had a related patch set uploaded (by Muehlenhoff):
Add ferm rules for role::mariadb::misc::phabricator

https://gerrit.wikimedia.org/r/228782

Change 228783 had a related patch set uploaded (by Muehlenhoff):
Add ferm rules for role::mariadb::misc

https://gerrit.wikimedia.org/r/228783

Change 228804 had a related patch set uploaded (by Muehlenhoff):
Add ferm rules for role::mariadb::core

https://gerrit.wikimedia.org/r/228804

Change 228806 had a related patch set uploaded (by Muehlenhoff):
Add ferm rules for coredb classes

https://gerrit.wikimedia.org/r/228806

Change 228228 merged by Muehlenhoff:
Create a common ferm base class for the database hosts and move the existing labsdb slave definition over to it

https://gerrit.wikimedia.org/r/228228

Change 228237 merged by Muehlenhoff:
Enable ferm rules for role::mariadb::dbstore

https://gerrit.wikimedia.org/r/228237

Change 228239 merged by Muehlenhoff:
Add ferm rules for role::mariadb::proxy

https://gerrit.wikimedia.org/r/228239

Change 228783 merged by Muehlenhoff:
Add ferm rules for role::mariadb::misc

https://gerrit.wikimedia.org/r/228783

The snapshot hosts need mysql access to production hosts (in theory, only to snapshot hosts, but the configuration is not in puppet, it is in mediawiki-config). Snapshot hosts do not have 10.x IPs.

Change 228782 merged by Muehlenhoff:
Add ferm rules for role::mariadb::misc::phabricator

https://gerrit.wikimedia.org/r/228782

Change 234489 had a related patch set uploaded (by Muehlenhoff):
Add ferm rules for mariadb sanitarium

https://gerrit.wikimedia.org/r/234489

Change 234489 merged by Muehlenhoff:
Add ferm rules for mariadb sanitarium

https://gerrit.wikimedia.org/r/234489

Change 228804 abandoned by Muehlenhoff:
Add ferm rules for role::mariadb::core

Reason:
Duplicate, already merged

https://gerrit.wikimedia.org/r/228804

Change 240042 had a related patch set uploaded (by Muehlenhoff):
Add ferm rules for role::mariadb::misc::eventlogging

https://gerrit.wikimedia.org/r/240042

Change 240042 merged by Muehlenhoff:
Add ferm rules for role::mariadb::misc::eventlogging

https://gerrit.wikimedia.org/r/240042

Restricted Application added a subscriber: StudiesWorld.Nov 19 2015, 3:41 PM

Change 228806 abandoned by Muehlenhoff:
Add ferm rules for coredb classes

Reason:
Already enabled in 75fbe9b62d76f77f7479a3784463d30ec28ca328

https://gerrit.wikimedia.org/r/228806

Given that "role::coredb::*" is to be deprecated, I would close this ticket after:

  1. the db hosts still pending ferm are listed
  2. the new rules deleting iron are applied

I believe only a few hosts would be pending (misc?), for which I would create another, lower-precedence ticket.

Sounds good to me, I'll update the ticket later with the list of hosts.

These are the remaining db systems without base::firewall:

Servers using coredb, which are being re-set up with mariadb::core:
db1052.eqiad.wmnet
db1038.eqiad.wmnet
db1040.eqiad.wmnet
db1058.eqiad.wmnet
db1023.eqiad.wmnet
db1033.eqiad.wmnet
db1029.eqiad.wmnet
db1001.eqiad.wmnet

Systems from the misc shard, which haven't been failed over yet:
db1009.eqiad.wmnet
db1016.eqiad.wmnet
db1020.eqiad.wmnet

This server is apparently no longer in use: T129395
db1010.eqiad.wmnet

tendril:
db1011.eqiad.wmnet

And there are a few systems in codfw which don't have it yet:
db2001.codfw.wmnet
db2002.codfw.wmnet
db2003.codfw.wmnet
db2004.codfw.wmnet
db2005.codfw.wmnet
db2007.codfw.wmnet
db2011.codfw.wmnet
db2033.codfw.wmnet

And finally the db proxies:
dbproxy1001.eqiad.wmnet
dbproxy1002.eqiad.wmnet
dbproxy1003.eqiad.wmnet
dbproxy1004.eqiad.wmnet
dbproxy1005.eqiad.wmnet
dbproxy1008.eqiad.wmnet

Updated list, now really small:

db1009.eqiad.wmnet
db1020.eqiad.wmnet
db2011.codfw.wmnet
db1010.eqiad.wmnet (apparently no longer in use: T129395)

And finally the db proxies:
dbproxy1001.eqiad.wmnet
dbproxy1002.eqiad.wmnet
dbproxy1003.eqiad.wmnet
dbproxy1004.eqiad.wmnet
dbproxy1005.eqiad.wmnet
dbproxy1008.eqiad.wmnet

jcrespo moved this task from Triage to Next on the DBA board.

We should be able to do db1009 at any time; I do not have a good candidate for failover on eqiad, but most services should be able to handle a few seconds of unavailability without problems (especially once dns is no longer there). Alternatively, we could try to decommission it and fail it over to a more recent database server.

db1020 has a pending cleanup/upgrade; we could also do it without any issue, but I would like to do the maintenance at the same time as an upgrade.

db2011 can be done at any time (but at a different time than db1020).

db1010 has not yet been properly decommissioned; the ticket above tracks that this is pending.

The dbproxies have a failover right now (very inefficient, but it is that or having the machines unused until we have a better alternative, T141547). We should do the passive ones (the ones not in dns right now) first, then fail over, and do the rest.

jcrespo moved this task from Next to Meta/Epic on the DBA board.Nov 10 2016, 12:04 PM
jcrespo triaged this task as Medium priority.Nov 24 2016, 11:56 AM

Change 377460 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Add new m1 host db2078, enable firewall on all misc services

https://gerrit.wikimedia.org/r/377460

Change 377460 merged by Jcrespo:
[operations/puppet@production] Add new m1 host db2078, enable firewall on all misc services

https://gerrit.wikimedia.org/r/377460

I think after the above patch, only the proxies are missing?

Change 399164 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall

https://gerrit.wikimedia.org/r/399164

Change 399164 merged by Jcrespo:
[operations/puppet@production] mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall

https://gerrit.wikimedia.org/r/399164

Change 399194 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbproxy: Apply both regular and cloud only exception for 'cloud'

https://gerrit.wikimedia.org/r/399194

Change 399194 merged by Jcrespo:
[operations/puppet@production] dbproxy: Apply both regular and cloud only exception for 'cloud'

https://gerrit.wikimedia.org/r/399194

Firewall has been enabled on all proxies except the active ones:

dbproxy1002.yaml:profile::mariadb::proxy::firewall: 'disabled'
dbproxy1003.yaml:profile::mariadb::proxy::firewall: 'disabled'
dbproxy1006.yaml:profile::mariadb::proxy::firewall: 'disabled'
dbproxy1009.yaml:profile::mariadb::proxy::firewall: 'disabled'
dbproxy1010.yaml:profile::mariadb::proxy::firewall: 'disabled'
dbproxy1011.yaml:profile::mariadb::proxy::firewall: 'disabled'
jcrespo moved this task from Meta/Epic to Done on the DBA board.Apr 4 2018, 1:09 PM

@MoritzMuehlenhoff We should have full coverage of ferm on all db and proxy hosts, with the exception of dbproxy1010 and dbproxy1011, which are managed with the state 'disabled' (they are public hosts with only haproxy pointing to the labsdb public hosts).

As far as I know, this ticket should be complete (full ferm coverage).

In modules/profile/manifests/mariadb/ there are 3 possibilities: ferm.pp (internal rules only, used by most of the hosts), ferm_misc.pp (misc hosts with ports open to public-IP clients), and ferm_wmcs.pp (labs support hosts with ports open to the labs/public network).
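A role would then pull in exactly one of the three profiles, roughly like this (the role names are illustrative, not the actual production classes):

```puppet
# Internal-only rules: the common case
class role::mariadb::core {
    include profile::mariadb::ferm
}

# Misc services with public-IP clients
class role::mariadb::misc {
    include profile::mariadb::ferm_misc
}

# Labs support hosts open to the labs/public network
class role::mariadb::wmcs {
    include profile::mariadb::ferm_wmcs
}
```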

Please double-check that you are OK with this. We could and should open a separate task to improve the proxy configuration with tighter ranges, but that may require a better puppetization model than the current one (for better client-server coordination of services).

MoritzMuehlenhoff closed this task as Resolved.Aug 6 2019, 6:55 AM

This is complete.