Page MenuHomePhabricator

Create Generalised blocking strategy
Open, MediumPublic

Description

We should try and centralise and standardise how we block, rate-limit and filter traffic across our infrastructure.

We currently have blocking implementations using

  • apache
  • varnish
  • iptables
  • routers
  • CDN provider
  • other???

Further to having multiple different implementations for blocking we have multiple locations where the block lists exist. for apache the lists are mostly scattered around the public and private repo. The iptables rules are mostly managed by the ferm::rule puppet resource and can be mostly queried using the puppet db. Varnish has a standard way to block ip addresses using the abuse_networks block in the private repo and also has many add hoc filters and ratelimits for specific endpoints or agents known to cause issues.

The multitude of options and places where data can be blocked can make it difficult for engineers and service owners to know:

  • how to add a block
  • how/where to check current blocks
  • which blocks apply to which services
  • where in the stack a block should be added

My personal view is that we should filter traffic behind the caches at the caching layer. If a service is not behind the caching layer then it is likely that it is because it is either a critical service where shared fait is to be avoided or the services is not production. In both of theses cases i think dropping the traffic at iptables without providing a message is sufficient and potentially desirable i.e. fail fast. If this is viewed as too aggressive then i think we should look to see if we can implement some type of standardise blocking logic in envoy as this seems to be the preferred TLS termination layer and would avoid us having to implement blocks in apache, nginx, tomcat etc. However we implement the block though i think the data about what, where and how something is being block should be centralised

This topic came up when discussion the blocking strategy for phabricator abuses. It seems that historically this has been handled in apache, i believe this was implemented before phabricator was behind the caching layer. however now that phabricator is behind the caching layer it can and does make use of a block list in varnish. That said the current block list in varnish is a rather large hammer as it blocks a user to everything behind the caching layer. So i wonder if we should start adding some more granularity to the block lists in varnish so that we could block a user to phabricator (or possibly all developer tools) without blocking access to all WMF resources.

Something elses to consider is how services owners can at the very least see the current block list ideally with some context as to why a user/agent/ip was blocked. potentially an easy way to add users to this list. I believe historically adding users to a block list would always require ops access so this would be additional behaviour; however at least in the phabricator instance viewing the block list with context was possible to to the fact that the block list is stored (potentially incorrectly) in the public puppet repo

I made some steps towards creating a more centralised structure to bocking earlier this year, creating a new hiera block in yaml called abuse_networks. This block is currently used by varnish and ferm and could also be fairly trivially be adapted for use with other daemons managed by puppet such as Apache or to add additional service specific block lists to varnish

The abuse_networks block is however far from perfect:

  • its not usable by anything that is not puppet
  • its not easy to expose the information (to NDA users) or more specifcly the context around specific entries (IP's)
  • requires commit access to the puppet private repo to add entries
  • its maintained in a flat yaml file which also maintains a bunch of important other stuff, all of which would be a pain to break during an incident (well anytime more more so ...)

Event Timeline

Change 651171 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] varnish: add phabricator specific ban in varnish

https://gerrit.wikimedia.org/r/651171

jbond updated the task description. (Show Details)
jbond added a subscriber: MoritzMuehlenhoff.

Change 651174 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] varnish: migrate abuse_nets acl to abuse_networks hiera block

https://gerrit.wikimedia.org/r/651174

Volans triaged this task as Medium priority.Dec 21 2020, 1:39 PM
jbond updated the task description. (Show Details)

Change 651174 merged by Jbond:
[operations/puppet@production] varnish: migrate abuse_nets acl to abuse_networks hiera block

https://gerrit.wikimedia.org/r/651174

Change 651171 merged by Jbond:
[operations/puppet@production] varnish: add phabricator specific ban in varnish

https://gerrit.wikimedia.org/r/651171

Thanks for the writeup with all the background! And for the cleanup patches so far :)

Just a few things to add.

Re: the schema of blocks themselves:

  • In addition to being able to specify a particular logical 'service' to which the block/limit applies, I think there should also be two fields (both probably mandatory):
    • a comment/note -- as part of the structured data rather than as a YAML comment. This means we could also display it on the config-master dump. It should probably be customary here to include a Phab task number here for long-standing blocks.
    • an expiration date. (I do not think we should automatically unblock when the expiration date passes; I think instead there should be an IRC alert or email or a ticket opened, so someone can go look.)

There also some open questions as to what it means to specify a particular "service"; for now I think having a few strings we support ('traffic' vs 'phab' vs 'gerrit' etc) might be okay?

In future work we'd want to formalize that list of "services" or "block points" more, and probably move the list out of private hiera into some other data store (which we could then put an SSO-protected webapp around?)

Change 674922 had a related patch set uploaded (by Ema; author: Ema):
[operations/puppet@production] varnish: add phabricator_abusers to traffic cloud project

https://gerrit.wikimedia.org/r/674922

Change 674922 merged by Ema:
[operations/puppet@production] varnish: add phabricator_abusers to traffic cloud project

https://gerrit.wikimedia.org/r/674922

Change 674923 had a related patch set uploaded (by Ema; author: Ema):
[operations/puppet@production] varnish: add text_abuse_nets to traffic cloud project

https://gerrit.wikimedia.org/r/674923

Change 674923 merged by Ema:
[operations/puppet@production] varnish: add text_abuse_nets to traffic cloud project

https://gerrit.wikimedia.org/r/674923

BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that aren't are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!

Change 756611 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:base::firewall: block abuse_networks by default

https://gerrit.wikimedia.org/r/756611

Change 756611 merged by Jbond:

[operations/puppet@production] P:base::firewall: block abuse_networks by default

https://gerrit.wikimedia.org/r/756611

Change 757516 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] C:network: add mx context to abuse_nets

https://gerrit.wikimedia.org/r/757516

Change 757517 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] O:mail::mx: Add mx specific block list

https://gerrit.wikimedia.org/r/757517

Change 757516 merged by Jbond:

[operations/puppet@production] C:network: add mx context to abuse_nets

https://gerrit.wikimedia.org/r/757516

just made the below comment on a tickety which i though may be usefull to capture here to give some context as to what we have today

The layout of the abuse_networks heira block is a bit of a mess TBH. specifically the abuse_networks is defined as

Hash[String, Network::Abuse_net]

The key in the above hash is not really that important (well it is in some cases, more on that later) and is just used for information. the Network::Abuse_net value is further defined as:

type Network::Abuse_net = Struct[{
  context  => Array[Network::Context],
  networks => Array[Stdlib::IP::Address],
  comment  => Optional[String[1]],
}]

With the context array, when paired with network::parse_abuse_nets[1], being the thing that's important. network::parse_abuse_nets essentially parses the abuse_networks block and returns only objects that include the requested context.

So for example in base::firewall[2] we ask for context ferm which will return only the 'blocked_nets' block as that is the only one tagged with ferm. however
varnish::common::vcl[3] will return all blocks as they are all tagged with the varnish context. Varnish then uses the hash keys (which i said where not important above) as specific ACL's in the vcl config which allows us to easily toggle other ACLs e.g. rate-limiting/blocking all cloud providers.

As such most of the lists tagged with varnish are not actively blocked all the time and are just there, as said above, to allows us to more easily toggle ACL's in various other parts of our code/infrastructure.

The one list that is actively blocked is the one named blocked_nets. this is actively blocked via the ferm firewall rules[2] which affects all hosts that have an external IP addresses, and it is also actively blocked via the varnish caches[4].

This list is used as a heavy hammer to block abuses globally, however up until recently the ferm rules where only applied on a select few hosts[5] and specifically where not applied to the MX hosts. this would allow a user that had been blocked globally to still send us an email to try and resolve any issue. however after the most recent change[5] this would no longer be possible. A ban in the abuse_nets list would mean that that IP was essentially balackholed and the user would need to email from a gmail account or something similar.

The intension of this CR is to slightly role back that decision and exclude the MX hosts from the abuse_nets ferm context rules and instead create a new unique list which will only be applied to 'mx' hosts. This new list is likley to be very small (currently one ip addresses) and should only include mail abusers.

By default it will only apply to mail hosts as such if you want to block an IP addresses:

  • globaly including the MX hosts: add the IP to blocked_nets and mx_blocked_nets
  • globaly excluding the MX hosts: add the IP to blocked_nets only
  • Only to the MX hosts: add the IP to mx_blocked_nets only

[1]https://github.com/wikimedia/puppet/blob/production/modules/network/functions/parse_abuse_nets.pp
[2]https://github.com/wikimedia/puppet/blob/production/modules/base/manifests/firewall.pp#L50
[3]https://github.com/wikimedia/puppet/blob/production/modules/varnish/manifests/common/vcl.pp#L29
[4]https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/wikimedia-frontend.vcl.erb#L664-L670
[5]https://gerrit.wikimedia.org/r/c/operations/puppet/+/756611

The intension of this CR is to slightly role back that decision and exclude the MX hosts from the abuse_nets ferm context rules and instead create a new unique list which will only be applied to 'mx' hosts. This new list is likley to be very small >(currently one ip addresses) and should only include mail abusers.

By default it will only apply to mail hosts as such if you want to block an IP addresses:

  • globaly including the MX hosts: add the IP to blocked_nets and mx_blocked_nets
  • globaly excluding the MX hosts: add the IP to blocked_nets only
  • Only to the MX hosts: add the IP to mx_blocked_nets only

I think we could also take the decision to no bother with this additional complexity and take the stance that if someone is added to abuse_networks then there abuse is such that they will be blocked from the mail infra as well and if there mail server's happen to be in the same (blocked) space, tough!

for information, we care currently blocking 4 IPv4 addresses form separate networks and one /24 network via the global abuse_nets list as such the impact is failry limited

Aklapper renamed this task from Create Generalised blocking stratagy to Create Generalised blocking strategy.Jan 29 2022, 12:28 AM

I think we could also take the decision to no bother with this additional complexity and take the stance that if someone is added to abuse_networks then there abuse is such that they will be blocked from the mail infra as well and if there mail server's happen to be in the same (blocked) space, tough!

I think there's some value to that. Even if the mail server is in the blocked network space, for all practical purposes anyone administrating a email server should also be able to send a mail from a personal address or quickly create a temporary, additional email address.

Thank you for all the work on this ticket and for creating it. I notice that this is a very broad topic and think it would be better if we close this and create smaller tickets with more focused scope.

think it would be better if we close this and create smaller tickets with more focused scope.

i don't think we need to close this ticket instead one can create child tasks for the individual pieces linked to this ticket

Change 757517 abandoned by Jbond:

[operations/puppet@production] O:mail::mx: Add mx specific block list

Reason:

see https://gerrit.wikimedia.org/r/c/operations/puppet/+/822135 for an example of how to do this now

https://gerrit.wikimedia.org/r/757517