We should try to centralise and standardise how we block, rate-limit and filter traffic across our infrastructure.
We currently have blocking implementations using:
- CDN provider
As well as having multiple implementations for blocking, we have multiple locations where the block lists live. For Apache, the lists are mostly scattered around the public and private repos. The iptables rules are mostly managed by the ferm::rule Puppet resource and can mostly be queried using PuppetDB. Varnish has a standard way to block IP addresses using the abuse_networks block in the private repo, and also has many ad hoc filters and rate limits for specific endpoints or agents known to cause issues.
The multitude of options and places where traffic can be blocked can make it difficult for engineers and service owners to know:
- how to add a block
- how/where to check current blocks
- which blocks apply to which services
- where in the stack a block should be added
My personal view is that traffic for services behind the caches should be filtered at the caching layer. If a service is not behind the caching layer, that is most likely because it is either a critical service where shared fate is to be avoided, or a service that is not production. In both of these cases I think dropping the traffic at iptables without providing a message is sufficient and potentially desirable, i.e. fail fast. If this is viewed as too aggressive, then I think we should see whether we can implement some type of standardised blocking logic in Envoy, as this seems to be the preferred TLS termination layer and would avoid us having to implement blocks in Apache, nginx, Tomcat etc. However we implement the block, I think the data about what, where and how something is being blocked should be centralised.
This topic came up while discussing the blocking strategy for Phabricator abuse. It seems that historically this has been handled in Apache; I believe this was implemented before Phabricator was behind the caching layer. Now that Phabricator is behind the caching layer, it can and does make use of a block list in Varnish. That said, the current block list in Varnish is a rather large hammer, as it blocks a user from everything behind the caching layer. So I wonder if we should start adding more granularity to the block lists in Varnish, so that we could block a user from Phabricator (or possibly all developer tools) without blocking access to all WMF resources.
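To make the idea of more granular block lists concrete, here is a minimal sketch in Python of how a scoped lookup could behave. It is not how Varnish or abuse_networks work today: the data layout, the scope names ("all", "phabricator", "developer-tools"), the placeholder service "example-dev-tool" and the function `is_blocked` are all assumptions made up for illustration, and the addresses are documentation ranges.

```python
import ipaddress

# Hypothetical scoped block list: each network maps to the scopes it is
# blocked from. "all" stands for everything behind the caching layer.
SCOPED_BLOCKS = {
    "192.0.2.0/24": {"all"},
    "198.51.100.0/24": {"phabricator"},
    "2001:db8::/32": {"developer-tools"},
}

# Hypothetical mapping from a service to the scopes that apply to it.
SERVICE_SCOPES = {
    "phabricator": {"phabricator", "developer-tools"},
    "example-dev-tool": {"example-dev-tool", "developer-tools"},
    "wiki": {"wiki"},
}


def is_blocked(client_ip: str, service: str) -> bool:
    """Return True if client_ip is blocked from the given service."""
    addr = ipaddress.ip_address(client_ip)
    # A block scoped to "all" applies to every service.
    scopes = SERVICE_SCOPES.get(service, {service}) | {"all"}
    for cidr, blocked_scopes in SCOPED_BLOCKS.items():
        network = ipaddress.ip_network(cidr)
        # Only compare addresses and networks of the same IP version.
        if addr.version != network.version:
            continue
        if addr in network and blocked_scopes & scopes:
            return True
    return False
```

With this layout, an address in 198.51.100.0/24 would be refused by Phabricator but could still reach the wikis, while an entry scoped to "all" keeps the current large-hammer behaviour.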
Something else to consider is how service owners can, at the very least, see the current block list, ideally with some context as to why a user/agent/IP was blocked, and potentially have an easy way to add users to this list. I believe that historically adding users to a block list has always required ops access, so this would be new behaviour; however, at least in the Phabricator case, viewing the block list with context was possible due to the fact that the block list is stored (potentially incorrectly) in the public puppet repo.
I made some steps towards creating a more centralised structure for blocking earlier this year, creating a new hiera block in YAML called abuse_networks. This block is currently used by Varnish and ferm, and could fairly trivially be adapted for use with other daemons managed by Puppet, such as Apache, or to add additional service-specific block lists to Varnish.
The abuse_networks block is, however, far from perfect:
- it's not usable by anything that is not Puppet
- it's not easy to expose the information (to NDA users), or more specifically the context around specific entries (IPs)
- it requires commit access to the puppet private repo to add entries
- it's maintained in a flat YAML file which also holds a bunch of other important stuff, all of which would be a pain to break during an incident (well, at any time, but more so then ...)
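As a rough illustration of what a centralised record addressing these points might carry, the sketch below models a block entry with its context and exports the list as JSON so that tools other than Puppet could read it. Everything here is hypothetical: the `BlockEntry` fields, the `scope` values and both helper functions are assumptions for discussion, not an existing format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BlockEntry:
    """Hypothetical block list entry that keeps context next to the network."""
    network: str      # CIDR being blocked
    reason: str       # why the block was added
    added_by: str     # who added it
    added_on: str     # ISO 8601 date
    scope: str = "all"  # "all" or a single service name


def export_blocklist(entries):
    """Serialise entries to JSON, sorted by network for stable output."""
    ordered = sorted(entries, key=lambda e: e.network)
    return json.dumps([asdict(e) for e in ordered], indent=2)


def entries_for_scope(entries, scope):
    """Entries a service owner would see: their own scope plus global blocks."""
    return [e for e in entries if e.scope in (scope, "all")]


# Example data using documentation address ranges.
EXAMPLE_ENTRIES = [
    BlockEntry("198.51.100.0/24", "scraping", "alice", "2021-01-01", "phabricator"),
    BlockEntry("192.0.2.0/24", "abusive crawler", "bob", "2021-02-01"),
]
```

Keeping the reason and author on each entry is what would let service owners see why something was blocked without needing access to the private repo; where such data would actually be stored and who could write to it remain open questions.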