As mentioned in the discussion in T90115, we may want to make nginx limit the connection rate per client IP. This depends on external layers passing us the connecting IP.
To really work, this needs some sort of global state per IP (and/or sharding); otherwise each IP can still consume a fraction of each backend server. Other systems use User::pingLimiter() (from MediaWiki core, which increments memcached keys) or https://wikitech.wikimedia.org/wiki/PoolCounter .
The nginx limit_conn module would in this case only limit connections to each backend server separately.
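For reference, a minimal sketch of what per-backend limiting with limit_conn looks like. The zone name (`addr`), the limits, and the use of X-Forwarded-For via the realip module are illustrative assumptions, not our actual config; in particular, trusting X-Forwarded-For only makes sense if the frontend layer in front of us always sets it.

```nginx
http {
    # Assumption: the frontend layer passes the client IP in X-Forwarded-For
    # and connects from 10.0.0.0/8; adjust set_real_ip_from to the real range.
    set_real_ip_from 10.0.0.0/8;
    real_ip_header   X-Forwarded-For;

    # Track concurrent connections per client IP in 10 MB of shared memory.
    # Note: this shared memory is per nginx instance, i.e. per backend --
    # there is no global cross-backend state, which is the limitation
    # discussed above.
    limit_conn_zone $binary_remote_addr zone=addr:10m;

    server {
        location / {
            # Illustrative value: allow at most 10 concurrent connections
            # per client IP to this backend.
            limit_conn addr 10;
        }
    }
}
```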
I think for now it's premature to try to build a full-blown DoS-resistant system. I'm not even sure we need per-IP limits, but if we do, then unless we're dealing with a real DoS, per-backend IP protection would probably be enough. If we get a real DoS problem, we'd probably want to solve it at the frontend level rather than for each backend separately.
A quick analysis of the situation after 2 days of limiting connections:
- around 2k connections were rate limited over a 24h period (grep limiting /var/log/nginx/error.log.1 | wc -l), which is ~2% of the non-cached hits.
- the 95th percentile and median response times seem to have decreased (which is good) around the time of the deployment (see Graphite). However, the decrease appears to have started before the rate limiting was deployed, so it's probably not related.
Now that we have some data, it might make sense to do some more analysis to see which users we are blocking and whether that is an issue (at least no one has been screaming yet... but that's not sufficient validation). @Smalyshev, @mpopov, do you have any idea how to approach this analysis?
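One cheap starting point, building on the grep above: pull the client IPs out of the "limiting" lines in the error log and count them, to see whether the blocks concentrate on a few heavy clients or spread across many users. This is a hedged sketch: it assumes the stock nginx error-log wording ("limiting connections by zone ..., client: ..."), and it runs against an inline sample line here rather than the real log; in practice you'd replace the printf with `grep limiting /var/log/nginx/error.log.1`.

```shell
# Hypothetical sample line in the default nginx limit_conn error format;
# the IP and server name are made up for illustration.
sample='2015/06/01 12:00:00 [error] 123#0: *45 limiting connections by zone "addr", client:, server: example.org, request: "GET / HTTP/1.1"'

printf '%s\n' "$sample" \
  | grep 'limiting' \
  | sed -n 's/.*client: \([0-9a-fA-F.:]*\).*/\1/p' \
  | sort | uniq -c | sort -rn
```

From there, the top IPs could be cross-checked against known bots or internal tools before deciding whether the limit is hurting real users.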