
Look into limiting connection rate to WDQS per external IP
Closed, Resolved · Public

Description

As mentioned in the discussion in T90115, we may want to have nginx limit the connection rate per client IP. This depends on the external layers passing the connecting client's IP through to us.

See also http://nginx.org/en/docs/http/ngx_http_limit_conn_module.html and http://nginx.org/en/docs/http/ngx_http_realip_module.html
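Roughly, the two modules would be combined along these lines. This is a minimal sketch only; the zone name, zone size, trusted proxy range, backend address, and the limit value are placeholders, not the actual WDQS configuration:

```
http {
    # ngx_http_realip_module: take the real client IP from X-Forwarded-For,
    # but only trust the header when the connection comes from our own
    # front-end layer (placeholder address range below).
    set_real_ip_from 10.0.0.0/8;
    real_ip_header   X-Forwarded-For;

    # ngx_http_limit_conn_module: shared-memory zone keyed by the client IP
    # (which, thanks to realip above, is the external IP, not the proxy's).
    limit_conn_zone $binary_remote_addr zone=wdqs_per_ip:10m;

    server {
        listen 80;

        location / {
            # Cap concurrent connections per client IP; requests over the
            # cap get a 503. Placeholder value here; the starting limit
            # is discussed further down in this task.
            limit_conn wdqs_per_ip 10;
            proxy_pass http://127.0.0.1:9999;   # placeholder backend
        }
    }
}
```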

Event Timeline

Smalyshev raised the priority of this task to Medium.
Smalyshev updated the task description.
Smalyshev added subscribers: Smalyshev, Joe, csteipp.
Restricted Application added a subscriber: Aklapper.

To really work, this needs some sort of global per-IP state (and/or sharding); otherwise each IP may still use up a fraction of every back-end server. Other services use User::pingLimiter() (from MediaWiki core; it increments memcached keys) or https://wikitech.wikimedia.org/wiki/PoolCounter.

The nginx limit_conn module would in this case only limit connections to each back-end server separately.

I think it's premature for now to try to build a full-blown DoS-resistant system. I'm not even sure we need per-IP limits, but if we do, then unless we're dealing with a real DoS, per-backend per-IP protection would probably be enough. If we do get a real DoS problem, we'd probably want to solve it at the frontend level rather than for each backend separately.

Change 319010 had a related patch set uploaded (by Smalyshev):
Limit concurrent connections by client IP

https://gerrit.wikimedia.org/r/319010

I propose to start by enabling connection limiting with a fairly high limit and lowering it after a week, once we get some feedback.

I've bumped the limit to 5 concurrent connections; I think we can start with that.
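With a setup along the lines of the sketch in the description, that would amount to roughly the following (the zone name and backend are still placeholders, not what the actual patch uses):

```
location / {
    # At most 5 concurrent connections per client IP; additional requests
    # from the same IP get a 503 until one of the 5 connections finishes.
    limit_conn wdqs_per_ip 5;
    proxy_pass http://127.0.0.1:9999;   # placeholder backend
}
```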

Change 319010 merged by Gehel:
Limit concurrent connections by client IP

https://gerrit.wikimedia.org/r/319010

A quick analysis of the situation after 2 days of limiting connections:

  • around 2k connections were rate-limited over a 24h period (`grep limiting /var/log/nginx/error.log.1 | wc -l`), which is ~2% of the non-cached hits.
  • the 95th-percentile and median response times seem to have decreased (which is good) around the time of the deployment (see Graphite). However, the decrease appears to have started before the rate limiting was deployed, so it is probably not related.

Now that we have some data, it might make sense to do some more analysis to see which users we are blocking and whether that is an issue (at least no one has been screaming yet... but that's not much of a validation). @Smalyshev, @mpopov, do you have any idea how to approach this analysis?

@Gehel Can we check whether the limited connections come from the same IPs or from different ones? I.e., how many IPs are affected, and how many times did each of them get limited?

Smalyshev claimed this task.

I think this is done now. If there's more to do, let's create separate tickets.