
Look into limiting connection rate to WDQS per external IP
Closed, Resolved · Public

Description

As mentioned in the discussion in T90115, we may want to have nginx limit the connection rate per client IP. This depends on the external layers passing the connecting client's IP through to us.

See also http://nginx.org/en/docs/http/ngx_http_limit_conn_module.html and http://nginx.org/en/docs/http/ngx_http_realip_module.html
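Roughly, the two modules would be combined along these lines. This is a minimal sketch only; the zone name, zone size, trusted proxy range, backend address, and the limit value are placeholders, not the actual WDQS configuration:

```
http {
    # ngx_http_realip_module: take the real client IP from X-Forwarded-For,
    # but only trust the header when the connection comes from our own
    # front-end layer (placeholder address range below).
    set_real_ip_from 10.0.0.0/8;
    real_ip_header   X-Forwarded-For;

    # ngx_http_limit_conn_module: shared-memory zone keyed by the client IP
    # (which, thanks to realip above, is the external IP, not the proxy's).
    limit_conn_zone $binary_remote_addr zone=wdqs_per_ip:10m;

    server {
        listen 80;

        location / {
            # Cap concurrent connections per client IP; requests over the
            # cap get a 503. Placeholder value here; the starting limit
            # is discussed further down in this task.
            limit_conn wdqs_per_ip 10;
            proxy_pass http://127.0.0.1:9999;   # placeholder backend
        }
    }
}
```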

Event Timeline

Smalyshev raised the priority of this task to Medium.
Smalyshev updated the task description.
Smalyshev added subscribers: Smalyshev, Joe, csteipp.
Restricted Application added a subscriber: Aklapper.

To really work, this needs some sort of global per-IP state (and/or sharding); otherwise each IP may still use up a fraction of every back-end server. Other services use User::pingLimiter() (from MediaWiki core; it increments memcached keys) or https://wikitech.wikimedia.org/wiki/PoolCounter.

The nginx limit_conn module would in this case only limit connections to each back-end server separately.

I think it's premature for now to try to build a full-blown DoS-resistant system. I'm not even sure we need per-IP limits, but if we do, then unless we're dealing with a real DoS, per-backend per-IP protection would probably be enough. If we do get a real DoS problem, we'd probably want to solve it at the frontend level rather than for each backend separately.

Change 319010 had a related patch set uploaded (by Smalyshev):
Limit concurrent connections by client IP

https://gerrit.wikimedia.org/r/319010

I propose to start by enabling connection limiting with a fairly high limit and lowering it after a week, once we get some feedback.

I've bumped the limit to 5 concurrent connections; I think we can start with that.
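With a setup along the lines of the sketch in the description, that would amount to roughly the following (the zone name and backend are still placeholders, not what the actual patch uses):

```
location / {
    # At most 5 concurrent connections per client IP; additional requests
    # from the same IP get a 503 until one of the 5 connections finishes.
    limit_conn wdqs_per_ip 5;
    proxy_pass http://127.0.0.1:9999;   # placeholder backend
}
```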

Change 319010 merged by Gehel:
Limit concurrent connections by client IP

https://gerrit.wikimedia.org/r/319010

A quick analysis of the situation after 2 days of limiting connections:

  • around 2k connections were rate-limited over a 24h period (`grep limiting /var/log/nginx/error.log.1 | wc -l`), which is ~2% of the non-cached hits.
  • the 95th-percentile and median response times seem to have decreased (which is good) around the time of the deployment (see Graphite). However, the decrease appears to have started before the rate limiting was deployed, so it is probably not related.

Now that we have some data, it might make sense to do some more analysis to see which users we are blocking and whether that is an issue (at least no one has been screaming yet... but that's not much of a validation). @Smalyshev, @mpopov, do you have any idea how to approach this analysis?

@Gehel Can we check whether the limited connections come from the same IPs or from different ones? I.e., how many IPs are affected, and how many times did each of them get limited?

Smalyshev claimed this task.

I think this is done now. If there's more to do, let's create separate tickets.