
Implement Varnish-level rough ratelimiting
Closed, Resolved · Public

Description

We've tried this before using libvmod-tbf during the Varnish 3 era, and ended up backing it out over some pragmatic issues. The key problems were:

  1. VCL reloads (which are routine and frequent) were leaking significant memory (a general problem, but in the libvmod_tbf case specifically the leaked allocations were the large TBF datasets)
  2. The TBF algorithm was structured such that the initial state on server start was "all buckets full" (no burst capacity), as opposed to assuming (at server start, or when first seeing a new IP) that buckets start out empty (in the virtual sense)

We need to resurrect this effort and get it deployed without such issues so that we can build other solutions on it.

Event Timeline

Restricted Application added a subscriber: Aklapper.

We should evaluate vmod_vsthrottle, available in the varnish-modules package, and see whether it's affected by the issues mentioned in this ticket (memory leaks, no burst capacity).

re: vsthrottle, my thoughts after a quick look this morning:

  1. The burst issue seems fine. It initializes fresh buckets with full capacity.
  2. Memory leaks: it seems to have a proper fini function that looks right. If it still leaks memory, that's because Varnish leaks vmod memory in general when it shouldn't, so that problem's out of scope here regardless (if we do see it, we can raise it as a Varnish-level issue and potentially block on that).
  3. Lock contention: vsthrottle looks a little heavy on the locking. Notably, it partitions the keyspace into 16 chunks to split up the mutex contention, and that chunk count is a compile-time constant. I think we can live with this, so long as we use it only for the case that really matters: ratelimiting FE cache misses+passes rather than hits. Worst case, we might need to tweak the compile-time constant to use more chunks.

Beyond all that though, I think the interface/functionality is a bit deficient. Going over some of those points:

  1. A bucket has only two parameters in this TBF implementation: period and limit, but no burst. Off the top of my head, I think we can still configure all the same scenarios as a ratelimiter with a separate burst parameter, but we might have to stretch the period to match the burst: e.g. instead of saying "50/s + 500 burst", we'd have to state it here as "500/10s". I should probably stare at the iptables implementation again and see whether there's a real difference we can't configure our way around (e.g. does iptables' 50/s+500b still expire the entries after one second of inactivity, or after ten? that's the sort of efficiency difference we might see here).
  2. The period and limit are specified on every call and are part of the bucket's key. So if we want to use the same bucket from multiple places, we have to be careful to keep the parameters in sync, or perhaps use a subroutine for a shared bucket (a DRY issue, but we have lots of those in VCL anyway).
  3. There's no ability to consume multiple tokens at once. Every is_denied() call checks the limit and consumes exactly one token; there's no way to ask it to consume, say, 10 tokens from the same shared bucket for certain requests because they're expensive.
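To make points 2 and 3 concrete, here's a rough VCL sketch of what using vsthrottle could look like (the subroutine name and the 600/60s numbers are illustrative, and the signature is assumed to be is_denied(key, limit, period) as in varnish-modules):

```
import vsthrottle;

# Hypothetical shared bucket: since limit and period are part of the
# bucket's key, wrapping the call in one subroutine keeps the parameters
# in sync across every call site that should share the same bucket.
# Note there is no way to charge more than one token per call.
sub rate_limit_client {
    if (vsthrottle.is_denied("rl:" + client.ip, 600, 60s)) {
        return (synth(429, "Too Many Requests"));
    }
}
```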

Stared at the vsthrottle hashtable implementation some more, as well as the Linux iptables hashlimit one (which I consider a sort of canonical baseline for an efficient implementation). The Linux one is definitely more configurable, but I think we can work with what vsthrottle gives us in terms of configurability of the bucket itself. The lack of a cost parameter is still an issue, but it's a long-term one, relevant if/when we get to the point of defining a header that applications can send in their responses to indicate heavier costs.
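For instance, the burst mapping discussed above might be expressed in vsthrottle terms roughly like this (hypothetical key and numbers):

```
# Approximating iptables-style "50/s with a burst of 500": widen the
# window so that limit/period keeps the same sustained rate, while the
# window size itself provides the burst headroom.
if (vsthrottle.is_denied("api:" + client.ip, 500, 10s)) {
    # 500 tokens per 10s window ~= 50/s sustained, up to 500 at once
    return (synth(429, "Too Many Requests"));
}
```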

So, let's put up some commits and see what using it would look like? Ideally just for miss/pass (e.g. from vcl_backend_fetch), and able to return a 429 (which is tricky from there, as return (abandon) results in a synth with a pre-set 503 status).
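As a sketch of the 429 side (the actually deployed VCL lives in the gerrit changes below; this just shows the simpler vcl_recv placement, where synth() can carry an arbitrary status, side-stepping the pre-set-503 problem with return (abandon) on the backend side):

```
sub vcl_recv {
    if (vsthrottle.is_denied("rl:" + client.ip, 600, 60s)) {
        return (synth(429, "Too Many Requests"));
    }
}

sub vcl_synth {
    if (resp.status == 429) {
        # Hint to well-behaved clients when it's safe to retry (seconds).
        set resp.http.Retry-After = "60";
    }
}
```

The trade-off versus the vcl_backend_fetch placement is that this counts hits as well as misses/passes, so the limits would need to be sized accordingly.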

Change 357995 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: rate limit wikiScrape with vsthrottle

https://gerrit.wikimedia.org/r/357995

Change 357995 merged by Ema:
[operations/puppet@production] VCL: rate limit wikiScrape with vsthrottle

https://gerrit.wikimedia.org/r/357995

Change 358583 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: rate limit API requests

https://gerrit.wikimedia.org/r/358583

Change 358583 merged by Ema:
[operations/puppet@production] VCL: rate limit text-frontend requests

https://gerrit.wikimedia.org/r/358583

Change 358774 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: increase rate limit to 600/60s

https://gerrit.wikimedia.org/r/358774

Change 358774 merged by Ema:
[operations/puppet@production] VCL: increase rate limit to 600/60s

https://gerrit.wikimedia.org/r/358774

Change 358779 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: do not rate limit requests from IPs in wikimedia_trust

https://gerrit.wikimedia.org/r/358779

Change 358779 merged by Ema:
[operations/puppet@production] VCL: do not rate limit requests from IPs in wikimedia_trust

https://gerrit.wikimedia.org/r/358779

Change 358860 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] VCL: raise ratelimit for RB, exclude labs from limits

https://gerrit.wikimedia.org/r/358860

Change 358860 merged by BBlack:
[operations/puppet@production] VCL: raise ratelimit for RB, exclude labs from limits

https://gerrit.wikimedia.org/r/358860

Change 358965 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: add Retry-After header to 429 responses

https://gerrit.wikimedia.org/r/358965

The fact that no requests have been throttled in PageviewAPI of late (see the 429 graph below) kind of tells me that PageviewAPI has been receiving too few "lawful" requests of late, though. Are we excluding restbase services from this rate limiting? So far it seems to have been eating lawful traffic.

https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=1495950536774&to=1496747990508&panelId=7&fullscreen

Change 358965 merged by Ema:
[operations/puppet@production] VCL: add Retry-After header to 429 responses

https://gerrit.wikimedia.org/r/358965

Change 359137 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: same rate-limit for api.php and restbase

https://gerrit.wikimedia.org/r/359137

Change 359137 merged by Ema:
[operations/puppet@production] VCL: same rate-limit for api.php and restbase

https://gerrit.wikimedia.org/r/359137

Change 359231 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] ratelimits: double the default anon limit

https://gerrit.wikimedia.org/r/359231

Change 359231 merged by BBlack:
[operations/puppet@production] ratelimits: double the default anon limit

https://gerrit.wikimedia.org/r/359231

Change 359401 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: apply API rate limits to wikidata too

https://gerrit.wikimedia.org/r/359401

Change 359401 merged by Ema:
[operations/puppet@production] VCL: apply API rate limits to wikidata too

https://gerrit.wikimedia.org/r/359401

ema claimed this task.

We've been using vsthrottle in prod for a while now, closing.