Add rate limiter functionality to service-runner
Closed, ResolvedPublic
Assigned To

Authored By

	• GWicke
	Jan 28 2016, 6:38 PM

Description

Several services could benefit from a global rate limiter service. Similar to statsd and logging, this kind of service would be a good fit for service-runner.

@Pchelolo has previously investigated and tested redis-based rate limiter libraries. Their downsides are a) the complexity of hosting and maintaining another storage system, and b) a single point of failure.

An alternative is to use a DHT like Uber's ringpop for rate limiting. Their hyperbahn project actually implements a rate limiter on top of ringpop, so it might be possible to borrow most of the code.

The downside of DHTs like ringpop is that nodes running a particular service need to communicate, which means that at least one node IP needs to be communicated across the cluster. However, this requirement is not very different from redis.

Related Objects

Status	Assigned	Task
Open	None	T111534 Allow external users access to cxserver
Resolved	santhosh	T101398 cxserver: rate limiting
Resolved	• GWicke	T125123 Add rate limiter functionality to service-runner

Event Timeline

• GWicke created this task. Jan 28 2016, 6:38 PM

• GWicke raised the priority of this task to Medium.

• GWicke updated the task description.

• GWicke added a project: service-runner.

• GWicke added subscribers: • GWicke, • Pchelolo.

Restricted Application added a subscriber: Aklapper. Jan 28 2016, 6:38 PM

• GWicke mentioned this in T101398: cxserver: rate limiting. Jan 28 2016, 6:39 PM

• GWicke added a parent task: T101398: cxserver: rate limiting.

• GWicke added a project: Services-next.

• GWicke set Security to None.

santhosh subscribed. Jan 29 2016, 4:09 AM

• Nikerabbit subscribed. Jan 29 2016, 9:28 AM

• Pchelolo merged a task: T107934: Reliable and scaleable rate limiting mechanism for RESTBase API entry points. Feb 1 2016, 6:35 PM

• Pchelolo added subscribers: Krenair, Eevans, Joe, • mobrovac.

• mobrovac mentioned this in T125345: Many error 500 from pageviews API "Error in Cassandra table storage backend". Feb 2 2016, 5:31 PM

I looked a little bit into ringpop today. It only supports node <= 0.10 currently, and even when using that (via nvm) the examples don't actually work.

The main issue for Node 4.2 compatibility is a relatively long list of incompatible binary dependencies:

I think it's worth looking for alternatives before investing more time into ringpop.

https://github.com/kadtools/kad and related tools look promising. kad is a simple, extensible DHT framework with UDP transports and some layered functionality such as pub-sub. The examples work out of the box, and a custom decaying-counter key-value store looks rather simple to implement on top of it. However, we would need to invest some time to optimize it for throughput and latency.
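To make the "decaying counter" idea concrete, here is a minimal in-memory sketch. It assumes exponential decay with a fixed half-life, which is only one possible decay scheme and not necessarily the one intended here; the class name `DecayingCounterStore` and its methods are hypothetical and do not use the kad API.

```javascript
'use strict';

// Hypothetical decaying counter store (not part of kad): each key holds a
// value that decays exponentially with a configurable half-life.
class DecayingCounterStore {
    constructor(halfLifeMs) {
        this.halfLifeMs = halfLifeMs;
        this.counters = new Map(); // key -> { value, updatedAt }
    }

    // Return the decayed value of a key at time `now` (milliseconds).
    get(key, now = Date.now()) {
        const entry = this.counters.get(key);
        if (!entry) { return 0; }
        const elapsed = now - entry.updatedAt;
        return entry.value * Math.pow(0.5, elapsed / this.halfLifeMs);
    }

    // Apply decay, add `amount`, store the result and return it.
    increment(key, amount = 1, now = Date.now()) {
        const value = this.get(key, now) + amount;
        this.counters.set(key, { value, updatedAt: now });
        return value;
    }
}

// Timestamps are passed explicitly so the example is deterministic.
const store = new DecayingCounterStore(1000);
store.increment('client-a', 8, 0);
console.log(store.get('client-a', 1000)); // 4 (one half-life later)
console.log(store.get('client-a', 3000)); // 1 (three half-lives later)
```

Storing only a value and a timestamp per key keeps updates cheap, since decay is computed lazily on read rather than by a timer.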

A basic rate limiter is now available at https://github.com/gwicke/limitation. Performance is decent, with thousands of keys, 15 DHT nodes and ~100k req/s handled in a single process.

I'm proposing to integrate this into service-runner with the following strategy:

Set up a shared rate limiter instance in the master node, using the default port 3050 or a random fall-back port (already handled in kad-ratelimiter).
Have a copy of counters and blocks in each worker, and perform all checks / increments in-process.
Periodically push counters to the master process. The master process periodically synchronizes with the DHT and sends updated blocks back to the workers.
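The three steps above can be sketched roughly as follows. This is a single-process simulation under stated assumptions, not the service-runner or limitation implementation: `WorkerLimiter` and `MasterAggregator` are hypothetical names, the DHT exchange is left out, and in a real deployment the snapshots and block lists would travel between processes (for example via `process.send()`), not through direct method calls.

```javascript
'use strict';

// Hypothetical worker-side limiter: checks and increments are purely local.
class WorkerLimiter {
    constructor() {
        this.counters = new Map(); // key -> hits since the last flush
        this.blocks = new Set();   // keys currently over their limit
    }

    // Synchronous check: count the request, report whether the key is blocked.
    isAboveLimit(key) {
        this.counters.set(key, (this.counters.get(key) || 0) + 1);
        return this.blocks.has(key);
    }

    // Hand the accumulated counters to the master and start a new interval.
    flush() {
        const snapshot = this.counters;
        this.counters = new Map();
        return snapshot;
    }

    // Replace the local block list with the one computed by the master.
    setBlocks(keys) {
        this.blocks = new Set(keys);
    }
}

// Hypothetical master-side aggregator. A real one would also exchange
// counters with the DHT; this sketch only sums the worker counters.
class MasterAggregator {
    constructor(limits) {
        this.limits = limits; // Map: key -> max hits per interval
        this.totals = new Map();
    }

    merge(snapshot) {
        for (const [key, count] of snapshot) {
            this.totals.set(key, (this.totals.get(key) || 0) + count);
        }
    }

    // Compute the blocked keys for the finished interval and reset totals.
    computeBlocks() {
        const blocked = [];
        for (const [key, total] of this.totals) {
            if (this.limits.has(key) && total > this.limits.get(key)) {
                blocked.push(key);
            }
        }
        this.totals = new Map();
        return blocked;
    }
}

const workers = [new WorkerLimiter(), new WorkerLimiter()];
const master = new MasterAggregator(new Map([['client-a', 5]]));

// Each worker sees 3 requests; neither blocks yet (no block list received).
for (const w of workers) {
    for (let i = 0; i < 3; i++) { w.isAboveLimit('client-a'); }
}

// One sync round: workers push counters, the master sends blocks back.
workers.forEach((w) => master.merge(w.flush()));
const blocks = master.computeBlocks();
workers.forEach((w) => w.setBlocks(blocks));

console.log(workers[0].isAboveLimit('client-a')); // true: 6 hits > limit of 5
```

The trade-off visible in the sketch is that a key is only blocked after the next sync round, so limits are enforced with a delay of up to one interval in exchange for checks that never leave the worker process.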

Connecting only the master process to the DHT avoids an excessive number of DHT nodes, and side-steps the issues around cluster. Based on testing, stability and load should not be an issue.

PR available at https://github.com/wikimedia/service-runner/pull/89, plus a small one for hyperswitch: https://github.com/wikimedia/hyperswitch/pull/12 to forward the ratelimiter property.

• GWicke claimed this task. Feb 24 2016, 5:35 AM

• GWicke removed a project: Services-next.

Both PRs have now been merged.

Additionally, a hyperswitch request filter was added in https://github.com/wikimedia/hyperswitch/pull/20. We plan to test this in log-only mode next week, after which we will start to use this to rate-limit specific entry points in RESTBase.

I am resolving this task. Further improvements can be handled in follow-up tasks.

@Pchelolo found this useful listing of rate-limiting-related headers on Stack Overflow. The "-remaining" headers in particular make it relatively easy for clients to pace themselves.
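As an illustration of how a "-remaining" header lets a client pace itself, here is a small sketch. The header names follow the widely used `X-RateLimit-*` convention, which is an assumption here and not something service-runner or hyperswitch is known to emit; `rateLimitHeaders` and `suggestedDelayMs` are hypothetical helpers.

```javascript
'use strict';

// Server side: build headers from a limit, the current count and a reset
// time. The X-RateLimit-* names are an assumed convention, not a standard.
function rateLimitHeaders(limit, used, resetEpochSeconds) {
    return {
        'X-RateLimit-Limit': String(limit),
        'X-RateLimit-Remaining': String(Math.max(0, limit - used)),
        'X-RateLimit-Reset': String(resetEpochSeconds)
    };
}

// Client side: spread the remaining requests evenly over the time left.
function suggestedDelayMs(headers, nowEpochSeconds) {
    const remaining = Number(headers['X-RateLimit-Remaining']);
    const secondsLeft = Math.max(
        0, Number(headers['X-RateLimit-Reset']) - nowEpochSeconds);
    if (remaining <= 0) { return secondsLeft * 1000; } // wait for the reset
    return Math.floor((secondsLeft * 1000) / remaining);
}

const headers = rateLimitHeaders(100, 60, 1700000060);
console.log(headers['X-RateLimit-Remaining']);      // prints 40
console.log(suggestedDelayMs(headers, 1700000000)); // prints 1500
```

With 40 requests remaining and 60 seconds until the reset, the client would wait 1500 ms between requests.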

However, implementing these without introducing synchronous check overheads looks a bit tricky. While each limitation instance has a full set of current counter values available for all keys it has received requests for, there are challenges from

the delay inherent in asynchronous counter exchanges, and
incomplete counter sets for very low-volume entry points.

It might be possible to still solve this with heuristics, but this will need some more thinking.
