Page MenuHomePhabricator

Add rate limiter functionality to service-runner
Closed, ResolvedPublic

Description

Several services could benefit from a global rate limiter service. Similar to statsd an logging, this kind of service would be a good fit in service-runner.

In the past, @Pchelolo has investigated & tested redis-based rate limiter libraries. The downside of those is a) complexity of needing to host & maintain another storage system, and b) a single point of failure.

An alternative is to use a DHT like Uber's ringpop for rate limiting. Their hyperbahn project actually implements a rate limiter on top of ringpop, so it might be possible to borrow most of the code.

The downside of DHTs like ringpop is that nodes running a particular service need to communicate, which means that at least one node IP needs to be communicated across the cluster. However, this requirement is not very different from redis.

Event Timeline

GWicke raised the priority of this task from to Medium.
GWicke updated the task description. (Show Details)
GWicke added a project: service-runner.
GWicke added subscribers: GWicke, Pchelolo.

I looked a little bit into ringpop today. It only supports node <= 0.10 currently, and even when using that (via nvm) the examples don't actually work.

The main issue for 4.2 compatibility is a relatively long list of incompatible binary dependencies:

I think it's worth looking for alternatives before investing more time into ringpop.

https://github.com/kadtools/kad and related tools look promising. This is a simple extensible DHT framework, with UDP transports and some layered functionality like the pub-sub. The examples work out of the box, and a custom decaying counter key-value store looks rather simple to implement based on this. But, we'd need to invest a bit of time to optimize this for throughput and latency.

A basic rate limiter is now available at https://github.com/gwicke/limitation. Performance is decent, with thousands of keys, 15 DHT nodes and ~100k req/s handled in a single process.

I'm proposing to integrate this into service-runner with the following strategy:

  • Set up a shared rate limiter instance in the master node, using the default port 3050 or a random fall-back port (already handled in kad-ratelimiter).
  • Have a copy of counters and blocks in each worker, and perform all checks / increments in-process.
  • Periodically, push counters to the master process. The master process periodically updates with DHT & sends updated blocks back to workers.

Connecting only the master process to the DHT avoids an excessive number of DHT nodes, and side-steps the issues around cluster. Based on testing, stability and load should not be an issue.

PR available at https://github.com/wikimedia/service-runner/pull/89, plus a small one for hyperswitch: https://github.com/wikimedia/hyperswitch/pull/12 to forward the ratelimiter property.

Both PRs have now been merged.

Additionally, a hyperswitch request filter was added in https://github.com/wikimedia/hyperswitch/pull/20. We plan to test this in log-only mode next week, after which we will start to use this to rate-limit specific entry points in RESTBase.

I am resolving this tasks. Further improvements can be handled in follow-up tasks.

@Pchelolo found this useful listing of rate limiting related headers on stackoverflow. The "-remaining" headers in particular make it relatively easy for clients to pace themselves.

However, implementing these without introducing synchronous check overheads looks a bit tricky. While each limitation instance has a full set of current counter values available for all keys it has received requests for, there are challenges from

  • the delay inherent in asynchronous counter exchanges, and
  • incomplete counter sets for very low-volume entry points.

It might be possible to still solve this with heuristics, but this will need some more thinking.