
Reliable and scalable rate limiting mechanism for RESTBase API entry points
Closed, Duplicate · Public

Description

RESTBase API entry points are currently not rate-limited in any way. While generally exposing only cheap entry points is probably the best defense against (accidental) DoS attacks, we will have some entry points or backend services that are expensive enough to warrant rate limiting. For per-entry-point configuration, it might make sense to support specifying limits in the swagger security stanza of each entry point.
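
Purely as a hypothetical sketch of what such a per-entry-point declaration could look like (none of these field names exist today, and Swagger's security stanza would likely need a vendor extension to carry them):

```yaml
# Hypothetical only: sketches the idea, not an existing RESTBase/Swagger feature.
paths:
  /page/html/{title}:
    get:
      security:
        - rate_limit:
            # Made-up fields carrying the limit for this entry point:
            requests_per_second: 10
            burst: 20
```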

One issue with rate limiting is identifying particular clients. IPs are a relatively poor option, as a large number of users (think schools or universities) can share a single NAT setup; IP-based limits can't be very tight for this reason. Better client identification is available for authenticated requests. However, the processing of authenticated requests is currently not very efficient (see RFC).
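
For illustration, a limiter key could be derived roughly like this (a minimal sketch; the header and field names are assumptions, not what RESTBase actually exposes):

```js
// Sketch only: pick the best available identity for a request.
function limiterKey(req) {
    // Authenticated requests give us a stable per-user identity.
    if (req.user && req.user.id) {
        return 'user:' + req.user.id;
    }
    // Fall back to the client IP; necessarily coarse, since many
    // users (e.g. a whole campus behind one NAT) may share an address.
    var ip = req.headers['x-client-ip'] || req.socket.remoteAddress;
    return 'ip:' + ip;
}
```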

A separate concern is the reliability and performance of the rate limiting solution itself. It should ideally impose a minimal CPU time and latency overhead on the regular request processing. Additionally, it should scale well to a very large number of requests, and should fail gracefully without breaking ongoing requests.

Candidate backends

See also

Event Timeline

GWicke raised the priority of this task to Needs Triage.
GWicke updated the task description.
GWicke subscribed.
GWicke triaged this task as Medium priority. · Aug 4 2015, 8:23 PM
GWicke updated the task description.
GWicke added a project: RESTBase.
GWicke set Security to None.
GWicke updated the task description.
GWicke edited subscribers, added: mobrovac, Pchelolo; removed: Aklapper.

From general research it seems Redis-backed solutions are the best, so I propose to use one. I've compared the rate-limiting libs available in Node:

  1. https://www.npmjs.com/package/ratelimiter - this one seems to have the most adoption, but it doesn't survive Redis failures well: it starts printing unnecessary messages to the console and doesn't seem to be stable when Redis fails; I'm seeing randomly hanging requests. So I'm excluding it from the further analysis, as stability is the #1 requirement. It's also slower than the next one.
  2. https://www.npmjs.com/package/redis-rate-limiter - the second by adoption; it survives Redis failure really well, but performance degrades a bit while Redis is down.
  3. https://www.npmjs.com/package/rolling-rate-limiter - the least adopted lib, but it has some nice features: a sliding window, so we can't get a burst of 1000 requests at second 0.99 followed by 1000 more at 1.01, and it also supports a minimum delay between consecutive requests (see the sketch after this list).
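
To illustrate why the sliding window in (3) matters: with a fixed one-second window, a client can send a full quota just before the boundary and another full quota just after it. A minimal in-memory sketch of the sliding-window idea (this is not the library's actual API; everything here is illustrative):

```js
// Illustrative sliding-window limiter: allow at most `max` requests in any
// `intervalMs`-long window ending now, regardless of window boundaries.
function SlidingWindowLimiter(max, intervalMs) {
    var hits = {};  // key -> array of request timestamps (ms)

    return function isAllowed(key) {
        var now = Date.now();
        var recent = (hits[key] || []).filter(function (t) {
            return now - t < intervalMs;   // drop timestamps outside the window
        });
        if (recent.length >= max) {
            hits[key] = recent;
            return false;                  // over the limit for this client
        }
        recent.push(now);
        hits[key] = recent;
        return true;
    };
}

// Usage: at most 10 requests per rolling second for a given client key.
var allow = SlidingWindowLimiter(10, 1000);
if (!allow('ip:192.0.2.1')) { /* respond with 429 */ }
```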

Both remaining libs support custom user identification and multiple limiters on a single Redis instance, and both provide a similar Redis failure-handling API. For additional reliability we could wrap the Redis connection in a circuit breaker (like https://github.com/ryanfitz/node-circuitbreaker) to avoid round-trips to a dead Redis on every request.
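
The circuit-breaker idea in a nutshell (a hand-rolled sketch, not node-circuitbreaker's actual API): after a few consecutive failures, stop calling Redis for a cool-down period and fail open, i.e. let requests through unlimited rather than adding a Redis timeout to each of them.

```js
// Illustrative circuit breaker around an async limiter check.
function breaker(check, maxFailures, resetMs) {
    var failures = 0;
    var openedAt = 0;

    return function (key, cb) {
        // Circuit open: skip Redis entirely and fail open (allow the request).
        if (failures >= maxFailures && Date.now() - openedAt < resetMs) {
            return cb(null, true);
        }
        check(key, function (err, allowed) {
            if (err) {
                failures += 1;
                openedAt = Date.now();
                return cb(null, true);   // fail open on Redis errors too
            }
            failures = 0;                // a success closes the circuit again
            cb(null, allowed);
        });
    };
}
```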

For a performance comparison I've integrated the libs with RESTBase and benchmarked them with a local Redis running and with Redis dead. Overall there is no significant performance impact; here are the results for 100K requests backed by SQLite:

|                      | working Redis, req/s | dead Redis, req/s |
| -------------------- | -------------------- | ----------------- |
| plain RESTBase       | 416                  | 416               |
| redis-rate-limiter   | 410.5                | 395               |
| rolling-rate-limiter | 377                  | 395               |

The case with a live Redis is by far the most common, so the working-Redis column is the one that matters most. To sum up, the question is whether we want speed (redis-rate-limiter) or more features (rolling-rate-limiter).

One more piece of data: I've implemented a draft version of the limiter and turned on limiting for every request (including internal requests). This has a huge performance penalty: from 448 req/s to 369 req/s, which makes requests roughly 21.5% slower (448/369 ≈ 1.21) and is definitely not acceptable. If only one request in a chain is limited, we get about 421 req/s, which is better (~6.5% slower), but there is still room for improvement.

The data above is for a localhost Redis install; if we have a shared Redis server somewhere on the network, the impact of limiting could grow significantly.

@Pchelolo have you considered doing rate limiting per RESTBase instance, thus eliminating the need for an external Redis server? Since we distribute requests with weighted round-robin, each server of the same size receives roughly the same number of requests, so the per-instance limit would just be the global limit divided by the number of instances (e.g. a global limit of 1000 req/s across 10 equally-weighted instances becomes 100 req/s per instance).

@Joe I'm investigating in-memory limiting options now and will have some results soon. The same lib I used for Redis could be used as an in-memory solution, but there could be better libs around.

Some more data: I've investigated in-memory options and found two that suit our needs:

  1. https://www.npmjs.com/package/rolling-rate-limiter - benchmark shows 425 req/s
  2. https://www.npmjs.com/package/limiter - benchmark shows 420 req/s

Baseline: 448 req/s
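
For completeness, here is a minimal sketch of what the per-instance, in-memory approach could look like, including @Joe's idea of splitting the global limit across instances (illustrative only; this is not the API of either package above):

```js
// Illustrative per-instance token bucket with the global limit split evenly.
function TokenBucket(globalRatePerSec, numInstances) {
    var rate = globalRatePerSec / numInstances;  // this instance's share
    var tokens = rate;
    var last = Date.now();

    return function take() {
        var now = Date.now();
        // Refill proportionally to elapsed time, capped at one second's worth.
        tokens = Math.min(rate, tokens + (now - last) / 1000 * rate);
        last = now;
        if (tokens < 1) {
            return false;    // over this instance's share of the limit
        }
        tokens -= 1;
        return true;
    };
}

// Usage: a global 1000 req/s limit spread over 10 equal instances.
var take = TokenBucket(1000, 10);
if (!take()) { /* respond with 429 */ }
```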

So with a local Redis, Redis-backed solutions are even slightly faster than in-memory ones, but if Redis is shared rather than local, performance drops dramatically (I ran a test against a remote Redis on a VPS in Frankfurt; it is too far away to be representative, but throughput was around 20 req/s).

I think we have enough data to decide on the technology. I personally vote for an in-memory solution using the rolling-rate-limiter lib on each individual server: since requests are distributed in a predictable fashion, Redis doesn't give us any additional value, only headaches with installation and robustness, and possibly a big performance impact. @GWicke, @mobrovac, @Eevans, what are your thoughts on this?

We discussed this on IRC this morning. The overheads for the remote options seem to be significant enough to make it unattractive to enable limits globally.

A more efficient solution would likely make per-request decisions locally, while asynchronously propagating information on keys accessed and keys to block (possibly with local thresholding to reduce traffic). This would decouple Redis performance from request latency, and avoid Redis becoming the bottleneck in the cluster. It might even be possible to use a gossip protocol instead, somewhat similar to this Yahoo system (sadly, no open-source implementation).
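
Roughly, the idea would be something like the following sketch (the propagation transport and all names are placeholders, not an existing design): count and decide locally on the hot path, and only ship aggregated counters and block decisions to peers in the background.

```js
// Illustrative sketch only: decide locally, propagate counters asynchronously.
var localCounts = {};   // key -> requests seen by this instance this period
var blocked = {};       // key -> true, per the cluster-wide decision

function allow(key) {
    if (blocked[key]) {
        return false;               // decided from local state only, no I/O
    }
    localCounts[key] = (localCounts[key] || 0) + 1;
    return true;
}

// Placeholder for the async transport (Redis, a gossip protocol, ...).
function publishCounts(counts) { /* ship aggregated counters to peers */ }

// Off the request path: periodically exchange counters with peers.
setInterval(function () {
    publishCounts(localCounts);
    localCounts = {};
}, 1000);

// Called when aggregated data from peers says a key crossed the limit.
function onPeerUpdate(keysToBlock) {
    keysToBlock.forEach(function (k) { blocked[k] = true; });
}
```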

More research seems to be needed for a good longer-term solution. For now, we can simply defer implementing this until it is really needed.