So far, we have primarily been using //rate// limiting to protect API endpoints from abuse. In practice, we have found that rates are not the best metric to focus on, from both the users' perspective and ours. In this task, I am making the case for focusing on //request concurrency// instead.
For API users, rate limits are difficult to understand and implement. Most clients are implemented as one or more basic loops iterating over URLs. The actual request rate such a loop produces can change drastically with response times and network conditions, which is hard for clients to predict. Implementing real rate limiting and request pacing is non-trivial, and I believe very rare in practice. Because they are built as simple loops, the vast majority of our clients effectively already limit //concurrency//, not rates.
On the server side, the main thing we care about is limiting the resources a single client can tie up. The time (and thus CPU / memory / IO resources) needed to serve individual API requests can differ by several orders of magnitude. Some requests are very cheap when served from caches, but much more expensive when not. Even within a single API entry point, like the one we expose for ad-hoc wikitext parsing, costs can differ wildly depending on inputs. However, to a first approximation, each concurrent request ties up a roughly similar amount of resources while it is being processed. This means that the //request concurrency// per client approximates the associated resource usage much better than the request rate does.
Additionally, concurrency-limited clients automatically slow down during periods of temporarily elevated latency, which reduces load exactly when it is most expensive for our infrastructure.
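To make this concrete: by Little's law, a client that keeps at most `c` requests in flight against a service with mean response time `t` achieves an effective request rate of at most `c / t`. For example (illustrative numbers only), a client limited to 4 concurrent requests peaks at 40 requests/s when responses take 100 ms, and automatically drops to 10 requests/s if response times rise to 400 ms during an incident, with no change on the client side.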
## Implementing concurrency limiting
Our nginx / varnish layer is critical for performance and reliability. For this reason, we would much prefer to make limiting decisions using local nginx / varnish state only, avoiding dependencies on other services that are subject to failures and network latency.
- Nginx has [concurrency limiting on arbitrary, templated keys](http://nginx.org/en/docs/http/ngx_http_limit_conn_module.html) (IP, header, [etc](http://nginx.org/en/docs/varindex.html)); see the config sketch after this list
- Problem: We would need to map all connections from a given API user to the same Nginx instance for the limiting to be effective, since the limit_conn counters are local to each instance. LVS can only balance on IPs, so it cannot do this unless we are willing to limit on IPs only.
- Varnish handles all our analytics needs, so limiting in Varnish would make it easier to accurately capture limiting decisions in analytics.
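As a rough illustration of the Nginx option above, here is a minimal `limit_conn` sketch; the header name, zone size, limit, and upstream name are placeholder assumptions rather than decisions, and both directives would live in the `http` block:

```
# Sketch only: count in-flight requests per client id taken from a request
# header. The shared memory zone (and thus the counters) is local to each
# nginx instance.
limit_conn_zone $http_x_client_id zone=api_clients:10m;

server {
    listen 80;

    location /api/ {
        # Reject a request once this client already has 10 requests in
        # flight on this instance, and signal "back off" with a 429.
        limit_conn api_clients 10;
        limit_conn_status 429;
        proxy_pass http://varnish_frontends;
    }
}
```

Note that requests with an empty key value are not accounted by `limit_conn_zone`, so traffic without the (hypothetical) client id header would still need the existing IP-based limits.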
### Idea: Balance nginx->varnish connections by app key hash in Nginx; concurrency limit in front-end Varnish.
- Load balance backend connections from Nginx to Varnish by app-level $client_id; don't use LVS (see the config sketch at the end of this list)
- [ngx_http_upstream_consistent_hash module](https://www.nginx.com/resources/wiki/modules/consistent_hash/): consistent hash load balancing on arbitrary key
- [nginx-upsync-module](https://github.com/weibocom/nginx-upsync-module): adds dynamic backend configuration from etcd
- Could later use [auth JWTs](https://nginx.org/en/docs/http/ngx_http_auth_jwt_module.html) to verify client ids & derive a reliable key
- Implement concurrency limiting in Varnish. Options:
- Extend the [vsthrottle module](https://github.com/varnish/varnish-modules/blob/master/src/vmod_vsthrottle.c) with the ability to return tokens at the end of a request, as discussed in [this task](https://github.com/varnish/libvmod-vsthrottle/issues/8). This looks fairly straightforward to implement.
- Create a simplified counter module loosely modeled on vsthrottle, based on [atomic counters](https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/_005f_005fatomic-Builtins.html) & a periodic GC process. Should offer better performance, but might be YAGNI.
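As a rough sketch of the Nginx side of this idea, assuming the third-party consistent hash module is compiled in and client ids arrive in a hypothetical `X-Client-Id` header (server addresses, ports, and the header name are placeholders):

```
# Sketch only: send all traffic for a given client id to the same front-end
# Varnish, so that Varnish's per-client concurrency counters see the
# client's full request stream.
upstream varnish_frontends {
    # Directive provided by ngx_http_upstream_consistent_hash.
    consistent_hash $http_x_client_id;
    server 10.2.0.11:3128;
    server 10.2.0.12:3128;
    server 10.2.0.13:3128;
}

server {
    listen 80;

    location / {
        # Pass the client id along so Varnish can key its limits on it.
        proxy_set_header X-Client-Id $http_x_client_id;
        proxy_pass http://varnish_frontends;
    }
}
```

With consistent hashing, adding or losing a Varnish front-end only remaps the clients hashed to that node, so per-client counters stay meaningful for everyone else. The Varnish side would then key the extended vsthrottle or counter vmod on the same client id.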