The purpose of this ticket is to discuss and track the evolution of the API rate limiting architecture to be put in place in FY25-26 as part of KR5.1 (see also internal design document).
Work packages (hypothesis):
Introduction and goals
Requirements
- Apply per-client rate limiting to API requests
- Define surrogate user IDs for anonymous requests
- Share limit counters across sites, backends, and endpoints
- Support cost-based limits (alternatively, implement concurrency limits)
- Separate counters for different time periods (seconds, minutes) for burst control
Assumptions
- No requirement to share counters across data-centers
- Since the goal is protection from load spikes, we don’t need long-term limits (beyond per-minute)
- Occasional loss of state is acceptable (effectively doubling the rate limit for one block)
- Intermittent loss of functionality for a few seconds is acceptable, as long as it affects less than ¼ of clients (TBC)
- Sliding windows are preferred; fixed windows are acceptable
- Requests served from the cache (Varnish, ATS) do not need to be considered: they don’t count against the rate limit, since they do not consume significant resources
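Since sliding windows are preferred but fixed windows are acceptable, one option is the common sliding-window approximation, which weights the previous fixed window’s count by its remaining overlap with the sliding window. This is an illustrative sketch, not a proposed implementation; the class name and parameters are made up:

```python
class SlidingWindowCounter:
    """Approximate sliding-window rate limiter (illustrative sketch).

    Weights the previous fixed window's count by how much of it still
    overlaps the sliding window, smoothing out bursts at window
    boundaries compared to a plain fixed window.
    """

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.current_start = 0.0
        self.current_count = 0
        self.previous_count = 0

    def allow(self, now: float) -> bool:
        # Roll windows forward if we have crossed a window boundary.
        elapsed = now - self.current_start
        if elapsed >= self.window:
            # If more than one full window passed, the old count is stale.
            self.previous_count = (
                self.current_count if elapsed < 2 * self.window else 0
            )
            self.current_count = 0
            self.current_start = now - (elapsed % self.window)
        # Fraction of the previous window still inside the sliding window.
        overlap = 1.0 - (now - self.current_start) / self.window
        estimated = self.previous_count * overlap + self.current_count
        if estimated >= self.limit:
            return False
        self.current_count += 1
        return True
```

With a limit of 5/second, five requests at t=0 are allowed and the sixth is rejected; at t=1.5 the previous window still counts at half weight, so only part of the budget is available again.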
Stakeholders
- MediaWiki Platform Team owns the implementation of rate limiting and authentication in API gateway
- MediaWiki Interfaces Team owns API behavior
- Service Operations Team needs protection from load spikes and owns operation of API gateway
- Traffic Team owns first-tier rate limiting (by IP) and scraper detection
Quality goals
- Scale to 100k req/sec
- 99.999% availability
Constraints
Prefer technologies that we already have in production:
- Envoy for the API gateway
- Kubernetes/Helm for deployment
- Redis for storage (if any)
Context and scope
This work is part of KR5.1, see FY25–26 WE5.1: Q1 Hypotheses. It is closely related to WE5.4 and some of the work in WE5.2. [TBD: links!]
For many API calls, we do not know who controls them (the operator). Certain operators appear to consume an unfair share of resources (CPU time, file handles, etc.). If we crack down on scraping the website (WE5.4, see Scraping), this problem is likely to get worse as scrapers switch to APIs like PCS instead.
To address this issue, we should implement improved authentication (WE5.1.2) and apply per-user rate limiting for API calls. This means we will end up having three tiers of rate limiting:
- Per-IP limits applied by HAProxy at the edge (existing), across everything, including page views and cache hits. These serve primarily as protection from DDoS attacks and aggressive scraping. These limits should be permissive for requests with valid authentication. Limits are controlled by SRE.
- Per-user API rate limits (new), across sites and backends. Their purpose is to ensure fair usage of resources by individual users/clients and account for the fact that some requests are much more expensive than others.
- Application-based rate limiting per user and operation (existing). These limits primarily exist in MediaWiki to protect the community from an overwhelming volume of edits.
The scope of this document is the rate limiting mechanism in the API gateway, how it integrates with adjacent layers, and how it ties in with user authentication. This is needed because both IP-based and application-level rate limiting are insufficient:
IP-based rate limiting is easy to bypass for an actor with access to a large network of computers, be it their own or a rented botnet. Conversely, restrictive IP-based rate limits will negatively impact legitimate use where a large number of users share a small number of public IP addresses, such as on a college campus or with some ISPs, especially in the Global South. IP-based rate limiting is, however, extremely efficient and an indispensable tool for handling DDoS attacks.
Application-level rate limiting provides fine-grained control on a per-user basis, but it has to be implemented separately for each application. Since we have several applications besides MediaWiki that are high-value targets for scraping (like the Page Content Service) or very expensive (like the Wikidata Query Service, but also some action API calls), this approach would lead to overhead and inconsistencies. Application-level rate limits are, however, still useful, especially for MediaWiki, to provide fine-grained control over how much activity the community is expected to patrol and review.
Solution strategy (tentative)
- Limits are applied by an API gateway (Envoy) located between ATS and the application servers
- Use Envoy’s standard “global rate limiting” approach: call a standalone rate limiting service (RLS cluster) to determine whether a request is allowed to go through.
- The user’s ID and tier come from a JWT, if present. For anonymous requests, the CDN should supply a surrogate ID.
- Shared-nothing: Keep rate-limiting counters local to each RLS pod, either directly in-memory inside the process, or in a side-car (Redis).
- If the shared-nothing architecture turns out not to be feasible, we can fall back to a single Redis instance for all counters. This would be a single point of failure, though. A multi-master (write/write) replicated technology would be preferable, but scaling it to 100k writes/sec may be a challenge.
- To enable a shared-nothing architecture, make use of Envoy’s consistent hashing load-balancing mechanism for the cluster of rate limiters, to ensure that the same user always uses the same rate limiter instance.
- Question: can we use Kubernetes’ “stateful set” feature to define a cluster of RLS pods that can be addressed individually? Can we rely on Kubernetes DNS? Can it update fast enough?
- Concurrency control is approximated using Envoy’s “usage based rate limiting” feature (aka “cost”, aka “hits_added”). This means bumping counters twice: once on the way in (request) with a base cost, then on the way out (response) with an additional cost. A request’s cost is (for now) simply the time it took on the upstream server (x-envoy-upstream-service-time).
- The applicable limits are determined based on a user “class” included in the JWT. Classes would be something like: *unidentified, anon visitor, authenticated user, authorized bot*.
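The “global rate limiting” call-out described above boils down to the RLS answering OK or OVER_LIMIT for the descriptors attached to each request. A much-simplified, in-memory sketch of that decision follows; this is not Envoy’s actual gRPC protocol, and the descriptor keys and limits are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class RateLimitService:
    # Limit per descriptor key within one window, e.g. {"user_id": 100}.
    limits: dict
    # Counters per (key, value) pair, e.g. ("user_id", "alice") -> 3.
    counters: dict = field(default_factory=dict)

    def should_rate_limit(self, descriptors: list[tuple[str, str]],
                          hits_added: int = 1) -> str:
        """Bump counters for all configured descriptors; deny if any is over."""
        for key, value in descriptors:
            if key not in self.limits:
                continue  # no limit configured for this descriptor key
            count = self.counters.get((key, value), 0) + hits_added
            self.counters[(key, value)] = count
            if count > self.limits[key]:
                return "OVER_LIMIT"
        return "OK"
```

In the real setup, Envoy would send the descriptors (user ID, class, endpoint) with each request and enforce the verdict; window expiry is omitted here for brevity.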
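The consistent-hashing idea for pinning a user to one RLS pod can be illustrated with a minimal hash ring. This is not Envoy’s ring-hash implementation, just a sketch of the property we rely on (the same key always maps to the same pod, and adding or removing a pod only remaps a fraction of keys); the pod names are hypothetical:

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Minimal consistent-hash ring: maps a key to one node deterministically."""

    def __init__(self, nodes, replicas=64):
        # Place each node at several virtual points on the ring to
        # spread load evenly.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # Walk clockwise to the next virtual node; wrap around at the end.
        idx = bisect_right(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[idx][1]
```

Given the user (or surrogate) ID as the hash key, all of one user’s counter updates land on a single RLS pod, which is what makes the shared-nothing design workable.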
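The two-phase cost accounting (base cost on request, extra cost on response) could work roughly as below. The conversion of upstream service time into cost units (here, one unit per full 100 ms) is an assumption for illustration, not a decided formula:

```python
class CostBasedLimiter:
    """Sketch of usage-based ("hits_added") accounting: bump on request
    with a base cost, bump again on response with a time-derived cost."""

    def __init__(self, budget: int):
        self.budget = budget  # allowed cost units per window
        self.used = 0

    def on_request(self, base_cost: int = 1) -> bool:
        # Reject up front if even the base cost would exceed the budget.
        if self.used + base_cost > self.budget:
            return False
        self.used += base_cost
        return True

    def on_response(self, upstream_service_time_ms: int) -> None:
        # Derived from x-envoy-upstream-service-time; one extra cost
        # unit per full 100 ms of upstream time (assumed conversion).
        self.used += upstream_service_time_ms // 100
```

Note that the response-side bump can push `used` past the budget; the overage is then charged against subsequent requests, which approximates concurrency/cost control without tracking in-flight requests explicitly.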
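Mapping the user class from the JWT to limits could then be a simple lookup that falls back to the most restrictive class when the claim is missing or unrecognized. The class names follow this document; the claim name and the numbers are placeholders:

```python
# Hypothetical per-class limits; actual numbers are to be decided.
TIER_LIMITS = {
    "unidentified":       {"per_second": 5,   "per_minute": 100},
    "anon_visitor":       {"per_second": 10,  "per_minute": 300},
    "authenticated_user": {"per_second": 25,  "per_minute": 1000},
    "authorized_bot":     {"per_second": 100, "per_minute": 5000},
}

def limits_for(jwt_claims: dict) -> dict:
    """Return the rate limits for the "class" claim in a decoded JWT.

    Unknown or missing classes fall back to the most restrictive tier.
    """
    return TIER_LIMITS.get(jwt_claims.get("class"), TIER_LIMITS["unidentified"])
```

Keeping this mapping in the RLS (rather than baking numbers into each token) lets us adjust limits per class without reissuing JWTs.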