
Test api rate limiting on production cluster
Closed, Resolved | Public | 8 Estimated Story Points

Description

Deploy rate limiting in shadow mode (dry run) for the rest gateway for collecting stats.

  • Phase 0: prepare infrastructure
    • enabled: true: add ratelimit service and dependencies
    • shadow_mode: false: (default)
    • default_policy: experiment-2025 (default)
    • user_id_cookie: "" (default)
    • fallback_class: no-limit: avoid accidental enforcement of rate limits
    • require_opt_in: true: disable rate limiting per default
    • confirm that the ratelimit and nutcracker containers have been added to the gateway pod
    • use curl to confirm that rate limiting does not apply on any route
    • use envoy's admin interface to check that no requests are being made to the ratelimit service
  • Phase 1: enable manual testing of limit enforcement
    • allow_client_headers: true to enable manual testing
    • enable_x_ratelimit_headers: true to enable x-ratelimit headers in the response (T408839)
    • apply_rate_limiting: true on some routes, to activate rate limiting
      • editor-analytics: /api/rest_v1/metrics/editors/ and /api/rest_v1/registered-users/ (~ 1 req/sec)
      • wikifeeds: /api/rest_v1/page/random/ and /api/rest_v1/page/feed/ (~200 req/sec, top user > 12k req/hour)
    • Use curl to test that no rate limits are applied on /api/rest_v1/metrics/editors/ and /api/rest_v1/page/random/
      • set the User-Agent header to something useful that points to this ticket.
      • test that the anon_limit of 500 req/hour is not enforced
      • check that there are no x-ratelimit headers in the response
    • Use curl to test that rate limits are applied if the x-wmf-user-id and x-wmf-user-class headers are set
      • test that the anon_limit of 500 req/hour is enforced for x-wmf-user-class: anon
      • test that the default_limit of 5000 req/hour is enforced for x-wmf-user-class: cookie-user
      • check the values of the x-ratelimit headers in the response
  • Phase 2: test shadow mode
    • change shadow_mode: true: to enable global shadow mode
    • use curl to check that rate limits are no longer enforced on /api/rest_v1/metrics/editors/ and /api/rest_v1/page/random/
      • confirm that we are still getting x-ratelimit headers in the response
  • Phase 3: enable shadow mode limits for all users on certain routes
    • user_id_cookie: "centralauth_User" so rate limiting is per user name (insecure)
    • fallback_class: anon so rate limits are enforced for unauthenticated users
    • allow_client_headers: false to prevent clients from overriding limits
    • use curl to check that rate limits are not enforced on any route
    • use envoy's admin interface to confirm that requests are being made to the rate limiter
    • monitor ratelimiter metrics (T408183), confirm that we are seeing the "over limit" count go up (expect >10,000 per hour from the top user of /api/rest_v1/page/random/, compare Turnilo data)
  • Phase 4: enable shadow mode limits on all routes
    • require_opt_in: false: to turn on rate limiting for everything
    • flip apply_rate_limiting to false on editor-analytics and wikifeeds routes.
    • monitor redis resource consumption
    • monitor ratelimit metrics
    • confirm that the opt-out works and no rate limits are applied on wikifeeds routes (check headers).
    • monitor redis resource consumption (eqiad/codfw). Should level off after one hour.
  • Phase 5: disable x-ratelimit headers and remove opt-out.
    • enable_x_ratelimit_headers: false: to disable rate limit headers
    • remove apply_rate_limiting from all routes
    • confirm that we are no longer sending x-ratelimit headers
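The phases above change only a few settings each, so the effective configuration at any point is the sum of all overrides up to that phase. A minimal Python sketch of that accumulation follows. It assumes a flat key layout (the real Helm values for the rest-gateway chart may nest these keys differently) and leaves out the per-route apply_rate_limiting flag; the names PHASE_OVERRIDES and effective_config are made up for illustration.

```python
# Hypothetical flat view of the settings named in the phases above.
# Keys and values are copied from the task description; the layout is an assumption.
PHASE_OVERRIDES = [
    {   # Phase 0: prepare infrastructure, enforce nothing
        "enabled": True,
        "shadow_mode": False,
        "default_policy": "experiment-2025",
        "user_id_cookie": "",
        "fallback_class": "no-limit",
        "require_opt_in": True,
    },
    {   # Phase 1: allow manual testing via client headers
        "allow_client_headers": True,
        "enable_x_ratelimit_headers": True,
    },
    {"shadow_mode": True},  # Phase 2: global shadow mode
    {   # Phase 3: shadow mode limits for all users on opted-in routes
        "user_id_cookie": "centralauth_User",
        "fallback_class": "anon",
        "allow_client_headers": False,
    },
    {"require_opt_in": False},  # Phase 4: all routes
    {"enable_x_ratelimit_headers": False},  # Phase 5: cleanup
]


def effective_config(phase: int) -> dict:
    """Merge the overrides of phases 0..phase, later phases winning."""
    config = {}
    for overrides in PHASE_OVERRIDES[: phase + 1]:
        config.update(overrides)
    return config


if __name__ == "__main__":
    for phase in range(len(PHASE_OVERRIDES)):
        print(f"Phase {phase}: {effective_config(phase)}")
```

Printing the merged result per phase makes it easy to see, for example, that shadow_mode stays true from Phase 2 onward while allow_client_headers is only true during Phases 1 and 2.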

Event Timeline

daniel set the point value for this task to 8.
daniel set Due Date to Oct 17 2025, 10:00 PM.

Candidate routes for enabling rate limiting...

Very low traffic:

  • /api/rest_v1/page/pdf/ (< 1 req/sec)
  • /api/rest_v1/page/talk/ (< 1 req/sec)
  • /api/rest_v1/data/recommendation/ (< 1 req/sec)

Quite low traffic:

  • /api/rest_v1/transform/ (~15 req/sec, but the top user has 8k req/hour)
  • /api/rest_v1/page/random/ (~55 req/sec, but the top user has 9k req/hour)
daniel updated the task description.
daniel triaged this task as High priority. Nov 6 2025, 9:01 AM

Change #1202647 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/deployment-charts@master] rest-gateway: enable rate limit infrastructure, enforce no limits

https://gerrit.wikimedia.org/r/1202647

Change #1202654 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/deployment-charts@master] rest-gateway: enable rate limit infrastructure, allow manual testing

https://gerrit.wikimedia.org/r/1202654

Change #1202658 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/deployment-charts@master] rest-gateway: enable rate limits on some routes in shadow mode

https://gerrit.wikimedia.org/r/1202658

Change #1202647 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: enable rate limit infrastructure, enforce no limits

https://gerrit.wikimedia.org/r/1202647

Change #1202654 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: enable rate limit infrastructure, allow manual testing

https://gerrit.wikimedia.org/r/1202654

Change #1202658 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: enable rate limits on some routes in shadow mode

https://gerrit.wikimedia.org/r/1202658

Change #1203848 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/deployment-charts@master] rest-gateway: enable shadow mode limits on nearly all routes

https://gerrit.wikimedia.org/r/1203848

Change #1203843 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/deployment-charts@master] rest-gateway: enable rate limit in shadow mode on some routes

https://gerrit.wikimedia.org/r/1203843

Change #1203849 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/deployment-charts@master] rest-gateway: clean up test config

https://gerrit.wikimedia.org/r/1203849

Change #1203843 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: enable rate limit in shadow mode on some routes

https://gerrit.wikimedia.org/r/1203843

Change #1203848 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: enable shadow mode limits on nearly all routes

https://gerrit.wikimedia.org/r/1203848

Change #1203849 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: clean up test config

https://gerrit.wikimedia.org/r/1203849

daniel closed this task as Resolved. Edited Nov 12 2025, 12:02 PM

This is done: @Clement_Goubert deployed API rate limiting for all users of all APIs today (in "shadow mode", so nothing is enforced). We are now collecting rate limiting metrics for all requests to the API.

Impact stats:

  • This affects about 7k requests per second (currently 4.5k in codfw and 2.5k in eqiad).
  • About 8% of the traffic is authenticated browser traffic (has a centralauth_User cookie)
  • About 35% of anon traffic exceeds the current limit of 500 req/hour
  • About 25% of authenticated traffic exceeds the current limit of 5000 req/hour
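To get a feel for what these percentages mean in absolute terms, here is a rough back-of-the-envelope calculation in Python. It assumes that the percentages are shares of request volume (not of distinct users) and that everything without a centralauth_User cookie counts as anon; neither is stated explicitly above, so treat the result as an estimate only. The function name is made up for illustration.

```python
# Figures from the impact stats above.
TOTAL_RPS = 7000         # about 7k requests per second overall
AUTH_SHARE = 0.08        # share of traffic with a centralauth_User cookie
ANON_OVER_LIMIT = 0.35   # share of anon traffic over 500 req/hour
AUTH_OVER_LIMIT = 0.25   # share of authenticated traffic over 5000 req/hour


def over_limit_estimate(total_rps, auth_share, anon_over, auth_over):
    """Estimate requests per second that would be over limit if enforced.

    Assumes percentages refer to request volume, which is an interpretation.
    """
    auth_rps = total_rps * auth_share
    anon_rps = total_rps - auth_rps
    return {
        "auth_rps": auth_rps,
        "anon_rps": anon_rps,
        "anon_over_rps": anon_rps * anon_over,
        "auth_over_rps": auth_rps * auth_over,
    }


if __name__ == "__main__":
    est = over_limit_estimate(TOTAL_RPS, AUTH_SHARE, ANON_OVER_LIMIT, AUTH_OVER_LIMIT)
    over_total = est["anon_over_rps"] + est["auth_over_rps"]
    for name, value in est.items():
        print(f"{name}: {value:.0f} req/s")
    print(f"over limit overall: {over_total:.0f} req/s "
          f"({over_total / TOTAL_RPS:.1%} of all traffic)")
```

Under these assumptions, roughly a third of all API traffic would be over limit if the current limits were enforced, almost all of it anon.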

Performance stats:

  • Rate limiting adds about 10ms to the p99 of request latency, and about 5ms to the p50.
  • Visible impact on redis servers (number of operations increased by 100%), but resource usage is well within limits
  • Resource usage and performance of the rate limiting service and Envoy gateway are well within limits overall

Issues:

  • On codfw, a low but significant number of requests from the gateway to the rate limiter time out (about 10 req/s "fail open", so no rate limiting is applied). This affects less than 1% of requests, but it is not satisfactory and warrants further investigation.
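As a quick sanity check on the "less than 1%" figure, the fail-open share can be computed from the numbers in this comment. The sketch below assumes that the codfw figure in the impact stats means about 4.5k requests per second; the function name is made up for illustration.

```python
# Figures from this comment; the codfw volume of 4.5k req/s is an assumed reading.
CODFW_RPS = 4500       # requests per second handled in codfw
FAIL_OPEN_RPS = 10     # rate limiter calls timing out per second ("fail open")


def fail_open_share(fail_open_rps: float, total_rps: float) -> float:
    """Fraction of requests that pass without a rate limit decision."""
    return fail_open_rps / total_rps


if __name__ == "__main__":
    share = fail_open_share(FAIL_OPEN_RPS, CODFW_RPS)
    print(f"fail-open share in codfw: {share:.2%}")
```

This comes out at roughly 0.2% of codfw requests, consistent with the "less than 1%" stated above.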

Next steps:

  • Add support for API keys (T405578)
  • Gather per-user metrics (T407999)