
Epic: API rate limiting dry run (WE5.1.3b)
Open, Medium, Public

Description

Based on the minikube setup (T398915), create a set of Helm templates for configurations that will allow us to deploy the candidate solutions identified by T398917 in the production environment, to collect stats (dry run) based on real life traffic.

The goal is to evaluate the selected technologies as well as to determine baselines for future rate limiting rules.

Properties to be observed:

  • added latency
  • does it even work
  • CPU load (gateway, limiter, storage)
  • memory consumption (gateway, limiter, storage)
  • service discovery and recovery (in particular, Envoy getting new IPs for nodes in a StatefulSet after they restart).
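For the "added latency" property, one way to get a number is to compare latency percentiles from a run that bypasses the limiter against a run that goes through it. This is only a minimal sketch: the function names and the sample values are made up for illustration, and a real comparison would use far more samples taken from the gateway metrics.

```python
import statistics


def percentile(samples, p):
    """Return the p-th percentile (1..99) of a list of latency samples."""
    # quantiles(n=100) returns 99 cut points; cut point p-1 is the p-th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[p - 1]


def added_latency(baseline_ms, limited_ms, percentiles=(50, 95, 99)):
    """Per-percentile difference between limited and baseline latencies."""
    return {
        p: percentile(limited_ms, p) - percentile(baseline_ms, p)
        for p in percentiles
    }


# Illustrative samples only (milliseconds); not measured data.
baseline = [10.0, 12.0, 11.0, 15.0, 30.0, 13.0, 12.0, 11.0]
limited = [x + 2.0 for x in baseline]  # pretend the limiter adds a flat 2 ms
overhead = added_latency(baseline, limited)
```

Comparing percentiles rather than means keeps a few slow outliers from hiding the overhead on typical requests.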

Note: Ideally we would test with (some percentage of) real traffic, in dry run mode. But as a first step it would already be useful to test with generated or recorded traffic on a separate, internal instance of the API gateway (backed by the real app server or a dummy, TBD).

Note: For the evaluation, it would be sufficient to determine the user name or ID based on a cookie, even if that is not trusted. We also only need to test two tiers of users: anon (with IP address as identity) and logged in (with the ID from the cookie). We can add support for JWT later.
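The two-tier identity rule can be sketched roughly as follows. This is an illustration, not the gateway's actual logic: the cookie name `UserID` and the function name are assumptions, and the cookie value is taken at face value (untrusted), as the note allows for the evaluation.

```python
from http.cookies import CookieError, SimpleCookie

# Hypothetical cookie name, for illustration only.
USER_ID_COOKIE = "UserID"


def rate_limit_identity(cookie_header, client_ip):
    """Return (tier, identity) for a request.

    Logged-in users are keyed by the (untrusted) ID from the cookie,
    anonymous users by their IP address.
    """
    if cookie_header:
        jar = SimpleCookie()
        try:
            jar.load(cookie_header)
        except CookieError:
            jar = SimpleCookie()  # unparseable header: treat as anonymous
        morsel = jar.get(USER_ID_COOKIE)
        if morsel is not None and morsel.value:
            return ("logged-in", morsel.value)
    return ("anon", client_ip)
```

For example, `rate_limit_identity("session=abc; UserID=42", "203.0.113.7")` yields the logged-in tier keyed by `"42"`, while a request without that cookie falls back to the IP address.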

Related Objects

Status    Assigned
Open      None
Open      None
Open      daniel
Declined  Clement_Goubert
Resolved  daniel
Resolved  daniel
Resolved  daniel
Resolved  pmiazga
Resolved  hnowlan
Resolved  daniel
Resolved  daniel
Resolved  Clement_Goubert
Resolved  daniel
Resolved  pmiazga
Resolved  daniel
Resolved  pmiazga
Resolved  daniel
Resolved  brennen
Resolved  daniel
Resolved  Clement_Goubert
Resolved  daniel
Resolved  daniel
Declined  None
Open      None
Resolved  Jgiannelos
Stalled   None
Invalid   None
Resolved  daniel
Resolved  daniel
Resolved  daniel
Open      SLyngshede-WMF
Resolved  taavi
Resolved  taavi
Resolved  None
Resolved  Clement_Goubert
Resolved  Clement_Goubert

Event Timeline

daniel triaged this task as Medium priority. Jul 16 2025, 2:33 PM
daniel moved this task from Incoming (Needs Triage) to Backlog on the MW-Interfaces-Team board.

I played with Envoy locally a bit. Here is a configuration that demonstrates how to detect the user's ID from a cookie, and use it to drive rate limiting: P80982.

The config also demonstrates how to do cost-based rate limiting based on the time it takes the upstream server to respond.
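To make the cost-based idea concrete, here is a rough sketch of a token bucket whose tokens are seconds of upstream time. It is not a transcription of P80982 and says nothing about how Envoy implements this; the class name, the parameters, and the choice to let the balance go negative are all assumptions for illustration. Because the cost is only known once the upstream has responded, the request is admitted first and charged afterwards.

```python
class CostBucket:
    """Token bucket where each request costs its upstream response time."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = float(capacity)            # max budget, in seconds
        self.refill_per_second = float(refill_per_second)
        self.tokens = float(capacity)              # current budget, may go negative
        self.updated = now                         # time of the last refill

    def _refill(self, now):
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.updated = now

    def allow(self, now):
        """Admit a request only while some budget is left."""
        self._refill(now)
        return self.tokens > 0.0

    def charge(self, upstream_seconds, now):
        """After the response, deduct the time the upstream took."""
        self._refill(now)
        self.tokens -= upstream_seconds


# Timestamps are passed in explicitly so the example is deterministic.
bucket = CostBucket(capacity=1.0, refill_per_second=0.1)
first = bucket.allow(now=0.0)         # budget available, request admitted
bucket.charge(1.5, now=0.0)           # slow response pushes the budget into debt
blocked = not bucket.allow(now=1.0)   # still in debt one second later
recovered = bucket.allow(now=10.0)    # debt paid off by the refill
```

Letting the balance go negative means one expensive request delays the client's next ones in proportion to how long it took, which is the point of charging by response time rather than by request count.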

Note to self: @Clement_Goubert said in our meeting today that we should probably start testing with a separate instance of the REST Gateway and send synthetic or recorded traffic to it. See T402914: Set up a rest-gateway deployment for rate limiting testing

daniel renamed this task from Determine how to test API rate limiting with real traffic to Test API rate limiting in a production environment. Aug 27 2025, 2:16 PM
daniel updated the task description.

Just jotting down some notes based on the conversation @daniel and I had earlier today :)

Specific areas of focus for testing:

  • Want to observe how much traffic would be rejected (without rejecting it)
  • Test spin up/spin down processes
  • Determine how many instances should be created
    • Memory consumption & CPU load relative to traffic
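The "observe how much traffic would be rejected (without rejecting it)" item could look something like the sketch below: a fixed-window counter that only records over-limit requests and leaves the decision to forward with the caller. The class name, the limit, and the window size are placeholders, and a real deployment would export these counts as metrics rather than keep them in a dict that grows without bound.

```python
from collections import defaultdict


class DryRunLimiter:
    """Fixed-window counter that records would-be rejections only."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # (identity, window index) -> requests seen
        self.total = 0
        self.would_reject = 0

    def check(self, identity, now):
        """Record the request and return True if it WOULD have been rejected.

        In dry-run mode the caller forwards the request regardless.
        """
        window = int(now // self.window_seconds)
        self.counts[(identity, window)] += 1
        self.total += 1
        over_limit = self.counts[(identity, window)] > self.limit
        if over_limit:
            self.would_reject += 1
        return over_limit

    def would_reject_ratio(self):
        return self.would_reject / self.total if self.total else 0.0


limiter = DryRunLimiter(limit=2, window_seconds=60)
flags = [limiter.check("anon:203.0.113.7", now=t) for t in range(5)]
flags.append(limiter.check("anon:203.0.113.7", now=61))  # next window
```

Here the third, fourth, and fifth requests in the first window are flagged, and the request in the next window is not, so `limiter.would_reject_ratio()` comes out at 0.5.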

General order of dependencies:

  • Biggest blocker is currently the Envoy version upgrade. Production currently runs on v1.23, with work in progress to upgrade to v1.26. To test this work effectively, we require at least v1.33 (v1.35 preferred).
  • Soft blocker/limitation: rerouting the existing APIs through the new Gateway. Rerouting will allow us to measure and monitor real traffic.

Need more clarity on:

  • What we want to measure or observe for auth. What level of granularity is expected?
    • Should we compare anonymous vs logged in? Wait for JWT?

Finally, just wanted to update with what our timeline looks like for rerouting the endpoints through the common gateway, so y'all have it:

  • Today/tomorrow: Dial test2wiki.wikimedia.org up to 100% of traffic going through new gateway.
  • Now through Sept 8: Define and review detailed test plan, perhaps with some preliminary or exploratory manual testing (https://phabricator.wikimedia.org/T400152)
  • Sept 9-23: Internal testing
  • Sept 23 - Oct 3: Community testing -- During this period, we will dial traffic through the gateway up to 100% for test wiki, so that the community can execute calls and let us know if anything seems wonky.
  • Week of Oct 6: General rollout across wikis

Also, if y'all haven't already, you might want to talk to @SLong-WMF on the Quality Services team. Their input and guidance is really helpful. We are working very closely with them to help us create a good test plan, and they might be able to help you too!

daniel renamed this task from Test API rate limiting in a production environment to Epic: Test API rate limiting in a production environment (WE5.1.3b). Nov 4 2025, 10:28 AM
daniel renamed this task from Epic: Test API rate limiting in a production environment (WE5.1.3b) to Epic: API rate limiting dry run (WE5.1.3b). Nov 6 2025, 8:58 AM