
Implement global ratelimiting in our service mesh
Closed, ResolvedPublic

Description

To curb the load on mw-api-int caused by the search update pipeline's fetch operator, search would like a global rate limit (not one per worker as that leads to long tails and unused "quota").
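The difference between a per-worker and a global limit can be illustrated with a small calculation. This is only a sketch under assumed numbers: the worker count, the demand figures and the function name `admitted` are hypothetical and do not come from the search update pipeline.

```python
def admitted(demand_per_worker, total_limit, shared):
    """Requests admitted in one window.

    shared=False: the total limit is split evenly, one fixed quota per worker.
    shared=True:  all workers draw from one global quota.
    """
    if shared:
        return min(sum(demand_per_worker), total_limit)
    per_worker = total_limit // len(demand_per_worker)
    return sum(min(d, per_worker) for d in demand_per_worker)


# Hypothetical uneven load across four workers, total limit 1000 per window.
demand = [700, 100, 100, 100]
print(admitted(demand, 1000, shared=False))  # 550
print(admitted(demand, 1000, shared=True))   # 1000
```

With the split quota the busy worker is capped at 250 while the other three leave most of their share unused; with the shared quota the same total demand fits within the limit.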

Rate limiting is a long-wanted feature for the (internal) MW API anyway, see T248543. Service/Ops is willing to discuss implementing it the Envoy way: Envoy supports local and remote/distributed rate limits, as described here. The least invasive approach to test this would be the following:

  • set up an envoy rate-limit service (backed by redis)
  • configure the client-side sidecar envoys to use that rate-limit service

This avoids unnecessary network traffic leaving the pod.

With that setup/configuration in place, the fetch operator must handle HTTP 429 responses gracefully: it should retry them, but with a shorter, constant delay rather than the growing delay used for regular retries.
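The retry behaviour described above could look roughly like the following. This is a minimal sketch, not the fetch operator's actual code: the function name `fetch_with_retry`, the `send` callable, the delay values and the choice of exponential backoff for 5xx responses are all assumptions made for illustration.

```python
import time
from typing import Callable


def fetch_with_retry(
    send: Callable[[], int],
    max_attempts: int = 5,
    rate_limit_delay: float = 0.1,  # assumed short, constant delay for 429s
    base_delay: float = 1.0,        # assumed base for the growing delay on 5xx
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-retryable status or attempts run out.

    send() performs one request and returns the HTTP status code.
    429 is retried after a constant delay; 5xx is retried with a delay
    that doubles on every server-side failure.
    """
    failures = 0  # counts only 5xx failures, so 429s never grow the delay
    status = send()
    for _ in range(max_attempts - 1):
        if status == 429:
            sleep(rate_limit_delay)
        elif 500 <= status < 600:
            sleep(base_delay * 2 ** failures)
            failures += 1
        else:
            break
        status = send()
    return status
```

Passing `sleep` and `send` as parameters keeps the sketch testable without a network or real waiting; a real client would wrap its HTTP call in `send`.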

api-gateway already uses a combination of ratelimit (the standard implementation for global rate limiting from envoy) and redis-misc (via nutcracker). In that setup, ratelimit is running as a sidecar alongside the api-gateway envoy.

For the mesh ratelimit, we decided to provide a central ratelimit service via its own chart and deployment that can be used by all service mesh envoys and may hold multiple rate limit configurations (domains) for different use cases. The initial rate limit configuration should allow 1k requests per second per user-agent, as that is easy enough to distinguish and we encourage mw-api clients to properly identify themselves anyway.
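To make the "per user-agent within a domain" idea concrete, here is an in-memory sketch of a fixed-window counter keyed by domain and user-agent. It is an illustration only and does not claim to mirror the ratelimit service's internals or its Redis usage; the class name, the domain string and the user-agent strings are hypothetical.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Counts requests per (domain, user-agent) in one-second windows."""

    def __init__(self, limit_per_second, clock=time.time):
        self.limit = limit_per_second
        self.clock = clock
        # A shared store would expire old windows; this sketch never cleans up.
        self.counters = defaultdict(int)

    def allow(self, domain, user_agent):
        window = int(self.clock())  # current one-second window
        key = (domain, user_agent, window)
        self.counters[key] += 1
        return self.counters[key] <= self.limit


limiter = FixedWindowLimiter(limit_per_second=1000)
if not limiter.allow("example-domain", "example-client/1.0"):
    print("would answer with HTTP 429")
```

Because the counter is keyed by user-agent, one noisy client exhausts only its own budget, which is why clients identifying themselves properly matters for this scheme.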

For this MVP implementation, the mesh "clients" should be able to opt in to being rate limited via configuration values. The proposed implementation/configuration structure can be found at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1028558

There is also an initial dashboard showing the metrics the ratelimit service exposes (via the statsd exporter, as it does not support native Prometheus metrics): https://grafana.wikimedia.org/d/bf921591-bd2b-4a87-ae20-7cc6f227e58a/jayme-ratelimit

I tried to condense the above into https://wikitech.wikimedia.org/wiki/Ratelimit

Details

Repo | Branch | Lines +/-
operations/deployment-charts | master | +4 -0
operations/deployment-charts | master | +3 -0
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +5 -0
operations/deployment-charts | master | +3 -0
operations/puppet | production | +1 -1
operations/deployment-charts | master | +6 -6
operations/deployment-charts | master | +12 -2
operations/deployment-charts | master | +286 -128
operations/deployment-charts | master | +3 -1
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +1 -0
operations/deployment-charts | master | +2 -1
operations/deployment-charts | master | +103 -0
operations/puppet | production | +4 -0
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +9 -0
operations/deployment-charts | master | +115 -30
operations/deployment-charts | master | +610 -0
operations/deployment-charts | master | +358 -426
operations/deployment-charts | master | +1 K -0
operations/deployment-charts | master | +115 -30
operations/docker-images/production-images | master | +8 -1
operations/software/envoyproxy/ratelimiter | master | +136 -22
operations/deployment-charts | master | +52 -30
operations/deployment-charts | master | +41 -0
operations/docker-images/production-images | master | +14 -7
Title | Reference | Author | Source Branch | Dest Branch
Use distinct HTTP user-agent | repos/search-platform/cirrus-streaming-updater!127 | pfischer | dedicated-user-agent | main

Event Timeline

pfischer renamed this task from SUP rate limits for fetch to SUP rate-limit fetch. Apr 11 2024, 10:16 AM
pfischer updated the task description.
Gehel triaged this task as High priority. Apr 15 2024, 1:14 PM
Gehel moved this task from needs triage to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
pfischer changed the task status from Open to Stalled. Apr 18 2024, 4:02 PM
pfischer added a subscriber: JMeybohm.

I brought up this discussion with @JMeybohm and, as it turns out, rate limiting is a long-wanted feature for the MW API anyway, see T248543. Service/Ops is willing to discuss implementing it the Envoy way: Envoy supports local and remote/distributed rate limits, as described here. The least invasive approach to test this would be the following:

  • set up an envoy rate-limit service (backed by redis)
  • configure the client-side sidecar envoys to use that rate-limit service

This avoids unnecessary network traffic leaving the pod.

With that setup/configuration in place, the fetch operator must handle HTTP 429 responses gracefully: it should retry them, but with a shorter, constant delay rather than the growing delay used for regular retries.

pfischer changed the task status from Stalled to In Progress. May 2 2024, 12:46 PM

Change #1026563 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] New chart from scaffold: ratelimit

https://gerrit.wikimedia.org/r/1026563

Change #1026564 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add new chart: ratelimit

https://gerrit.wikimedia.org/r/1026564

Change #1026859 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] New version of base.certificates module

https://gerrit.wikimedia.org/r/1026859

Change #1026860 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Make base.certificates compatible with chart modules and scaffold

https://gerrit.wikimedia.org/r/1026860

Change #1028532 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/docker-images/production-images@master] ratelimit: Update ratelimit service to git 3fcc360

https://gerrit.wikimedia.org/r/1028532

Change #1028557 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add new mesh.configuration version

https://gerrit.wikimedia.org/r/1028557

Change #1028558 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mesh.configuration: Add support for rate limiting

https://gerrit.wikimedia.org/r/1028558

Change #1028532 merged by JMeybohm:

[operations/docker-images/production-images@master] ratelimit: Update ratelimit service to git 3fcc360

https://gerrit.wikimedia.org/r/1028532

Change #1026859 merged by jenkins-bot:

[operations/deployment-charts@master] New version of base.certificates module

https://gerrit.wikimedia.org/r/1026859

Change #1026860 merged by jenkins-bot:

[operations/deployment-charts@master] Make base.certificates compatible with chart modules and scaffold

https://gerrit.wikimedia.org/r/1026860

Change #1029205 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/software/envoyproxy/ratelimiter@master] Add CertProvider to hot reload TLS certs for gRPC service

https://gerrit.wikimedia.org/r/1029205

JMeybohm renamed this task from SUP rate-limit fetch to Implement global ratelimiting in our service mesh. May 8 2024, 4:20 PM
JMeybohm claimed this task.
JMeybohm edited projects, added serviceops; removed serviceops-radar.
JMeybohm updated the task description.
JMeybohm added subscribers: hnowlan, akosiaris.

Change #1029205 merged by JMeybohm:

[operations/software/envoyproxy/ratelimiter@master] Add CertProvider to hot reload TLS certs for gRPC service

https://gerrit.wikimedia.org/r/1029205

Change #1032293 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/docker-images/production-images@master] ratelimit: Add CertProvider to hot reload TLS certs for gRPC service

https://gerrit.wikimedia.org/r/1032293

Change #1032293 merged by JMeybohm:

[operations/docker-images/production-images@master] ratelimit: Add CertProvider to hot reload TLS certs for gRPC service

https://gerrit.wikimedia.org/r/1032293

Successfully published image docker-registry.discovery.wmnet/ratelimit:9.0.2-20240503.3fcc360, supporting hot reload of gRPC certs. This should unblock deploying the ratelimit service.

Change #1034896 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mesh.configuration: Add support for rate limiting

https://gerrit.wikimedia.org/r/1034896

Change #1034896 abandoned by JMeybohm:

[operations/deployment-charts@master] mesh.configuration: Add support for rate limiting

Reason:

Messed up rebase

https://gerrit.wikimedia.org/r/1034896

Change #1026563 merged by JMeybohm:

[operations/deployment-charts@master] New chart from scaffold: ratelimit

https://gerrit.wikimedia.org/r/1026563

Change #1026564 merged by JMeybohm:

[operations/deployment-charts@master] Add new chart: ratelimit

https://gerrit.wikimedia.org/r/1026564

Change #1028557 merged by jenkins-bot:

[operations/deployment-charts@master] Add new mesh.configuration version

https://gerrit.wikimedia.org/r/1028557

Change #1028558 merged by jenkins-bot:

[operations/deployment-charts@master] mesh.configuration: Add support for rate limiting

https://gerrit.wikimedia.org/r/1028558

Change #1039626 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] mesh: publish mesh.configuration 1.8

https://gerrit.wikimedia.org/r/1039626

Change #1039626 merged by jenkins-bot:

[operations/deployment-charts@master] mesh: publish mesh.configuration 1.8

https://gerrit.wikimedia.org/r/1039626

Change #1039726 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mesh: Update mesh.name dependency to mesh.configuration:1.8

https://gerrit.wikimedia.org/r/1039726

Change #1039726 merged by jenkins-bot:

[operations/deployment-charts@master] mesh: Update mesh.name dependency to mesh.configuration:1.8

https://gerrit.wikimedia.org/r/1039726

Change #1039727 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] flink-app: Update various modules

https://gerrit.wikimedia.org/r/1039727

Change #1040060 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add ratelimit user to the wikikube/main clusters

https://gerrit.wikimedia.org/r/1040060

Change #1040086 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Create ratelimit namespace and releases

https://gerrit.wikimedia.org/r/1040086

Change #1040060 merged by JMeybohm:

[operations/puppet@production] Add ratelimit user to the wikikube/main clusters

https://gerrit.wikimedia.org/r/1040060

Change #1040086 merged by jenkins-bot:

[operations/deployment-charts@master] Create ratelimit namespace and releases

https://gerrit.wikimedia.org/r/1040086

Change #1040096 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Don't deploy istio certificate for the ratelimit service

https://gerrit.wikimedia.org/r/1040096

Change #1040096 merged by jenkins-bot:

[operations/deployment-charts@master] Don't deploy istio certificate for the ratelimit service

https://gerrit.wikimedia.org/r/1040096

Change #1040105 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] ratelimit: Don't deploy latest

https://gerrit.wikimedia.org/r/1040105

Change #1040105 merged by jenkins-bot:

[operations/deployment-charts@master] ratelimit: Don't deploy latest

https://gerrit.wikimedia.org/r/1040105

Change #1040114 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] ratelimit: Ensure LOCAL_CACHE_SIZE_IN_BYTES is not converted

https://gerrit.wikimedia.org/r/1040114

Change #1040114 merged by jenkins-bot:

[operations/deployment-charts@master] ratelimit: Ensure LOCAL_CACHE_SIZE_IN_BYTES is not converted

https://gerrit.wikimedia.org/r/1040114

Change #1040138 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] ratelimit: Allow ingress on the HTTP port for easier testing

https://gerrit.wikimedia.org/r/1040138

Change #1039727 merged by jenkins-bot:

[operations/deployment-charts@master] flink-app: Update various modules

https://gerrit.wikimedia.org/r/1039727

The ratelimit service has been deployed to staging and prod wikikube clusters.
What's left to be done is to configure cirrus-streaming-updater to use it (see https://wikitech.wikimedia.org/wiki/Ratelimit#Enable/opt_in_to_rate_limiting). From the values files alone I can't tell which components (all?) should be rate limited, so I'd like to leave that change to you @pfischer / @bking / @dcausse. Feel free to send it my way for review, or sync with me for the deployment so we can verify everything works as expected.

Change #1040211 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: enable rate limiting

https://gerrit.wikimedia.org/r/1040211

Change #1040211 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: enable rate limiting

https://gerrit.wikimedia.org/r/1040211

Change #1043059 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] ratelimit: Use LOG_LEVEL warn by default

https://gerrit.wikimedia.org/r/1043059

Change #1043059 merged by jenkins-bot:

[operations/deployment-charts@master] ratelimit: Use LOG_LEVEL warn by default

https://gerrit.wikimedia.org/r/1043059

Change #1043125 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] ratelimit: Increase CPU limit and set GOMAXPROCS everywhere

https://gerrit.wikimedia.org/r/1043125

Change #1043125 merged by jenkins-bot:

[operations/deployment-charts@master] ratelimit: Increase CPU limit and set GOMAXPROCS everywhere

https://gerrit.wikimedia.org/r/1043125

Change #1043667 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] prometheus::k8s: Keep envoy ratelimit metrics

https://gerrit.wikimedia.org/r/1043667

Change #1043667 merged by JMeybohm:

[operations/puppet@production] prometheus::k8s: Keep envoy ratelimit metrics

https://gerrit.wikimedia.org/r/1043667

Change #1046591 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: use dedicated user agents

https://gerrit.wikimedia.org/r/1046591

Change #1046591 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: use dedicated user agents

https://gerrit.wikimedia.org/r/1046591

Change #1047538 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: retry 429s at HTTP client level

https://gerrit.wikimedia.org/r/1047538

Change #1047538 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: retry 429s at HTTP client level

https://gerrit.wikimedia.org/r/1047538

Change #1048083 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: use rate-limited HTTP client

https://gerrit.wikimedia.org/r/1048083

Change #1048083 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: use rate-limited HTTP client

https://gerrit.wikimedia.org/r/1048083

I'd say that from our end this is done. Feel free to reopen if you feel like something is missing.

Change #1051358 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: reduce client-side rate-limit

https://gerrit.wikimedia.org/r/1051358

Change #1051412 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] CHANGELOG for configuration 1.8.0

https://gerrit.wikimedia.org/r/1051412

Change #1051412 merged by jenkins-bot:

[operations/deployment-charts@master] CHANGELOG for configuration 1.8.0

https://gerrit.wikimedia.org/r/1051412

Change #1051358 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: reduce client-side rate-limit

https://gerrit.wikimedia.org/r/1051358