
Implement global ratelimiting in our service mesh
Open, In Progress, High, Public

Description

To curb the load on mw-api-int caused by the search update pipeline's fetch operator, search would like a global rate limit rather than one per worker, since per-worker limits lead to long tails and unused "quota".

Rate limiting is a long-wanted feature for the (internal) MW API anyway, see T248543. Service/Ops is willing to discuss implementing it the Envoy way: Envoy supports local and remote/distributed rate limits, as described here. The least invasive approach to test this would be the following:

  • set up an envoy rate-limit service (backed by redis)
  • configure the client-side sidecar envoys to use that rate-limit service

This avoids unnecessary network traffic leaving the pod.

With that setup/configuration in place, the fetch operator must handle HTTP 429 responses gracefully: it should retry, but with a shorter, constant delay rather than the growing delay used for regular retries.
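The retry behaviour described above can be sketched as follows. This is a minimal illustration in Python, not the fetch operator's actual code: the names `retry_delay` and `fetch_with_retries`, the delay constants, and the choice of exponential backoff for "regular" retries are all assumptions made for the example.

```python
import time

# Hypothetical values; the task does not specify concrete delays.
RATE_LIMIT_DELAY_S = 0.5  # short, constant delay after an HTTP 429
BACKOFF_BASE_S = 1.0      # base for the growing delay of regular retries


def retry_delay(status, attempt):
    """Return the seconds to wait after failed attempt number `attempt` (0-based)."""
    if status == 429:
        # Rate limited: the request itself was fine, so retry soon and
        # do not grow the delay.
        return RATE_LIMIT_DELAY_S
    # Other retryable errors: assumed exponential backoff.
    return BACKOFF_BASE_S * (2 ** attempt)


def fetch_with_retries(do_request, max_attempts=5, sleep=time.sleep):
    """Call do_request() until it returns a non-retryable status.

    do_request is any callable returning (status, body). 429 and 5xx
    responses are treated as retryable (an assumption for this sketch).
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = do_request()
        if status != 429 and status < 500:
            return status, body
        if attempt < max_attempts - 1:
            sleep(retry_delay(status, attempt))
    # Out of attempts: hand the last response back to the caller.
    return status, body
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production client would likely also add jitter so that workers do not retry in lockstep.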

api-gateway already uses a combination of ratelimit (the standard implementation for global rate limiting from envoy) and redis-misc (via nutcracker). In that setup, ratelimit is running as a sidecar alongside the api-gateway envoy.

For the mesh ratelimit, we decided to provide a central ratelimit service via its own chart and deployment that can be used by all service mesh envoys and may hold multiple rate limit configurations (domains) for different use cases. The initial rate limit configuration should allow 1k rps (1,000 requests per second) per user-agent, as that is easy enough to distinguish and we encourage mw-api clients to properly identify themselves anyway.
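To illustrate what a shared per-user-agent limit means in practice, here is a simplified fixed-window counter in Python. It is a conceptual sketch only, not the ratelimit service's implementation or configuration format: the class `FixedWindowLimiter` is hypothetical, and an in-memory dict stands in for the shared Redis store mentioned in this task.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Counts requests per (user-agent, one-second window)."""

    def __init__(self, limit_per_second):
        self.limit = limit_per_second
        # Stand-in for the shared store. A real store would expire old
        # windows; this sketch never cleans them up.
        self.counts = defaultdict(int)

    def allow(self, user_agent, now):
        """Return True if this request fits into the current window."""
        window = int(now)  # truncate the timestamp to whole seconds
        key = (user_agent, window)
        self.counts[key] += 1
        return self.counts[key] <= self.limit


# All callers share one limiter, so the limit applies to a user-agent
# as a whole rather than to each worker separately.
limiter = FixedWindowLimiter(limit_per_second=1000)
allowed = limiter.allow("hypothetical-client/1.0", now=1714650000.25)
```

Because the counter is keyed by user-agent and shared, quota left unused by one worker remains available to the others, which is the property a per-worker limit lacks.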

For this MVP implementation, the mesh "clients" should be able to opt in to being rate limited via configuration values. The proposed implementation/configuration structure can be found at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1028558

There is also an initial dashboard showing the metrics the ratelimit service exposes (via the statsd exporter, as it does not support native Prometheus metrics): https://grafana.wikimedia.org/d/bf921591-bd2b-4a87-ae20-7cc6f227e58a/jayme-ratelimit

Event Timeline

pfischer renamed this task from SUP rate limits for fetch to SUP rate-limit fetch. Apr 11 2024, 10:16 AM
pfischer updated the task description.
Gehel triaged this task as High priority. Apr 15 2024, 1:14 PM
Gehel moved this task from needs triage to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
pfischer changed the task status from Open to Stalled. Apr 18 2024, 4:02 PM
pfischer added a subscriber: JMeybohm.

I brought up this discussion with @JMeybohm and as it turns out, rate-limiting is a long-wanted feature for the MW API anyways, see T248543. Service/Ops is willing to discuss implementing it the envoy-way: Envoy supports local and remote/distributed rate limits, as described here. The least invasive approach to test this would be the following:

  • set up an envoy rate-limit service (backed by redis)
  • configure the client-side sidecar envoys to use that rate-limit service

This avoids unnecessary network traffic leaving the pod.

With that setup/configuration in place, the fetch operator must handle HTTP 429 responses gracefully, by retrying, but with a shorter, non-growing delay unlike regular retries.

pfischer changed the task status from Stalled to In Progress. Thu, May 2, 12:46 PM

Change #1026563 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] New chart from scaffold: ratelimit

https://gerrit.wikimedia.org/r/1026563

Change #1026564 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add new chart: ratelimit

https://gerrit.wikimedia.org/r/1026564

Change #1026859 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] New version of base.certificates module

https://gerrit.wikimedia.org/r/1026859

Change #1026860 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Make base.certificates compatible with chart modules and scaffold

https://gerrit.wikimedia.org/r/1026860

Change #1028532 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/docker-images/production-images@master] ratelimit: Update ratelimit service to git 3fcc360

https://gerrit.wikimedia.org/r/1028532

Change #1028557 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add new mesh.configuration version

https://gerrit.wikimedia.org/r/1028557

Change #1028558 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mesh.configuration: Add support for rate limiting

https://gerrit.wikimedia.org/r/1028558

Change #1028532 merged by JMeybohm:

[operations/docker-images/production-images@master] ratelimit: Update ratelimit service to git 3fcc360

https://gerrit.wikimedia.org/r/1028532

Change #1026859 merged by jenkins-bot:

[operations/deployment-charts@master] New version of base.certificates module

https://gerrit.wikimedia.org/r/1026859

Change #1026860 merged by jenkins-bot:

[operations/deployment-charts@master] Make base.certificates compatible with chart modules and scaffold

https://gerrit.wikimedia.org/r/1026860

Change #1029205 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/software/envoyproxy/ratelimiter@master] Add CertProvider to hot reload TLS certs for gRPC service

https://gerrit.wikimedia.org/r/1029205

JMeybohm renamed this task from SUP rate-limit fetch to Implement global ratelimiting in our service mesh. Wed, May 8, 4:20 PM
JMeybohm claimed this task.
JMeybohm edited projects, added serviceops; removed serviceops-radar.
JMeybohm updated the task description.
JMeybohm added subscribers: hnowlan, akosiaris.

Change #1029205 merged by JMeybohm:

[operations/software/envoyproxy/ratelimiter@master] Add CertProvider to hot reload TLS certs for gRPC service

https://gerrit.wikimedia.org/r/1029205

Change #1032293 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/docker-images/production-images@master] ratelimit: Add CertProvider to hot reload TLS certs for gRPC service

https://gerrit.wikimedia.org/r/1032293

Change #1032293 merged by JMeybohm:

[operations/docker-images/production-images@master] ratelimit: Add CertProvider to hot reload TLS certs for gRPC service

https://gerrit.wikimedia.org/r/1032293

Successfully published image docker-registry.discovery.wmnet/ratelimit:9.0.2-20240503.3fcc360, supporting hot reload of gRPC certs. This should unblock deploying the ratelimit service.