Update api-gateway ratelimit version
Closed, Resolved (Public)

Description

The API gateway's version of the ratelimit service is very old (it uses the 1.5.1 branch). We have started to see the service throw 5xx errors, with very little way to debug why. These errors also show up when other services are impaired, which is confusing and a bad signal, especially when it pages.

We should update the service to a recent version and, as part of this migration, switch to the new Prometheus metrics it offers. We can also remove the statsd gateway as part of this work.
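
Once the updated service exposes Prometheus metrics, a quick sanity check on a scraped payload could look roughly like the sketch below. It is only an illustration: the sample text and the metric names (`example_ratelimit_over_limit_total`, `example_ratelimit_redis_errors_total`) are hypothetical and are not taken from the ratelimit service, and the parser handles only the simplest lines of the Prometheus text exposition format (no timestamps, no escaped characters in label values).

```python
# Hypothetical sample of Prometheus text exposition output. The metric
# names are invented for illustration, not taken from ratelimit.
SAMPLE = """\
# HELP example_ratelimit_over_limit_total Requests rejected (hypothetical).
# TYPE example_ratelimit_over_limit_total counter
example_ratelimit_over_limit_total{domain="api",key="anon"} 12
example_ratelimit_over_limit_total{domain="api",key="user"} 3
example_ratelimit_redis_errors_total 1
"""


def parse_samples(text):
    """Parse simple exposition lines into (name, labels, value) tuples.

    Handles only the `name{labels} value` and `name value` forms.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, label_part = series.partition("{")
        labels = {}
        for pair in label_part.rstrip("}").split(","):
            if pair:
                key, _, val = pair.partition("=")
                labels[key] = val.strip('"')
        samples.append((name, labels, float(value)))
    return samples


def total(samples, metric):
    """Sum a metric across all of its label combinations."""
    return sum(value for name, _, value in samples if name == metric)


if __name__ == "__main__":
    parsed = parse_samples(SAMPLE)
    print(total(parsed, "example_ratelimit_over_limit_total"))
```

In practice the payload would come from the service's metrics endpoint rather than an inline string, and a real check would more likely be a Prometheus query or alert rule than a script.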

Event Timeline

jasmine_ added a subscriber: Jasmine.
jasmine_ removed a subscriber: Jasmine.

Related: it'd be nice if this work could get us some logging enhancements. We saw in T390215 that the current version's debug logging (the only log level that currently gives us useful output) was crushing Logstash, while the default log levels are far too quiet about problems such as connection failures. If there have been improvements in the last 4 (!) years, that would be nice.

Change #1165475 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] api-gateway: use more recent ratelimit image

https://gerrit.wikimedia.org/r/1165475

Change #1165475 merged by jenkins-bot:

[operations/deployment-charts@master] api-gateway: use more recent ratelimit image

https://gerrit.wikimedia.org/r/1165475

Change #1166221 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] api-gateway: use latest build of ratelimit service in prod

https://gerrit.wikimedia.org/r/1166221

Change #1166221 merged by jenkins-bot:

[operations/deployment-charts@master] api-gateway: use latest build of ratelimit service in prod

https://gerrit.wikimedia.org/r/1166221

Change #1166412 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/docker-images/production-images@master] ratelimit: bump version number

https://gerrit.wikimedia.org/r/1166412

Change #1166790 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] api-gateway: use ratelimit's inbuilt promethus-statsd agent

https://gerrit.wikimedia.org/r/1166790

Change #1166412 merged by Hnowlan:

[operations/docker-images/production-images@master] ratelimit: bump version number

https://gerrit.wikimedia.org/r/1166412

Change #1166790 merged by jenkins-bot:

[operations/deployment-charts@master] api-gateway: use ratelimit's inbuilt promethus-statsd agent

https://gerrit.wikimedia.org/r/1166790

We're now running the latest HEAD of the ratelimit service on bullseye, and have removed the prometheus-statsd-exporter from the api-gateway deployment.