Page MenuHomePhabricator

Thumbor's use of poolcounter is rate limiting Kubernetes IPs
Open, Needs TriagePublic

Description

Across all containers on eqiad for example:

27513 thumbor-ip-10.64.32.134
27523 thumbor-ip-10.64.48.229
27905 thumbor-ip-10.64.16.189
28460 thumbor-ip-10.64.32.135

While mediawiki uses poolcounter for rate limiting internal IPs, Thumbor is in theory supposed to use it only for external IPs. It's fairly clear that issues like T338765 are being caused by this as when a request for a private wiki thumbnail is rejected, the error message includes a Kubernetes worker IP address key used to check poolcounter.

I would propose that we add an mechanism for excluding rate limiting internal IP addresses unless we want to keep this behaviour for internal IPs. Should we be using x-client-ip instead of x-forwarded-for?

It's worth noting that Thumbor explicitly uses X-Forwarded-For for this purpose (splitting it on commas and selecting the first element), and so something very odd is happening with these requests given that all 4 IPs are Kubernetes hosts.

All of the Kubernetes hosts above are running changeprop, but that's not a guaranteed link as most kubernetes workers are. However, lots of requests from changeprop seem likely in light of T337649.

Event Timeline

Change 931592 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: don't set x-forwarded-for at haproxy level

https://gerrit.wikimedia.org/r/931592

However, lots of requests from changeprop seem likely in light of T337649.

Do you mean jobrunners (mw appservers, PHP) or changeprop itself? If changeprop is fetching upload.wikimedia.org/Thumbor, for what service or stream is it doing that?

However, lots of requests from changeprop seem likely in light of T337649.

Do you mean jobrunners (mw appservers, PHP) or changeprop itself? If changeprop is fetching upload.wikimedia.org/Thumbor, for what service or stream is it doing that?

To be clear I mean the ThumbnailRender job in changeprop-jobqueue (so requests will be coming from k8s). There's been a lot of batch uploads of PDFs recently which are spurring long backlogs of jobs. The concurrency on some of these were causing issues but we've limited some of that impact.

Part of the complication here is that up until recently only 4 k8s hosts were pooled in pybal as opposed to the full 20+ hosts that are pooled now. These 4 IPs were being set in the x-forwarded-for header (somehow) and that's why only 4 specific hosts show up cited in the ticket. This means we'll be distributing the load across a wider list of hosts but the bug is still present. Trying to nail down where this is being added. ngrep on the k8s hosts themselves show the x-ff list looking as expected.

Change 931592 abandoned by Hnowlan:

[operations/deployment-charts@master] thumbor: don't set x-forwarded-for at haproxy level

Reason:

https://gerrit.wikimedia.org/r/931592

Change #1063217 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] poolcounter: introduce allowlist to skip rate limit

https://gerrit.wikimedia.org/r/1063217