Thumbor's use of poolcounter is rate limiting Kubernetes IPs
Open, Needs Triage, Public

Assigned To

None

Authored By

	hnowlan
	Jun 19 2023, 3:43 PM

Description

Across all containers on eqiad for example:

27513 thumbor-ip-10.64.32.134
27523 thumbor-ip-10.64.48.229
27905 thumbor-ip-10.64.16.189
28460 thumbor-ip-10.64.32.135

While MediaWiki uses poolcounter for rate limiting internal IPs, Thumbor is in theory supposed to use it only for external IPs. It's fairly clear that issues like T338765 are caused by this: when a request for a private wiki thumbnail is rejected, the error message includes the key used to check poolcounter, and that key contains a Kubernetes worker IP address.

I would propose that we add a mechanism for excluding internal IP addresses from rate limiting, unless we want to keep this behaviour for internal IPs. Should we be using x-client-ip instead of x-forwarded-for?
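As a rough illustration of what such an exclusion mechanism could look like, here is a minimal sketch using Python's standard `ipaddress` module. The names `ALLOWLIST` and `should_rate_limit` and the `10.0.0.0/8` range are assumptions made for the example; they are not taken from thumbor-plugins or from the production configuration.

```python
import ipaddress

# Hypothetical allowlist of internal ranges; the range below is illustrative
# only and is not the real production configuration.
ALLOWLIST = [ipaddress.ip_network("10.0.0.0/8")]


def should_rate_limit(client_ip, allowlist=ALLOWLIST):
    """Return False when client_ip falls inside an allowlisted network."""
    try:
        addr = ipaddress.ip_address(client_ip.strip())
    except ValueError:
        # Unparseable addresses are not exempted in this sketch.
        return True
    return not any(addr in net for net in allowlist)
```

With this sketch, an address such as 10.64.32.134 would skip the poolcounter check, while an external address would still be rate limited.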

It's worth noting that Thumbor explicitly uses X-Forwarded-For for this purpose (splitting it on commas and selecting the first element), and so something very odd is happening with these requests given that all 4 IPs are Kubernetes hosts.
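The header handling described above can be sketched roughly as follows. This is not the actual plugin code: the function names `client_ip_from_xff` and `poolcounter_key` are hypothetical, and only the `thumbor-ip-<address>` key shape is taken from the keys listed in this task.

```python
def client_ip_from_xff(header_value):
    """Return the first comma-separated element of X-Forwarded-For, or None."""
    if not header_value:
        return None
    first = header_value.split(",")[0].strip()
    return first or None


def poolcounter_key(ip):
    """Build a per-IP key in the shape seen in this task (thumbor-ip-<address>)."""
    return f"thumbor-ip-{ip}"
```

If something upstream puts a Kubernetes worker address first in the header, the first element is that worker's IP, and every request passing through that worker ends up sharing one key.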

All of the Kubernetes hosts above are running changeprop, but that's not a guaranteed link, as most Kubernetes workers are. However, lots of requests from changeprop seem likely in light of T337649.

Details

	Subject	Repo	Branch	Lines +/-
	poolcounter: introduce allowlist to skip rate limit	operations/software/thumbor-plugins	master	+10 -0
	thumbor: don't set x-forwarded-for at haproxy level	operations/deployment-charts	master	+1 -4


Related Objects

Status	Assigned	Task
Resolved	hnowlan	T337649 Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad
Open	Joe	T338297 Revisit thumbor's poolcounter integration
Open	None	T339863 Thumbor's use of poolcounter is rate limiting Kubernetes IPs

Event Timeline

hnowlan created this task. Jun 19 2023, 3:43 PM

hnowlan updated the task description. Jun 19 2023, 3:48 PM

hnowlan updated the task description. Jun 19 2023, 3:51 PM

Change 931592 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: don't set x-forwarded-for at haproxy level

https://gerrit.wikimedia.org/r/931592

gerritbot added a project: Patch-For-Review. Jun 20 2023, 12:20 PM

However, lots of requests from changeprop seem likely in light of T337649.

Do you mean jobrunners (mw appservers, PHP) or changeprop itself? If changeprop is fetching upload.wikimedia.org/Thumbor, for what service or stream is it doing that?

Krinkle added a project: Performance-Team (Radar). Jun 20 2023, 1:15 PM

In T339863#8948816, @Krinkle wrote:

However, lots of requests from changeprop seem likely in light of T337649.

Do you mean jobrunners (mw appservers, PHP) or changeprop itself? If changeprop is fetching upload.wikimedia.org/Thumbor, for what service or stream is it doing that?

To be clear, I mean the ThumbnailRender job in changeprop-jobqueue (~~so requests will be coming from k8s~~). There have been a lot of batch uploads of PDFs recently, which are spurring long backlogs of jobs. The concurrency on some of these was causing issues, but we've limited some of that impact.

Part of the complication here is that up until recently only 4 k8s hosts were pooled in pybal, as opposed to the full 20+ hosts that are pooled now. These 4 IPs were being set in the x-forwarded-for header (somehow), and that's why only 4 specific hosts show up in the ticket. This means we'll be distributing the load across a wider list of hosts, but the bug is still present. I'm still trying to nail down where this is being added. ngrep on the k8s hosts themselves shows the x-ff list looking as expected.

hnowlan mentioned this in T338765: Image 429 errors for most images on private wikis. Jun 20 2023, 2:36 PM

akosiaris moved this task from Incoming 🐫 to Doing 😎 on the serviceops board. Jul 6 2023, 3:23 PM

Krinkle mentioned this in T341666: Wikimedia\RequestTimeout\RequestTimeoutException on de:Holomorphe_Funktion and several other math-heavy articles. Jul 13 2023, 4:28 PM

Krinkle removed a project: Performance-Team (Radar). Aug 6 2023, 10:27 PM

Krinkle unsubscribed.

MarkTraceur added a project: Structured-Data-Backlog (Current Work). Mar 4 2024, 5:34 PM

MarkTraceur moved this task from Incoming to Ready for Estimation on the Structured-Data-Backlog (Current Work) board.

Change 931592 abandoned by Hnowlan:

[operations/deployment-charts@master] thumbor: don't set x-forwarded-for at haproxy level

Reason:

https://gerrit.wikimedia.org/r/931592

Maintenance_bot removed a project: Patch-For-Review. Mar 7 2024, 4:30 PM

MarkTraceur edited projects, added Structured-Data-Backlog; removed Structured-Data-Backlog (Current Work). May 29 2024, 4:24 PM

MarkTraceur moved this task from Triage to Tracking on the Structured-Data-Backlog board.

hnowlan moved this task from Doing 😎 to 🛎 Services & Oids on the serviceops board. May 30 2024, 2:09 PM

Change #1063217 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] poolcounter: introduce allowlist to skip rate limit

https://gerrit.wikimedia.org/r/1063217

gerritbot added a project: Patch-For-Review. Oct 7 2024, 10:04 AM
