Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails
Open, Medium, Public

Description

When pages with many thumbnails, such as https://commons.wikimedia.org/w/index.php?title=Category:Media_needing_categories_as_of_26_August_2018&filefrom=Starr-100901-8896-Dubautia+linearis-habitat-Kanaio+Natural+Area+Reserve-Maui+%2824419916404%29.jpg%0AStarr-100901-8896-Dubautia+linearis-habitat-Kanaio+Natural+Area+Reserve-Maui+%2824419916404%29.jpg#mw-category-media and https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings/Collection/Bavarian_State_Painting_Collections/18th_Century, haven't been visited for a while, not all of the thumbnails load. Many of them fail with an error like:
Request from **** via cp3061 cp3061, Varnish XID 971825121
Upstream caches: cp3061 int
Error: 429, Too Many Requests at Wed, 21 Oct 2020 16:03:11 GMT

Event Timeline

T252426 (Lower per-IP PoolCounter throttling Thumbor settings) set the per-IP throttle for new thumbnail generation to at most 4 thumbnails being generated simultaneously for any one IP. Any additional requests, up to 50, are queued for up to 1 second, after which they are dropped. This limit is enforced somewhat fuzzily due to real-world complexities, but "no more than 50 new thumbnails per page load" is a good rule of thumb.
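As a rough illustration of the throttle described above, here is a minimal Python sketch. It is not the PoolCounter or Thumbor implementation. It assumes all requests from one IP arrive at the same instant, that every thumbnail takes the same hypothetical `render_seconds` to render, and that the queue of 50 is counted in addition to the 4 running requests. The names `simulate_burst` and `ThrottleOutcome` are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class ThrottleOutcome:
    rendered: int   # thumbnails that got a worker in time
    timed_out: int  # queued, then dropped when the queue timeout expired
    rejected: int   # never queued because the queue was already full


def simulate_burst(requests: int, render_seconds: float,
                   workers: int = 4, max_queue: int = 50,
                   timeout_seconds: float = 1.0) -> ThrottleOutcome:
    """Model one IP sending `requests` uncached thumbnail requests at t=0.

    Assumption: identical render time per thumbnail, so worker slots
    free up in batches of `workers` every `render_seconds`.
    """
    running = min(requests, workers)
    queued = min(requests - running, max_queue)
    rejected = requests - running - queued

    rendered = running
    timed_out = 0
    for position in range(queued):
        # The request at this queue position waits for this many full batches.
        wait = (position // workers + 1) * render_seconds
        if wait <= timeout_seconds:
            rendered += 1
        else:
            timed_out += 1
    return ThrottleOutcome(rendered, timed_out, rejected)


if __name__ == "__main__":
    print(simulate_burst(requests=50, render_seconds=0.25))
    print(simulate_burst(requests=200, render_seconds=0.25))
```

Under these simplified assumptions, a burst of 50 requests at 0.25 seconds each ends with 20 thumbnails rendered and 30 dropped, and a burst of 200 has most requests turned away before they are ever queued. Real traffic arrives staggered and render times vary, so the actual numbers will differ.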

This limit only applies to new thumbnails that have not been previously generated and cached. Thumbnails that have already been generated are served from the cache or file storage and are not affected by the same limits. This means that refreshing the page will often generate most of the remaining thumbnails.

This throttle was previously set higher (500 in 8 seconds), but was lowered due to an outage. While I don't see the limit being increased to previous levels, there may be an intermediate level that maintains service stability but allows more thumbnails to be generated at once.

Looking forward, the two ways to actually fix this problem would be to improve the Thumbor throttling system (T252749) or to expand the use of lazy loading in MediaWiki to slow down the request rate on pages with many thumbnails.

AntiCompositeNumber renamed this task from Frequent "Error: 429, Too Many Requests" errors on pages with thumbnails to Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails. Oct 21 2020, 9:22 PM

@Gilles how did you come up with the number 50? The standard number of thumbnails in a Commons category is 200 and it's very common for galleries to have more thumbnails than that. I've hit this limit many times over the last couple of months. This is a source of much annoyance.

The limit is related to the number of cores we have rendering thumbnails in production. There's a fundamental issue in the current throttling mechanism, in that each "queued" request from a single user can lock up a Thumbor process. That's due to Thumbor being single-threaded and us running it on bare metal, where the number of Thumbor processes is pre-determined and fixed. Running more processes than we have cores would be unhelpful, because during normal operations you could end up with concurrent requests slowing each other down. Instead we prefer to queue things, so that the first user to get a free core gets it exclusively and their thumbnail is rendered as fast as possible.

There's no existing load-balancing software that does everything we need to fix this problem, which would include keeping queues for each user at the load balancer level, to ensure that users who request a lot of thumbnails get them eventually. The current system, while imperfect, ensures that only users requesting an unreasonable amount of thumbnails at once get throttled, while everyone else still gets the thumbnails they've requested.

In the current situation, it's not reasonable for a single user to request hundreds of fresh thumbnails at once. Commons needs to reduce the default amount of thumbnails on those pages if you're hitting the limit frequently. And as @AntiCompositeNumber mentioned, lazy-loading needs to be added to pages where it's not applied yet. I think that those are sane things to do in general in a context like ours where new thumbnails are generated on the spot on demand, which is very unusual on the web.

When Thumbor gets migrated to Kubernetes at some point this fiscal year, I will attempt to fix this throttling limitation. I'm hoping that the orchestration of Kubernetes will allow us to start as many new Thumbor processes as there are processes locked on the PoolCounter throttling. Essentially, extra requests from a "greedy" user will get locked waiting on PoolCounter telling them that they can go ahead; meanwhile the orchestrator would detect that those Thumbor processes aren't processing requests at the moment and would spin up new ones temporarily. Then, once the spike is over, the number of concurrent processes can be scaled back down. This flexible pool of processes is what would allow us to have a proper queue per user, if you will, meaning that greedy users would get potentially hundreds of requests queued with no damage done to the overall processing capacity. This idea is theoretical at the moment; we'll have to see if it can work in practice. But I think it's the simplest way to accommodate our use case without having to write a completely custom load balancer.
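The scaling rule sketched in the comment above could be expressed roughly as follows. This is only an illustration of the idea, not an existing Kubernetes or Thumbor feature: `desired_process_count` is a made-up name, and `max_processes` is a hypothetical safety cap that is not mentioned in this thread.

```python
def desired_process_count(base_processes: int, locked_processes: int,
                          max_processes: int) -> int:
    """Return how many Thumbor processes to run right now.

    base_processes:   the normal pool size (e.g. one per core)
    locked_processes: processes currently blocked waiting on the throttle
    max_processes:    hypothetical upper bound so a spike cannot grow forever
    """
    if base_processes < 0 or locked_processes < 0 or max_processes < 0:
        raise ValueError("process counts must be non-negative")
    # Replace every locked process with a temporary one, up to the cap.
    return min(base_processes + locked_processes, max_processes)


if __name__ == "__main__":
    # A "greedy" user has 50 requests waiting on the throttle.
    print(desired_process_count(base_processes=8, locked_processes=50,
                                max_processes=100))
```

Once the locked requests drain, `locked_processes` falls back to zero and the same rule scales the pool back to its base size.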

Thanks for your explanation Gilles. You can see the bug in action at https://commons.wikimedia.org/wiki/Special:NewFiles

Many other limits are higher for logged-in users or users with higher rights than for logged-out users. Would it be possible to do the same here?

Seeing that it heavily affects Special:NewFiles with 50 thumbnails, it seems like the current limits aren't being respected. When the new limit was put in place, it was tuned to keep Special:NewFiles working fine with up to 50 fresh thumbnails. Right now it's clearly failing a lot of them.

I also notice that the 429 responses may be emitted by Varnish (int-front response). Maybe this is related to the recent Varnish 6 upgrade.

@ema can you confirm that int-front cache status in the response means that the 429 was emitted by Varnish? From one of those: https://github.com/wikimedia/puppet/blob/338c1bd746aedf5c7ea7303cf31c64f30b9fee93/modules/varnish/templates/upload-frontend.inc.vcl.erb#L205

Cannot confirm. int-front means that the response was generated by Varnish, but it does not mean that the 429 response comes from Varnish itself. For example, we generate custom error pages if the response from the origin has no body: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/wikimedia-frontend.vcl.erb#1051

I've confirmed that it is the PoolCounter throttle from Thumbor, by hitting it myself (that's my own ipv6 address):

Screenshot 2020-10-23 at 14.31.20.png (198×688 px, 34 KB)

On a Special:NewFiles load I get about 11 thumbnails rendered, and 17 429s. I don't think the issue is the queue size of 50; it's the timeout of 1 second (it used to be 8 seconds). I'll increase it back to 4 to see if it's enough for Special:NewFiles to get a much higher success rate.

Change 636012 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Increase timeout of Thumbor per-ip throttling

https://gerrit.wikimedia.org/r/636012

Gilles triaged this task as Medium priority.

Change 636012 merged by Ema:
[operations/puppet@production] Increase timeout of Thumbor per-ip throttling

https://gerrit.wikimedia.org/r/636012

The new timeout is in place. It seems to help Special:NewFiles on Commons to a degree but still doesn't avoid 429s entirely. Unfortunately we can't go back to the pre-outage values that worked better, as the increased concurrency for that scenario contributed to the outage.

One workaround I can think of in the current setup would be to change the haproxy load-balancing algorithm from "first" to hashing by X-Forwarded-For value (which contains the client IP).

This means that each IP address will only be given one Thumbor process. The main drawback of this approach is that it's leaving available processing power unused for a given user if there is any. I.e. instead of being able to effectively use 4 concurrent processes at the moment (and locking up to 50 more...), a single user will only be able to leverage one at a time, but won't lock any. In fact the PoolCounter per-IP lock should never trigger anymore.

In order to ensure at least equivalent reliability for a given user as we have now, we'll need to at least quadruple the haproxy queue timeout (currently set at 10 seconds). This will essentially determine how old the oldest request can be, i.e. the maximum waiting period to get its ticket in line for the one Thumbor process it's going to be routed to.

I think it's worth trying as an experiment. It might introduce other problems that I can't foresee right now, but it's worth a shot. I think it's appealing that it moves the per-IP queueing responsibility to haproxy instead of having it handled by the ill-suited PoolCounter throttle that keeps many processes locked and unusable.
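To make the routing idea above concrete, here is a minimal Python sketch of hashing a client IP to a single backend process. It is not how haproxy computes its hash: SHA-256 is used purely for illustration, and `pick_backend` and the backend count of 8 are assumptions made for this example.

```python
import hashlib


def pick_backend(client_ip: str, backend_count: int) -> int:
    """Map a client IP (the X-Forwarded-For value) to one backend index.

    The same IP always yields the same index, so all of that client's
    requests queue behind a single Thumbor process.
    """
    if backend_count <= 0:
        raise ValueError("backend_count must be positive")
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % backend_count


if __name__ == "__main__":
    for ip in ("2001:db8::1", "2001:db8::2", "192.0.2.10"):
        print(ip, "->", pick_backend(ip, backend_count=8))
```

Because the mapping depends only on the IP, a burst from one client never occupies more than one process, which is why the per-IP PoolCounter lock should no longer trigger under this scheme.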

Change 636024 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Switch Thumbor haproxy load balancing to IP hash

https://gerrit.wikimedia.org/r/636024

The other drawback of that workaround I can think of is that it will re-introduce some drive-by casualties for "greedy" users. For example, if your IP address hashes to the same process as a user who's just requested 50 thumbnails on it, then tough luck, you'll have to wait for that user's thumbnails to be rendered before your first one is, even if there are other Thumbor processes sitting idle in the meantime.

So it won't be all positive for everyone; it's a "pick your poison" tradeoff, but maybe waiting is preferable to 429s. The question is whether users will hit 5xx errors due to the 40s per-process haproxy queue timeout less often than they're hitting 429s at the moment, which we should be able to see on the existing dashboards.
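The tradeoff above comes down to simple arithmetic: with one process per IP hash, a request waits for everything queued ahead of it on that process. The sketch below works this out under the assumption of a uniform, hypothetical render time per thumbnail; the function names are invented for this example, and only the 40 second timeout comes from this thread.

```python
def queue_wait_seconds(requests_ahead: int, render_seconds: float) -> float:
    """Time a request waits when `requests_ahead` thumbnails are queued
    before it on the same single-threaded process."""
    return requests_ahead * render_seconds


def hits_queue_timeout(requests_ahead: int, render_seconds: float,
                       queue_timeout_seconds: float = 40.0) -> bool:
    """True if the wait exceeds the per-process queue timeout, in which
    case the user would see an error instead of a slow thumbnail."""
    return queue_wait_seconds(requests_ahead, render_seconds) > queue_timeout_seconds


if __name__ == "__main__":
    # A user hashed to the same process as someone who just requested 50 thumbnails.
    for render_seconds in (0.5, 1.0):
        print(render_seconds, hits_queue_timeout(50, render_seconds))
```

With these made-up render times, a wait behind 50 thumbnails stays under the timeout at 0.5 seconds each but exceeds it at 1 second each, so the real-world outcome depends on how long typical thumbnails take to render.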

Multichill added subscribers: Keegan, Ramsey-WMF.

I just noticed this also breaks https://commons.wikimedia.org/wiki/Special:MediaSearch if you sort it by "recency".

I haven't looked into the proposed workarounds/solutions. From a user's point of view I wouldn't mind slow-loading thumbnails; I just hate getting the broken thumbnails.

Change 638109 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] swift: pass the 'X-Client-IP' header to thumbor

https://gerrit.wikimedia.org/r/638109

Change 638109 merged by Effie Mouzeli:
[operations/puppet@production] swift: pass the 'X-Client-IP' header to thumbor

https://gerrit.wikimedia.org/r/638109

Peachey88 renamed this task from Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails to Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails. Mar 31 2021, 8:51 PM

Resetting inactive assignee account