
Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails
Open, Medium, Public

Description

When pages with many thumbnails, such as https://commons.wikimedia.org/w/index.php?title=Category:Media_needing_categories_as_of_26_August_2018&filefrom=Starr-100901-8896-Dubautia+linearis-habitat-Kanaio+Natural+Area+Reserve-Maui+%2824419916404%29.jpg%0AStarr-100901-8896-Dubautia+linearis-habitat-Kanaio+Natural+Area+Reserve-Maui+%2824419916404%29.jpg#mw-category-media and https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings/Collection/Bavarian_State_Painting_Collections/18th_Century, haven't been visited for a while, not all of the thumbnails load. Many of them fail with an error like:
Request from **** via cp3061 cp3061, Varnish XID 971825121
Upstream caches: cp3061 int
Error: 429, Too Many Requests at Wed, 21 Oct 2020 16:03:11 GMT

Event Timeline


The Thumbor settings from T252426 (Lower per-IP PoolCounter throttling) set the per-IP throttle so that no more than 4 new thumbnails are generated simultaneously for any one IP. Up to 50 additional requests are queued for up to 1 second, after which they are dropped. This limit is enforced somewhat fuzzily due to real-world complexities, but "no more than 50 new thumbnails per page load" is a good rule of thumb.
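A minimal sketch of how such a throttle behaves (this is not the actual PoolCounter code; the class and parameter names are invented for illustration):

```python
import threading

class PerIpThrottle:
    """Toy model of a PoolCounter-style per-IP throttle: at most
    `workers` thumbnails render concurrently for one IP, up to
    `maxqueue` further requests wait `timeout` seconds for a slot,
    and anything beyond that is rejected with a 429 immediately."""

    def __init__(self, workers=4, maxqueue=50, timeout=1.0):
        self.slots = threading.Semaphore(workers)
        self.maxqueue = maxqueue
        self.timeout = timeout
        self.queued = 0
        self.lock = threading.Lock()

    def acquire(self):
        with self.lock:
            if self.queued >= self.maxqueue:
                return 429          # queue full: reject right away
            self.queued += 1
        got_slot = self.slots.acquire(timeout=self.timeout)
        with self.lock:
            self.queued -= 1
        return 200 if got_slot else 429  # timed out waiting: 429

    def release(self):
        self.slots.release()        # called when a render finishes
```

With the production-like values (4 workers, queue of 50, 1-second wait), a page load requesting many fresh thumbnails gets 4 rendered immediately; the rest wait briefly and then start failing with 429s, which matches the behaviour described above.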

This limit only applies to new thumbnails that have not been previously generated and cached. Thumbnails that have already been generated are served from the cache or file storage and are not affected by the same limits. This means that refreshing the page will often generate most of the remaining thumbnails.

This throttle was previously set higher (500 in 8 seconds), but was lowered due to an outage. While I don't see the limit being increased to previous levels, there may be an intermediate level that maintains service stability but allows more thumbnails to be generated at once.

Looking forward, the two ways to actually fix this problem would be to improve the Thumbor throttling system (T252749) or to expand the use of lazy loading in MediaWiki to slow down the request rate on pages with many thumbnails.

AntiCompositeNumber renamed this task from Frequent "Error: 429, Too Many Requests" errors on pages with thumbnails to Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails. Oct 21 2020, 9:22 PM

@Gilles how did you come up with the number 50? The standard number of thumbnails in a Commons category is 200 and it's very common for galleries to have more thumbnails than that. I've hit this limit many times over the last couple of months. This is a source of much annoyance.

The limit is related to the number of cores we have rendering thumbnails in production. There's a fundamental issue in the current throttling mechanism, in that each "queued" request from a single user can lock up a Thumbor process. That's due to Thumbor being single-threaded and us running it on bare metal, where the number of Thumbor processes is pre-determined and fixed. Running more processes than we have cores would be unhelpful, because during normal operations you could end up with concurrent requests slowing each other down. Instead we prefer to queue things, so that the first user to get a free core gets it exclusively and their thumbnail is rendered as fast as possible.
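The capacity problem can be illustrated with a toy calculation (the core count and names here are invented, not the production numbers):

```python
CORES = 8          # fixed number of single-threaded Thumbor workers
PER_IP_LIMIT = 4   # concurrent renders allowed per IP

def workers_occupied(requests_from_one_ip):
    """Each request occupies a worker process even while it is merely
    waiting on the per-IP PoolCounter lock, so waiters beyond the
    limit still consume workers instead of leaving them free."""
    rendering = min(requests_from_one_ip, PER_IP_LIMIT)
    waiting = min(requests_from_one_ip - rendering, CORES - rendering)
    return rendering + waiting

# A single user requesting 50 thumbnails ties up every worker,
# even though only 4 of their requests are actually rendering:
print(workers_occupied(50))  # -> 8
```

This is why simply allowing a longer queue per user doesn't scale: the waiters themselves starve everyone else of rendering capacity.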

There's no existing load-balancing software that does everything we need to fix this problem, which would include keeping queues for each user at the load-balancer level, to ensure that users who request a lot of thumbnails get them eventually. The current system, while imperfect, ensures that only users requesting an unreasonable number of thumbnails at once get throttled, while everyone else still gets the thumbnails they've requested.

In the current situation, it's not reasonable for a single user to request hundreds of fresh thumbnails at once. Commons needs to reduce the default number of thumbnails on those pages if you're hitting the limit frequently. And as @AntiCompositeNumber mentioned, lazy loading needs to be added to pages where it's not applied yet. I think those are sane things to do in general in a context like ours, where new thumbnails are generated on the spot, on demand, which is very unusual on the web.

When Thumbor gets migrated to Kubernetes at some point this fiscal year, I will attempt to fix this throttling limitation. I'm hoping that Kubernetes orchestration will allow us to start as many new Thumbor processes as the number of processes that get locked on the PoolCounter throttling. Essentially, extra requests from a "greedy" user will be locked waiting on PoolCounter to tell them that they can go ahead; meanwhile the orchestrator would detect that those Thumbor processes aren't processing requests at the moment and would spin up new ones temporarily. Then, once the spike is over, the number of concurrent processes can be scaled back down. This flexible pool of processes is what would allow us to have a proper queue per user, if you will, meaning that greedy users could get potentially hundreds of requests queued with no damage done to the overall processing capacity. This idea is theoretical at the moment; we'll have to see if it can work in practice. But I think it's the simplest way to accommodate our use case without having to write a completely custom load balancer.

Thanks for your explanation Gilles. You can see the bug in action at https://commons.wikimedia.org/wiki/Special:NewFiles

Many other limits are higher for logged-in users or users with higher rights than for logged-out users. Would it be possible to do the same here?

Seeing that it heavily affects Special:NewFiles with 50 thumbnails, it seems like the current limits aren't being respected. When the new limit was put in place, it was tuned to keep Special:NewFiles working fine with up to 50 fresh thumbnails. Right now it's clearly failing a lot of them.

I also notice that the 429 responses may be emitted by Varnish (int-front response). Maybe this is related to the recent Varnish 6 upgrade.

@ema can you confirm that int-front cache status in the response means that the 429 was emitted by Varnish? From one of those: https://github.com/wikimedia/puppet/blob/338c1bd746aedf5c7ea7303cf31c64f30b9fee93/modules/varnish/templates/upload-frontend.inc.vcl.erb#L205

Cannot confirm. int-front means that the response was generated by Varnish, but it does not mean that the 429 response comes from Varnish itself. For example, we generate custom error pages if the response from the origin has no body: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/wikimedia-frontend.vcl.erb#1051

I've confirmed that it is the PoolCounter throttle from Thumbor, by hitting it myself (that's my own IPv6 address):

Screenshot 2020-10-23 at 14.31.20.png (198×688 px, 34 KB)

On a Special:NewFiles load I get about 11 thumbnails rendered and 17 429s. I don't think the issue is the queue size of 50; it's the timeout of 1 second (it used to be 8 seconds). I'll increase it back to 4 seconds to see if that's enough for Special:NewFiles to get a much higher success rate.

Change 636012 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Increase timeout of Thumbor per-ip throttling

https://gerrit.wikimedia.org/r/636012

Gilles triaged this task as Medium priority.

Change 636012 merged by Ema:
[operations/puppet@production] Increase timeout of Thumbor per-ip throttling

https://gerrit.wikimedia.org/r/636012

The new timeout is in place. It seems to help Special:NewFiles on Commons to a degree but still doesn't avoid 429s entirely. Unfortunately we can't go back to the pre-outage values that worked better, as the increased concurrency for that scenario contributed to the outage.

One workaround I can think of in the current setup would be to change the haproxy load-balancing algorithm from "first" to hashing by X-Forwarded-For value (which contains the client IP).

This means that each IP address will only be given one Thumbor process. The main drawback of this approach is that it leaves available processing power unused for a given user, if there is any. I.e. instead of being able to effectively use 4 concurrent processes at the moment (and lock up to 50 more...), a single user will only be able to leverage one at a time, but won't lock any. In fact, the PoolCounter per-IP lock should never trigger anymore.

In order to ensure at least equivalent reliability for a given user as we have now, we'll need to at least quadruple the haproxy queue timeout (currently set at 10 seconds). This will essentially determine how old the oldest request can be, i.e. the maximum waiting period for a request to get its ticket in line for the one Thumbor process it's going to be routed to.
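In haproxy terms, the change being discussed would look roughly like this (backend and server names are invented; only the `balance` and `timeout queue` lines are the point):

```
backend thumbor
    # was: balance first  -- fill the lowest-numbered available server
    balance hdr(X-Forwarded-For)   # pin each client IP to one process
    timeout queue 40s              # quadrupled from the current 10s
    server thumbor-proc-1 127.0.0.1:8801 check
    server thumbor-proc-2 127.0.0.1:8802 check
```

`balance hdr()` hashes the named request header, so all requests carrying the same client IP land on the same backend process.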

I think it's worth trying as an experiment. It might introduce other problems that I can't foresee right now, but it's worth a shot. I think it's appealing that it moves the per-IP queueing responsibility to haproxy instead of having it handled by the ill-fitted Poolcounter throttle that keeps many processes locked and unusable.

Change 636024 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Switch Thumbor haproxy load balancing to IP hash

https://gerrit.wikimedia.org/r/636024

The other drawback of that workaround I can think of is that it will re-introduce some drive-by casualties caused by "greedy" users. For example, if your IP address hashes to the same process as a user who's just requested 50 thumbnails on it, then tough luck: you'll have to wait for that user's thumbnails to be rendered before your first one is, even if there are other Thumbor processes sitting idle in the meantime.

So it won't be all positive for everyone; it's a "pick your poison" tradeoff, but maybe waiting is preferable to 429s. The question is whether users will hit 5xx errors due to the 40s per-process haproxy queue timeout less often than they currently hit 429s, which we should be able to see on the existing dashboards.

Multichill added subscribers: Keegan, Ramsey-WMF.

I just noticed this also breaks https://commons.wikimedia.org/wiki/Special:MediaSearch if you sort it by "recency".

Haven't looked into the proposed workarounds/solutions. From a user point of view I wouldn't mind slow-loading thumbnails; I just hate getting broken thumbnails.

Change 638109 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] swift: pass the 'X-Client-IP' header to thumbor

https://gerrit.wikimedia.org/r/638109

Change 638109 merged by Effie Mouzeli:
[operations/puppet@production] swift: pass the 'X-Client-IP' header to thumbor

https://gerrit.wikimedia.org/r/638109

Peachey88 renamed this task from Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails to Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails.Mar 31 2021, 8:51 PM

Resetting inactive assignee account

Change 636024 abandoned by Effie Mouzeli:

[operations/puppet@production] Switch Thumbor haproxy load balancing to IP hash

Reason:

not relevant any more

https://gerrit.wikimedia.org/r/636024

Change 883570 had a related patch set uploaded (by TheDJ; author: TheDJ):

[mediawiki/core@master] Lists of images should use lazy loading

https://gerrit.wikimedia.org/r/883570

The above patch should cause native lazy loading of images by the browser. This will cause fewer initial requests to the thumbor server on those specific page types with a lot of images. Fewer requests means a lower request rate average, which should decrease the chance that people encounter the rate limit.

This is not (yet) done for normal wiki pages, because we have fewer reported problems there and because MobileFrontend uses JS for its own lazy loading, and we don't entirely know how those two systems will interact (errors are unlikely, but perceived behavior might be strange). The pages converted here are much less critical on mobile.
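For reference, native lazy loading is just an attribute on the image tag; the browser then defers fetching off-screen thumbnails until the user scrolls near them (the path below is a made-up example):

```html
<!-- Without loading="lazy", all thumbnails are requested at page load;
     with it, only images near the viewport are fetched initially. -->
<img src="/w/thumb/example-220px.jpg" width="220" height="165"
     loading="lazy" alt="Example thumbnail">
```

Explicit width/height attributes matter here: they let the browser reserve space for the image before it loads, avoiding layout shift as lazy images arrive.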

Test wiki on Patch demo by TheDJ using patch(es) linked to this task was deleted:

https://patchdemo.wmflabs.org/wikis/45fe6c92a9/w/

Change 883570 merged by jenkins-bot:

[mediawiki/core@master] Lists of images should use lazy loading

https://gerrit.wikimedia.org/r/883570

@doctaxon There are basically 2 main causes for 429 errors as a "normal" user of the website, but both have the same meaning. The servers are saying: "you are asking me to do too much work within too short a timeframe, please go away and try again at a later time."

In part you can blame this on bad actors who continuously try to hack and harass Wikipedia and other Wikimedia services, which has required the system administrators to put stricter limits in place to make sure the websites overall are able to stay up for everyone (damage limitation).

  1. One cause is this ticket. You request something like a gallery (a page with a lot of thumbnails) for which many of the thumbnails have never been generated before. This forces the system to produce 40+ fresh thumbnails at once. Each thumbnail takes at least a second, and sometimes up to 10 seconds, to produce. Together, that is too much, and the website doesn't handle this situation in the nicest of ways. Generally, though, if you later return to the same view of the page, you will see more and more of the thumbnails available to you.
  2. Repeated failures. When thumbnails that are requested continuously and/or repeatedly fail to be generated or returned by the internal systems, the outer layer of the Wikimedia website will also send 429 errors: basically, "we tried this a lot, it's not working, we can't help you any further with this right now". So this is an error on top of another error (the root cause). You can identify this case because the exact same issue for the same thumbnail URL generally recurs even the next day. This can have dozens and dozens of reasons and really needs to be judged on a case-by-case basis.

@TheDJ thanks for your comment. If these 429 errors "need to be judged case by case", does that mean it's necessary to open a bug report for every one of these errors? Thirty hours later the error still exists. And I think the thumbnail on https://commons.wikimedia.org/wiki/File:%C4%8Cern%C3%A1_Hora-pah%C3%BDl.png must already have been generated at least once, when the file was uploaded.


Yes, separate issues should be filed for those, they have nothing to do with this ticket.

In this particular case, the error is ImageMagickException: Failed to convert image convert: IDAT: invalid distance too far back. That means the image data points somewhere outside the range of the file, i.e. the file is broken. Apparently this file used to work up to 2016 or so, but after that the PNG library became stricter (to deal with potential security issues).

Generally, downloading it, processing it with a tool like pngcrush, and re-uploading should solve that, which I have done.

Change 975088 had a related patch set uploaded (by TheDJ; author: TheDJ):

[mediawiki/core@master] Fix lazy loading for ImageListPager

https://gerrit.wikimedia.org/r/975088

Change 975088 merged by jenkins-bot:

[mediawiki/core@master] Fix lazy loading for ImageListPager and File history

https://gerrit.wikimedia.org/r/975088


I understand that there are many possibilities why a thumbnail is not displayed. But as a user you get either a 429 or 500 error response, or a broken-image icon (if the 429 or 500 happens for an image in a page: gallery, category, MediaViewer). On the server, however, the reason is known: a PNG vulnerability or something else.

It would make for a much better UX if, instead of the generic 429 or 500 response, the actual reason were visible to the user. It would allow easier error reports, or error reports would not be needed at all if the user can do something themselves (e.g. fixing the broken PNG). In the case of pages (gallery, category), the server could send a replacement image containing the text of the actual error.


I agree. Anyway, these are different causes. I've created T353950: [Tracking] Thumbnail 429 rate limiting on failed requests of complex or broken media files to keep track of these various issues that have to do with the problem of large and/or complex or repeating failures of individual files.

Just trying to think up solutions: if Thumbor gives a 429, could Varnish instead send an (uncached) redirect to the original file (if it's not huge) or to some standard size (like the size used on the image description page, which is likely to already be rendered)?

It seems like graceful degradation, where if we don't have a thumb we just return a larger one and let the browser shrink it, is much preferable to a hard 429 failure. At least as long as the file isn't one of those 2 GB NASA images.