When pages with many thumbnails, like https://commons.wikimedia.org/w/index.php?title=Category:Media_needing_categories_as_of_26_August_2018&filefrom=Starr-100901-8896-Dubautia+linearis-habitat-Kanaio+Natural+Area+Reserve-Maui+%2824419916404%29.jpg%0AStarr-100901-8896-Dubautia+linearis-habitat-Kanaio+Natural+Area+Reserve-Maui+%2824419916404%29.jpg#mw-category-media and https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings/Collection/Bavarian_State_Painting_Collections/18th_Century, haven't been visited for a while, you don't get all of the thumbnails. Many of the thumbnails fail with an error like:
Request from **** via cp3061 cp3061, Varnish XID 971825121
Upstream caches: cp3061 int
Error: 429, Too Many Requests at Wed, 21 Oct 2020 16:03:11 GMT
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | None | | T266155 Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails |
| Open | None | | T382108 Pre-generate thumbnails at gallery-default sizes |
Event Timeline
@ema can you confirm that int-front cache status in the response means that the 429 was emitted by Varnish? From one of those: https://github.com/wikimedia/puppet/blob/338c1bd746aedf5c7ea7303cf31c64f30b9fee93/modules/varnish/templates/upload-frontend.inc.vcl.erb#L205
Cannot confirm. int-front means that the response was generated by Varnish, but it does not mean that the 429 response comes from Varnish itself. For example, we generate custom error pages if the response from the origin has no body: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/wikimedia-frontend.vcl.erb#1051
I've confirmed that it is the PoolCounter throttle from Thumbor, by hitting it myself (that's my own ipv6 address):
On a Special:NewFiles load I get about 11 thumbnails rendered and 17 429s. I don't think the issue is the queue size of 50; rather, it's the timeout of 1 second (it used to be 8 seconds). I'll increase it to 4 seconds to see if that's enough for Special:NewFiles to get a much higher success rate.
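The per-IP throttle behavior described above can be sketched as a semaphore with a bounded wait. This is an illustrative model only: the class name, worker count, and timeout values here are hypothetical, not the actual PoolCounter configuration.

```python
import threading

class PerIpThrottle:
    """Sketch of a PoolCounter-style per-IP throttle: at most `workers`
    concurrent renders per IP; waiters give up after `timeout` seconds
    and get a 429."""

    def __init__(self, workers=4, timeout=4.0):
        self.timeout = timeout
        self.sem = threading.Semaphore(workers)

    def render(self, do_work):
        # Wait up to `timeout` for a free slot; a longer timeout lets more
        # queued thumbnail requests succeed instead of failing fast.
        if not self.sem.acquire(timeout=self.timeout):
            return 429
        try:
            return do_work()
        finally:
            self.sem.release()

throttle = PerIpThrottle(workers=1, timeout=0.05)
throttle.sem.acquire()               # simulate the one worker being busy
print(throttle.render(lambda: 200))  # → 429: timed out waiting for a slot
```

With a 1-second timeout, any burst larger than the worker pool fails almost immediately; raising the timeout lets queued requests wait out the burst instead.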
Change 636012 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Increase timeout of Thumbor per-ip throttling
Change 636012 merged by Ema:
[operations/puppet@production] Increase timeout of Thumbor per-ip throttling
Mentioned in SAL (#wikimedia-operations) [2020-10-23T13:04:01Z] <ema> rolling thumbor-instances restart to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/636012/ T266155
The new timeout is in place. It seems to help Special:NewFiles on Commons to a degree but still doesn't avoid 429s entirely. Unfortunately we can't go back to the pre-outage values that worked better, as the increased concurrency for that scenario contributed to the outage.
One workaround I can think of in the current setup would be to change the haproxy load-balancing algorithm from "first" to hashing by X-Forwarded-For value (which contains the client IP).
This means that each IP address will only be given one Thumbor process. The main drawback of this approach is that it's leaving available processing power unused for a given user if there is any. I.e. instead of being able to effectively use 4 concurrent processes at the moment (and locking up to 50 more...), a single user will only be able to leverage one at a time, but won't lock any. In fact the PoolCounter per-IP lock should never trigger anymore.
In order to ensure at least the same reliability for a given user as we have now, we'll need to at least quadruple the haproxy queue timeout (currently set at 10 seconds). This will essentially determine how old the oldest request can be, i.e. the maximum time a request can wait in line for the one Thumbor process it's going to be routed to.
I think it's worth trying as an experiment. It might introduce other problems that I can't foresee right now, but it's worth a shot. What's appealing is that it moves the per-IP queueing responsibility to haproxy instead of having it handled by the ill-fitting PoolCounter throttle that keeps many processes locked and unusable.
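Under that proposal, the relevant haproxy backend stanza might look roughly like this. This is a sketch only: the backend name, server addresses, and exact timeout are illustrative, not the production configuration.

```
backend thumbor
    # Hash on the client IP (carried in X-Forwarded-For) so each IP is
    # consistently routed to a single Thumbor process, instead of
    # "balance first", which fills processes in order.
    balance hdr(X-Forwarded-For)
    # Queue timeout raised ~4x from 10s, so a user's queued requests can
    # wait out a burst on their single assigned process.
    timeout queue 40s
    server thumbor1 127.0.0.1:8801 check
    server thumbor2 127.0.0.1:8802 check
```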
Change 636024 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Switch Thumbor haproxy load balancing to IP hash
The other drawback of that workaround I can think of is that it will re-introduce some drive-by casualties for "greedy" users. For example, if your IP address hashes to the same process as a user who's just requested 50 thumbnails on it, then tough luck, you'll have to wait for that user's thumbnails to be rendered before your first one is, even if there are other Thumbor processes sitting idle in the meantime.
So it won't be all positive for everyone; it's a "pick your poison" tradeoff, but maybe waiting is preferable to 429s. The question is whether users will hit 5xx errors from the 40s per-process haproxy queue timeout less often than they currently hit 429s, which we should be able to see on the existing dashboards.
I just noticed this also breaks https://commons.wikimedia.org/wiki/Special:MediaSearch if you sort it by "recency".
I haven't looked into the proposed workarounds/solutions. From a user's point of view I wouldn't mind slow-loading thumbnails; I just hate getting broken thumbnails.
Change 638109 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] swift: pass the 'X-Client-IP' header to thumbor
Change 638109 merged by Effie Mouzeli:
[operations/puppet@production] swift: pass the 'X-Client-IP' header to thumbor
As a workaround on pages like https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings/Old_European_art_missing_genre/Sweden where most thumbnails don't work:
wget -E -H -k -K -p https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings/Old_European_art_missing_genre/Sweden
(This can probably be improved by redirecting the output to /dev/null, etc.)
Change 636024 abandoned by Effie Mouzeli:
[operations/puppet@production] Switch Thumbor haproxy load balancing to IP hash
Reason:
not relevant any more
Change 883570 had a related patch set uploaded (by TheDJ; author: TheDJ):
[mediawiki/core@master] Lists of images should use lazy loading
The above patch should enable native lazy loading of images by the browser. This will cause fewer initial requests to the Thumbor server on those specific page types with a lot of images. Fewer requests means a lower average request rate, which should decrease the chance that people hit the rate limit.
This is not (yet) done for normal wiki pages, because we have fewer reported problems there, and because MobileFrontend uses JS for lazy loading and we don't entirely know how those two systems will interact (errors are unlikely, but perceived behavior might be strange). The pages converted here are much less critical on mobile.
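For reference, native lazy loading is a single attribute on the image tag. This is illustrative markup, not the exact HTML the patch emits:

```
<!-- The browser defers fetching this thumbnail until it nears the
     viewport, so off-screen gallery images don't hit Thumbor at all
     on the initial page load. -->
<img src="/thumb/120px-Example.jpg" loading="lazy"
     width="120" height="90" alt="Example">
```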
Test wiki created on Patch demo by TheDJ using patch(es) linked to this task:
https://patchdemo.wmflabs.org/wikis/45fe6c92a9/w
Related Community Wishlist Survey proposal: Make Special:Search on Commons show all requested thumbnails
Change 883570 merged by jenkins-bot:
[mediawiki/core@master] Lists of images should use lazy loading
I get this 429 error with fewer than 50 thumbnails, nearly every day. Is this part of what this issue is solving, too?
Example:
- https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/%C4%8Cern%C3%A1_Hora-pah%C3%BDl.png/153px-%C4%8Cern%C3%A1_Hora-pah%C3%BDl.png on https://commons.wikimedia.org/wiki/Category:National_flag_of_Montenegro
- https://upload.wikimedia.org/wikipedia/commons/5/5a/%C4%8Cern%C3%A1_Hora-pah%C3%BDl.png on https://commons.wikimedia.org/wiki/File:%C4%8Cern%C3%A1_Hora-pah%C3%BDl.png
Because I find such errors nearly every day, I think it's a frequent problem as well, and it should be solved with this patch too.
Thank you very much
@doctaxon There are basically two main causes for 429 errors as a "normal" user of the website, but both have the same meaning. The servers are saying: "You are asking me to do too much work in too short a timeframe; please go away and try again later."
In part you can blame this on bad actors who continuously try to hack and harass Wikipedia and other Wikimedia services, which has required the system administrators to put stricter limits in place to make sure the websites overall are able to stay up for everyone (damage limitation).
- One cause is this ticket. You are requesting something like a gallery (a page with a lot of thumbnails) for which many of the thumbnails have never been generated before. This forces the system to produce 40+ fresh thumbnails at once. Each thumbnail takes at least a second, and sometimes up to 10 seconds, to produce. Together that is too much, and the website doesn't handle this situation in the nicest of ways. Generally, though, if you later return to the same view of the page, you will see more and more of the thumbnails become available.
- Repeated failures. When thumbnails that are requested continuously and/or repeatedly fail to be generated and/or returned by the internal systems, the outer layer of the Wikimedia website will also send 429 errors. Basically: "we tried this a lot, it's not working, we can't help you any further with this right now." So it is an error on top of another error (the root cause). This can happen for a variety of reasons, but you can identify this case because the exact same issue for the same thumbnail URL generally still occurs even the next day. It can have dozens and dozens of causes and really needs to be judged on a case-by-case basis.
@TheDJ thanks for your comment. If these 429 errors "need to be judged on a case-by-case basis", does that mean it's necessary to open a bug report for every one of these errors? Thirty hours later, the error still exists. And I think the thumbnail on https://commons.wikimedia.org/wiki/File:%C4%8Cern%C3%A1_Hora-pah%C3%BDl.png must already have been generated at least once, when the file was uploaded.
Yes, separate issues should be filed for those, they have nothing to do with this ticket.
In this particular case, the error is ImageMagickException: Failed to convert image convert: IDAT: invalid distance too far back. This means a back-reference in the compressed image data points outside the valid range; in other words, the file is broken. Apparently this file used to work until around 2016, but after that the PNG library became stricter (to deal with potential security issues).
Generally, downloading it, processing it with a tool like pngcrush, and re-uploading it should fix that, which I have done.
Change 975088 had a related patch set uploaded (by TheDJ; author: TheDJ):
[mediawiki/core@master] Fix lazy loading for ImageListPager
Change 975088 merged by jenkins-bot:
[mediawiki/core@master] Fix lazy loading for ImageListPager and File history
I understand that there are many possible reasons why a thumbnail is not displayed. But as a user you get either a 429 or 500 response, or a broken-image icon (if the 429 or 500 happens for an image embedded in a page: a gallery, a category, or MediaViewer). On the server, however, the reason is known: a PNG vulnerability or something else.
It would make for a much better UX if, instead of the generic 429 or 500 response, the actual reason were visible to the user. That would allow easier error reports, or make error reports unnecessary altogether if the user can do something themselves (such as fixing the vulnerable PNG). In the case of pages (gallery, category), the server could send a replacement image containing the text of the actual error.
I agree. Anyway, these are different causes. I've created T353950: [Tracking] Thumbnail 429 rate limiting on failed requests of complex or broken media files to keep track of these various issues that have to do with the problem of large and/or complex or repeating failures of individual files.
Just trying to think up solutions: if Thumbor gives a 429, could Varnish instead send an (uncached) redirect to the original file (if it's not huge), or to some standard size (like the size shown on the image description page, which is likely to already be rendered)?
It seems like graceful degradation, where if we don't have a thumb we just return a larger one and let the browser shrink it, is much preferable to a hard 429 failure. At least as long as the file isn't one of those 2 GB NASA images.
Loading the original file or the 800px thumb would probably be non-ideal, particularly for users with slow internet. But perhaps having the 120px and 240px thumbnails be permanently cached for all images upon their upload could be beneficial. Images sized at 120px (240px for 2x pixel ratio screens) are used in many places throughout the site (categories, default galleries, Special:ListFiles, file history, etc). And I imagine that the community would be willing to use 120px images in more places if it meant more reliable loading (e.g. The template used to render the Bavarian paintings collection example from the ticket description is Template:Wikidata list. It currently has a default image size of 128px; I don't imagine a proposal to switch that to 120px would be too controversial).
I'm not exactly sure how this might be implemented. An easy method might be to have Varnish cache those thumbnails indefinitely, and then explicitly purge them whenever a file is re-uploaded. But that comes with the risk of serving very stale objects. It probably makes more sense to implement it in a similar way to TimedMediaHandler and handle it all inside MediaWiki, with a maintenance job that generates these two thumbnails right after every new upload.
> Loading the original file or the 800px thumb would probably be non-ideal, particularly for users with slow internet. But perhaps having the 120px and 240px thumbnails be permanently cached for all images upon their upload could be beneficial.
Delivering a thumb of a different size than the one requested should in any case be limited by a cut-off. Sending 120/128/240px instead of a thumb of up to 480px seems reasonable to me. But if a thumb of 1080/2160/4320/7680px is requested and 240px or less is delivered, that would be bad and unexpected server behavior.
I think that if we did deliver the wrong thumb size, it would only make sense to deliver one larger than the requested size. Browsers will scale it client-side, and it's always better to downscale than to upscale. The performance degradation is probably not that bad: the larger thumbs would only be shown to the first person to view the page. A small slowdown in image viewing for a single user seems acceptable, and better than never loading the image at all.
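The "serve the next size up, within limits" idea from the last few comments could be sketched like this. The list of pre-rendered sizes and the cut-off factor are hypothetical illustrations, not Thumbor's actual configuration:

```python
# Hypothetical set of widths assumed to be pre-rendered and cached.
PRERENDERED = [120, 240, 480, 960]

def fallback_width(requested, max_factor=4.0):
    """Pick the smallest pre-rendered width >= the requested one, so the
    browser downscales (never upscales). Give up (None -> plain 429) if
    every candidate is absurdly larger than what was asked for."""
    for width in sorted(PRERENDERED):
        if requested <= width <= requested * max_factor:
            return width
    return None

print(fallback_width(153))   # → 240: the browser downscales 240px to 153px
print(fallback_width(1080))  # → None: nothing large enough is cached
```

The `max_factor` cut-off encodes the concern above: substituting 240px for a 153px request is fine, but substituting anything for a 7680px request should fail normally rather than deliver something wildly undersized or oversized.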
I think the biggest problem is that it is very unclear how feasible this would be to implement on varnish side.
Maybe an alternative solution would be to have the client side detect the failure and retry, although in the case of overload that might just make things worse.
I have basically implemented what you suggested here. With slight modifications. It's 50% of images now already!
I am not sure if this is related to the changes made here.
The links to thumbnails on file pages do not point to the resolution that the link text suggests.
The link for 320px width links to the thumbnail with 330px width.
The links for 640px and 800px width link to the thumbnail with 960px width.
