
Loading Tiff picture thumbnails results in 429 too many request while for every other file formats thumbnails are working
Closed, Invalid (Public)

Description

Hello everyone,

I'm currently working on a student project at my university, where we are building the application www.openartbrowser.org, which presents open data about art from Wikidata and Wikipedia. The artworks we use are often pictures in JPG format, and we can use their thumbnails easily. For example, the Mona Lisa thumbnail is https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg/512px-Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg and our original URL is https://upload.wikimedia.org/wikipedia/commons/e/ec/Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg so we just add two strings to the path (/thumb/ and 512px-{image-title}) and it works.

However, we also have a lot of TIFF images (around 13,000), which are not supported by most browsers and are also quite large. To get their thumbnail links we tried a similar method: https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Scenografiskiss_av_Christian_Jansson_och_Fredrik_Wilhelm_Scholander_-_SMV_-_DTM_1939-0541.tif/lossy-page1-399px-Scenografiskiss_av_Christian_Jansson_och_Fredrik_Wilhelm_Scholander_-_SMV_-_DTM_1939-0541.tif.jpg is the thumbnail for https://upload.wikimedia.org/wikipedia/commons/f/f9/Scenografiskiss_av_Christian_Jansson_och_Fredrik_Wilhelm_Scholander_-_SMV_-_DTM_1939-0541.tif
For TIFF files, the lossy-page1- prefix has to be added and .jpg appended to the end; the rest is the same as for the JPG thumbnails.
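
The URL rewriting described above can be sketched in Python as follows. This is only an illustration of the pattern in this task, not an official API: the function name thumb_url is hypothetical, and the code assumes the original URL ends in /<hash>/<hash>/<filename> as in the two examples and that only .tif/.tiff files need the lossy-page1- prefix.

```python
from urllib.parse import urlsplit, urlunsplit


def thumb_url(original_url: str, width: int) -> str:
    """Build a thumbnail URL from an original upload.wikimedia.org file URL.

    Hypothetical helper; it only mirrors the two URL patterns quoted above.
    """
    parts = urlsplit(original_url)
    # The path is assumed to look like /wikipedia/commons/<h>/<hh>/<filename>.
    prefix, hash1, hash2, filename = parts.path.rsplit("/", 3)
    if filename.lower().endswith((".tif", ".tiff")):
        # TIFF thumbnails are served as JPEG renderings of page 1.
        thumb_name = f"lossy-page1-{width}px-{filename}.jpg"
    else:
        thumb_name = f"{width}px-{filename}"
    path = f"{prefix}/thumb/{hash1}/{hash2}/{filename}/{thumb_name}"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

Calling thumb_url with the Mona Lisa URL and a width of 512, or with the TIFF URL and a width of 399, reproduces the two thumbnail links given above.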

The links work as well, but when we load several of these thumbnails at once we get a 429 Too Many Requests error. This suggests that TIFF thumbnails may not be cached the way JPG thumbnails are, or that generating them takes a very long time.

So my question is: is there a best-practice method for getting thumbnails of TIFF images without hitting the 429 HTTP error?

Event Timeline

Tilomi renamed this task from Loading Tiff picture thumbnails results in 429 too many request while for every other file formats thumbnails with it works to Loading Tiff picture thumbnails results in 429 too many request while for every other file formats thumbnails are working. Jun 19 2020, 10:22 AM

Removing unrelated project tags as this does not seem to be about storing datasets on Commons or the ApiFeatureUsage extension. This seems to be related to creating thumbnails, hence adding Thumbor here (which is a bit closer, but might also turn out wrong :P).

@Tilomi: Which exact thumbnail sizes are requested by your program, based on which criteria? (Your two links once mention 399px and once 512px.) Does that mean that the thumbnail size can vary with every requested thumbnail?

How many requests are sent in which timeframe when you load thumbnails, and how are these requests sent exactly? Any API example call to share?

@Aklapper Thanks for the fast answer.

The thumbnails we use are either small (256px) or medium (512px); I copied the example for the TIFF file from the Wikimedia site. We load 20 of the small thumbnails in parallel, which works for the JPGs we show. I think the problem may be the size, since so far we have only used 512px for the TIFFs. I'll check whether this is our problem.

The API call is indeed just the thumbnail URL I stated above, e.g. for the Mona Lisa: https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg/512px-Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg
But there are 20 links like this; maybe the problem occurs when the pixel size is too big.

Are there any standard pixel sizes that are always cached by Wikimedia Commons? We would be especially interested in the thumbnail sizes of the preview image at https://commons.wikimedia.org/wiki/File:Scenografi_av_Christian_Jansson_-_SMV_-_DTM_1939-0648.tif . We weren't able to find an API that provides the sizes shown there. Maybe these are the sizes we are looking for, because the pictures are already cached at them?

Based on your description, it sounds like you are probably hitting the concurrent thumbnail generation limits. When a thumbnail at a particular size for a particular file is requested for the first time, it is not cached and has to be generated. This can take anywhere from a few hundred milliseconds to a minute, and it's expected that TIF files can take longer to thumbnail. You're most likely hitting the per-IP ratelimit, which in the current configuration will allow 4 new thumbnails to be generated at the request of a specific IP address at a time. Up to 50 additional requests from that IP address will wait in a queue for up to 1 second before they are processed or dropped (returning a 429).
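
One way a client could stay under the limit described above is to cap its own concurrency and retry after a 429. The following is a minimal sketch, not a recommended or official client: fetch_all and its fetch parameter are hypothetical names, fetch is assumed to be any coroutine returning a (status, body) pair, and the value 4 simply mirrors the per-IP figure quoted above, which may change on the server side.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, Tuple

# Mirrors the per-IP limit quoted above; an assumption that may become outdated.
MAX_CONCURRENT = 4

Fetch = Callable[[str], Awaitable[Tuple[int, bytes]]]


async def fetch_all(
    urls: Iterable[str],
    fetch: Fetch,
    max_concurrent: int = MAX_CONCURRENT,
    retries: int = 3,
    backoff: float = 1.0,
) -> List[Tuple[int, bytes]]:
    """Fetch all URLs with at most max_concurrent requests in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(url: str) -> Tuple[int, bytes]:
        async with sem:
            for attempt in range(retries + 1):
                status, body = await fetch(url)
                if status != 429:
                    return status, body
                if attempt < retries:
                    # Exponential backoff; the slot stays occupied while waiting,
                    # which keeps the overall request rate conservative.
                    await asyncio.sleep(backoff * 2 ** attempt)
            return status, body  # still 429 after all retries

    # gather preserves the order of the input URLs.
    return await asyncio.gather(*(one(u) for u in urls))
```

The fetch callable is left abstract so that any HTTP client can be plugged in; results come back in the same order as the input URLs.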

The JPG thumbnails likely worked because the thumbnails either were already cached (and thus not subject to the same ratelimits) or could be thumbnailed quickly enough to not hit the ratelimit. The Last-Modified header on the Mona Lisa thumbnail shows Tue, 08 Oct 2019 15:26:41 GMT, so it was previously cached. The Last-Modified header for the drawing thumbnail is Fri, 19 Jun 2020 09:38:12 GMT, shortly before you filed this task. That indicates it was newly generated.

There are two main ways to solve this: trying to only request cached thumbnails and slowing down your request rate.
There's no way to guarantee that a particular thumbnail will have been previously requested and cached, but the size that is most likely to already be generated is the one displayed by default on the file description page. That image will be the original image or the largest image that can fit in an 800x600 box, whichever is smaller. Requesting an image at that size will have a higher cache-hit rate, but requires you to retrieve the original image dimensions and calculate the target size. 220px is the default thumbnail image width, but it's a bit less likely to already be cached.
The other option is to slow your request rate down so that you don't have more than 4-5 concurrent requests.
Thumbnailing time can be affected by output size, but it is affected by several factors and generally does not scale smoothly with output size.
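
The 800x600 calculation mentioned above can be sketched as follows. The function name default_preview_width is hypothetical, the original dimensions are assumed to be known already (for example from the file's metadata), and the rounding (floor) is an assumption: if MediaWiki rounds differently, the resulting width could be off by a pixel and miss the cache.

```python
def default_preview_width(width: int, height: int,
                          box_w: int = 800, box_h: int = 600) -> int:
    """Width of the largest image that fits in a box_w x box_h box.

    Returns the original width if the image already fits.
    Rounding down is an assumption, not confirmed MediaWiki behaviour.
    """
    if width <= box_w and height <= box_h:
        return width
    # Compare aspect ratios with integer arithmetic to avoid float error.
    if width * box_h >= height * box_w:
        # Image is at least as wide (relative to the box): width is the limit.
        return box_w
    # Image is relatively taller: height is the limit, scale width to match.
    return max(1, width * box_h // height)
```

For instance, a 4000x3000 original gives a width of 800, a 3000x4000 original gives 450, and a 640x480 original is returned unchanged at 640.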

If these suggestions don't help, please provide a more thorough description of how your application makes thumbnail requests. The HTTP response headers that accompany a 429 and any information provided in the error message would be helpful as well.


The per-IP throttle settings were previously higher, but got lowered in T252426. There may be a middle ground between the two values worth exploring, but that's outside of my area.

@AntiCompositeNumber Thanks for the answer, this is really useful information for us. From your description, I would say the concurrent request limit is what gets us the 429 error, since we load more than 4 pictures at once.

We also thought about downloading the thumbnails we need and hosting them on our own server, since we can build the links, but I'd prefer one of the solutions you gave us.

@Tilomi: Anything left to do in this task, or can this ticket be closed as invalid (as there were no changes in code on the Wikimedia servers)?

You can close it, thanks again for the answer!