Measure request frequency of thumbnail sizes
Closed, Resolved, Public

Description

In T408715 I compiled a list of canonical thumbnail sizes in use. The next question is: how frequently are these sizes used?

Using a convenience sample of a 24-hour period (2025-10-24) and the wmf.webrequest spark table (i.e. requests reaching the CDN):

  • the 8 wgThumbnailSteps account for 86% of thumbnail requests (and are the 8 most-requested sizes)
  • the 30 canonical sizes listed in T408715 account for 92% of thumbnail requests
  • the 30 most-requested sizes account for 94% of thumbnail requests
  • the 4 pregenerated thumbnail sizes account for 2% of thumbnail requests
  • the commonest non-canonical size is 480 (ranked 9), accounting for 1.4% of requests (434/second)
  • Over 8000 different thumbnail sizes were requested, from 1 to 121200
  • 316 different thumbnail sizes were requested at least once per second (i.e. > 86400 per day)
  • Extracting thumbnail size from uri_path is not entirely straightforward

Taking our commonest 30 sizes, in rank order:

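| size | count | rps | % | cumulative % | canonical |
| --- | --- | --- | --- | --- | --- |
| 250 | 555502137 | 6429 | 20.2 | 20.2 | Y |
| 60 | 397775830 | 4604 | 14.4 | 34.6 | Y |
| 500 | 374795875 | 4338 | 13.6 | 48.2 | Y |
| 40 | 362872476 | 4120 | 13.2 | 61.4 | Y |
| 120 | 303617631 | 3514 | 11.0 | 72.4 | Y |
| 330 | 158260177 | 1832 | 5.7 | 78.1 | Y |
| 20 | 154562318 | 1789 | 5.6 | 83.8 | Y |
| 960 | 71684322 | 830 | 2.6 | 86.4 | Y |
| 480 | 37502885 | 434 | 1.4 | 87.8 | NO |
| 1200 | 24267826 | 281 | 0.9 | 88.7 | NO |
| 70 | 22379621 | 259 | 0.8 | 89.4 | Y |
| 1280 | 21968978 | 254 | 0.8 | 90.2 | Y |
| 640 | 15191911 | 176 | 0.6 | 90.8 | Y |
| 800 | 14293204 | 165 | 0.5 | 91.3 | Y |
| 1024 | 9268869 | 107 | 0.3 | 91.6 | Y |
| 600 | 6392998 | 74 | 0.2 | 91.9 | NO |
| 200 | 6321587 | 73 | 0.2 | 92.1 | Y |
| 160 | 6259469 | 72 | 0.2 | 92.3 | Y |
| 1920 | 5913263 | 68 | 0.2 | 92.5 | Y |
| 320 | 5879692 | 68 | 0.2 | 92.7 | Y |
| 150 | 5567819 | 64 | 0.2 | 92.9 | Y |
| 2560 | 5556343 | 64 | 0.2 | 93.1 | Y |
| 512 | 5450688 | 63 | 0.2 | 93.3 | NO |
| 300 | 5441267 | 63 | 0.2 | 93.5 | Y |
| 375 | 5184308 | 60 | 0.2 | 93.7 | NO |
| 32 | 4838561 | 56 | 0.2 | 93.9 | NO |
| 80 | 3979487 | 46 | 0.1 | 94.0 | Y |
| 24 | 3831704 | 44 | 0.1 | 94.2 | NO |
| 100 | 3805802 | 44 | 0.1 | 94.3 | NO |
| 400 | 3600631 | 42 | 0.1 | 94.5 | Y |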

After these top 30, we've accounted for 2601967679 of 2754743467 requests, leaving 5.5% of requests spread amongst the remaining sizes (a combined 1768 requests/second).
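As a quick check of the arithmetic above, here is a minimal Python sketch. The two totals are taken from the text; the variable names are only illustrative, and the sample is assumed to cover exactly 24 hours (86400 seconds).

```python
# Totals quoted in the task description for the 24-hour sample (2025-10-24).
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

total_requests = 2754743467  # all thumbnail requests in the sample
top30_requests = 2601967679  # requests for the 30 most-requested sizes

# Requests that fall outside the top 30 sizes.
remaining = total_requests - top30_requests
remaining_share = 100 * remaining / total_requests
remaining_rps = remaining / SECONDS_PER_DAY

print(f"remaining requests: {remaining}")
print(f"remaining share: {remaining_share:.1f}%")
print(f"remaining rate: {remaining_rps:.0f} requests/second")
```

The same division by 86400 is how a daily count in the table converts to its rps column.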

Taking our list of 30 canonical sizes in size order:

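| size | count | % | rank |
| --- | --- | --- | --- |
| 20 | 154562318 | 5.6 | 7 |
| 40 | 362872476 | 13.2 | 4 |
| 60 | 397775830 | 14.4 | 2 |
| 70 | 22379621 | 0.8 | 11 |
| 80 | 3979487 | 0.1 | 27 |
| 120 | 303617631 | 11.0 | 5 |
| 150 | 5567819 | 0.2 | 21 |
| 160 | 6259469 | 0.2 | 18 |
| 180 | 3094778 | 0.1 | 33 |
| 200 | 6321587 | 0.2 | 17 |
| 220 | 2982701 | 0.1 | 34 |
| 240 | 2963146 | 0.1 | 35 |
| 250 | 555502137 | 20.2 | 1 |
| 260 | 941860 | 0.03 | 71 |
| 300 | 5441267 | 0.2 | 24 |
| 320 | 5879692 | 0.2 | 20 |
| 330 | 158260177 | 5.7 | 6 |
| 360 | 1973411 | 0.07 | 47 |
| 400 | 3600631 | 0.1 | 30 |
| 450 | 750619 | 0.03 | 79 |
| 500 | 374795875 | 13.6 | 3 |
| 640 | 15191911 | 0.6 | 13 |
| 800 | 14293204 | 0.5 | 14 |
| 960 | 71684322 | 2.6 | 8 |
| 1024 | 9268869 | 0.3 | 15 |
| 1280 | 21968978 | 0.8 | 12 |
| 1920 | 5913263 | 0.2 | 19 |
| 2560 | 5556343 | 0.2 | 22 |
| 2880 | 685425 | 0.02 | 83 |
| 3840 | 3695 | <0.01 | 1285 |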

Event Timeline

A couple of notes on extracting thumbnail size from uri_path - a previous approach used

SELECT split(split(uri_path, '/')[7], 'px-')[0] as thumbsize

but this has a number of shortcomings, particularly that the array index of 7 is fragile, and incorrect for e.g. /archive/ thumbs. So I refined it somewhat to take the final path element, and then split that at px- and then split the result on - and take the final element (thus coping with prefix-NNNpx like you get with translated SVG files):

select slice(split(split(slice(split(uri_path, '/'),-1,1)[0], 'px-')[0],'-'),-1,1)[0] as thumbsize, count(*) as hits from wmf.webrequest where webrequest_source = 'upload' and year = 2025 and month = 10 and day = 24 and hour = 10 and http_status = '200' and uri_path like '/wikipedia/%/thumb/%' group by thumbsize order by hits desc limit 10;
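To make the difference between the two split-based approaches concrete, here is a rough Python sketch of the same string operations. It mirrors the SQL logic only; the function names are made up for illustration, and the /archive/ path is a hypothetical example of the shape such URLs are assumed to take, not one taken from the logs.

```python
def thumbsize_fixed_index(uri_path: str) -> str:
    """Previous approach: assume the thumbnail name is path element 7."""
    return uri_path.split("/")[7].split("px-")[0]


def thumbsize_last_element(uri_path: str) -> str:
    """Refined approach: final path element, cut at 'px-', then drop any 'prefix-'."""
    last = uri_path.split("/")[-1]    # final path element
    before_px = last.split("px-")[0]  # e.g. "480" or "langde-330"
    return before_px.split("-")[-1]   # keep only the part after the last "-"


plain = "/wikipedia/commons/thumb/8/82/Telegram_logo.svg/480px-Telegram_logo.svg.png"
translated = "/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/langde-330px-Flag_of_France.svg.png"
# Hypothetical archive-style path (assumption): one extra path element shifts the index.
archived = "/wikipedia/commons/thumb/archive/8/82/20200101000000!Telegram_logo.svg/480px-Telegram_logo.svg.png"

for path in (plain, translated, archived):
    print(thumbsize_fixed_index(path), "|", thumbsize_last_element(path))
```

For the plain path both functions agree; for the translated SVG and the archive-style path only the refined version yields a bare number.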

This still left a very few stragglers (15, mostly SVG files with URL-encodings in their names), which is likely good enough, but we can do better with a simple regexp:

select regexp_extract( slice(split(uri_path, '/'),-1,1)[0], '([0-9]+)px') as thumbsize, count(*) as hits from wmf.webrequest where webrequest_source = 'upload' and year = 2025 and month = 10 and day = 24 and http_status = '200' and uri_path like '/wikipedia/%/thumb/%' group by thumbsize order by hits desc;

This produces the same answers (modulo the 15 errors), is clearer, and only takes ~10% longer to run. Finally, of course, we can just do the whole operation with a single regexp: match the thumbsize as before, and then require that it is followed only by non-/ characters up to the end of the path:

select regexp_extract(uri_path, '([0-9]+)px[^/]+$') as thumbsize, count(*) as hits from wmf.webrequest where webrequest_source = 'upload' and year = 2025 and month = 10 and day = 24 and http_status = '200' and uri_path like '/wikipedia/%/thumb/%' group by thumbsize order by hits desc;
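The single-regexp query above can be tried out locally with a small Python sketch. It uses the same pattern with Python's `re` module rather than Spark's `regexp_extract`, so treat it as an approximation of the query's behaviour; the function name and the "100px_test" filename are hypothetical, and returning an empty string on no match is chosen to mimic what `regexp_extract` does.

```python
import re

# Same pattern as in the query: digits, "px", then only non-"/" characters to the end.
THUMBSIZE_RE = re.compile(r"([0-9]+)px[^/]+$")


def thumbsize_by_regex(uri_path: str) -> str:
    """Return the thumbnail size as a string, or "" when the path does not match."""
    match = THUMBSIZE_RE.search(uri_path)
    return match.group(1) if match else ""


paths = [
    "/wikipedia/commons/thumb/8/82/Telegram_logo.svg/480px-Telegram_logo.svg.png",
    "/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/langde-330px-Flag_of_France.svg.png",
    # Hypothetical file whose own name contains "px" (assumption, for illustration).
    "/wikipedia/commons/thumb/a/ab/100px_test.svg/250px-100px_test.svg.png",
    # Not a thumbnail path at all.
    "/wikipedia/commons/8/82/Telegram_logo.svg",
]

for path in paths:
    print(repr(thumbsize_by_regex(path)))
```

Because the match must run to the end of the path without crossing a `/`, a "px" inside the original filename's directory element cannot be picked up by mistake.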

At the scale we are talking about, they won't make any dent in the stats.

It's about a 0.5% difference in the count for 250, which isn't a vast amount, but it's not nothing. And the ranking of the top 30 by hits changes (at least 200 and 600 swap places; there are other shifts too, albeit not in the top 10). So I think it was worth spending a little time working on improving the query.

So 480 is quite common, but hasn't shown up in our search. I thought it might be instructive to check referer:

select referer, count(*) as hits from wmf.webrequest where webrequest_source='upload' and year=2025 and month=10 and day=24 and hour=10 and http_status='200' and uri_path like '/wikipedia/%/thumb/%' and regexp_extract(uri_path, '([0-9]+)px[^/]+$')='480' group by referer order by hits desc LIMIT 10;

Gives us

(compared to a total of 1,698,043) Which makes me think that it's something on-wiki that is producing these 480px links... Any ideas? The top 10 paths by hit-count are all SVG files (which may be a red herring of course).

Very likely a popular gadget/CSS hardcoding the URL. I'll investigate once I get my hands on a PC.

Thanks! I've spent a fair chunk of time searching and have come up with nothing. My next stop is likely #no-stupid-questions...

I have solved the easy one, though: ecosia. If you image search on there (e.g. https://www.ecosia.org/images?q=cattle) and find the wikipedia hit (about fourth row down), it's hard-coding the link to https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Cow_(Fleckvieh_breed)_Oeschinensee_Slaunger_2009-07-07.jpg/480px-Cow_(Fleckvieh_breed)_Oeschinensee_Slaunger_2009-07-07.jpg (though that does raise the question of where/how it's getting that from).

I haven't found anything in gadgets, etc. https://commons.wikimedia.org/w/index.php?title=Special:Search&limit=500&offset=0&ns8=1&search=%22480px-%22 (same in enwiki, etc.) I found a lot of weird stuff in meta but still all unrelated: https://meta.wikimedia.org/w/index.php?search=%22%2F480px-%22&title=Special%3ASearch&profile=advanced&fulltext=1&ns8=1

What I found extremely interesting though is that it's not one image. The top path gets only 2K hits in an hour:

spark-sql (default)> select uri_path, count(*) as hits from wmf.webrequest where webrequest_source='upload' and year=2025 and month=10 and day=24 and hour=10 and http_status='200' and uri_path like '/wikipedia/%/thumb/%' and regexp_extract(uri_path, '([0-9]+)px[^/]+$')='480' group by uri_path order by hits desc LIMIT 10;
uri_path	hits
/wikipedia/commons/thumb/8/82/Telegram_logo.svg/480px-Telegram_logo.svg.png	2430
/wikipedia/commons/thumb/2/2f/Google_2015_logo.svg/480px-Google_2015_logo.svg.png	2279
/wikipedia/commons/thumb/e/e4/Status_iucn3.1_LC_ru.svg/480px-Status_iucn3.1_LC_ru.svg.png	2139
/wikipedia/commons/thumb/9/96/Flag_of_the_United_States_%28DDD-F-416E_specifications%29.svg/480px-Flag_of_the_United_States_%28DDD-F-416E_specifications%29.svg.png	1946
/wikipedia/commons/thumb/a/a5/Flag_of_the_United_Kingdom_%281-2%29.svg/480px-Flag_of_the_United_Kingdom_%281-2%29.svg.png	1543
/wikipedia/commons/thumb/f/f3/Flag_of_Russia.svg/480px-Flag_of_Russia.svg.png	1493
/wikipedia/commons/thumb/3/32/Googleplex_HQ_%28cropped%29.jpg/480px-Googleplex_HQ_%28cropped%29.jpg	1485
/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/480px-Flag_of_France.svg.png	1432
/wikipedia/commons/thumb/2/20/YouTube_2024.svg/480px-YouTube_2024.svg.png	1405

If it were a gadget or something, it would have been one image being hammered.

Turnilo for the Telegram logo (the first hit in what @Ladsgroup posted) shows Google Proxy as the ISP in a staggering 85% of cases. However, it sends those requests with no referrer.

@Ladsgroup, just to be consistent with what @MatthewVernon reported above, should your query have an AND for the referer to match %.wikipedia.org?

spark-sql (default)> select uri_path, count(*) as hits from wmf.webrequest where webrequest_source='upload' and year=2025 and month=10 and day=24 and hour=10 and http_status='200' and uri_path like '/wikipedia/%/thumb/%' and regexp_extract(uri_path, '([0-9]+)px[^/]+$')='480' AND referer like '%.wikipedia.org' group by uri_path order by hits desc LIMIT 10;
uri_path	hits
/wikipedia/commons/thumb/7/7e/Map_of_Fukuoka_Prefecture_Ja.svg/480px-Map_of_Fukuoka_Prefecture_Ja.svg.png	1
/wikipedia/commons/thumb/a/a5/Infobox_info_icon2.svg/480px-Infobox_info_icon2.svg.png	1
/wikipedia/commons/thumb/d/d5/Lynx-wikipedia.png/480px-Lynx-wikipedia.png	1
/wikipedia/commons/thumb/8/81/Haiku_R1_Beta_3_desktop_screenshot.png/480px-Haiku_R1_Beta_3_desktop_screenshot.png	1
/wikipedia/commons/thumb/7/75/Small_Pencil_Icon.svg/480px-Small_Pencil_Icon.svg.png	1
/wikipedia/commons/thumb/5/5d/Dooble_Showing_Dutch_Wikipedia.png/480px-Dooble_Showing_Dutch_Wikipedia.png	1
/wikipedia/commons/thumb/e/eb/Emoji_u1f33f.svg/480px-Emoji_u1f33f.svg.png	1
/wikipedia/commons/thumb/f/fc/MEPIS_logo.svg/480px-MEPIS_logo.svg.png	1
/wikipedia/commons/thumb/6/62/W3m-wikipedia.png/480px-W3m-wikipedia.png	1
/wikipedia/commons/thumb/1/10/TDA1%2C_ptc_catalyst.svg/480px-TDA1%2C_ptc_catalyst.svg.png	1
Time taken: 56.367 seconds, Fetched 10 row(s)
spark-sql (default)>

The query was wrong: the LIKE pattern should have an extra % at the end. Let me try again.

spark-sql (default)> select uri_path, count(*) as hits from wmf.webrequest where webrequest_source='upload' and year=2025 and month=10 and day=24 and hour=10 and http_status='200' and uri_path like '/wikipedia/%/thumb/%' and regexp_extract(uri_path, '([0-9]+)px[^/]+$')='480' AND referer like '%.wikipedia.org%' group by uri_path order by hits desc LIMIT 10;
uri_path	hits
/wikipedia/commons/thumb/e/e4/Status_iucn3.1_LC_ru.svg/480px-Status_iucn3.1_LC_ru.svg.png	2106
/wikipedia/commons/thumb/9/96/Flag_of_the_United_States_%28DDD-F-416E_specifications%29.svg/480px-Flag_of_the_United_States_%28DDD-F-416E_specifications%29.svg.png	1939
/wikipedia/commons/thumb/a/a5/Flag_of_the_United_Kingdom_%281-2%29.svg/480px-Flag_of_the_United_Kingdom_%281-2%29.svg.png	1532
/wikipedia/commons/thumb/f/f3/Flag_of_Russia.svg/480px-Flag_of_Russia.svg.png	1482
/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/480px-Flag_of_France.svg.png	1400
/wikipedia/commons/thumb/e/e2/Flag_of_the_United_States_%28Pantone%29.svg/480px-Flag_of_the_United_States_%28Pantone%29.svg.png	1297
/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/480px-Flag_of_the_People%27s_Republic_of_China.svg.png	1285
/wikipedia/commons/thumb/b/ba/Flag_of_Germany.svg/480px-Flag_of_Germany.svg.png	1272
/wikipedia/commons/thumb/e/e2/White_House_ballroom_plan%2C_October_2025.svg/480px-White_House_ballroom_plan%2C_October_2025.svg.png	1254
/wikipedia/commons/thumb/a/a9/Flag_of_the_Soviet_Union.svg/480px-Flag_of_the_Soviet_Union.svg.png	1218
Time taken: 36.124 seconds, Fetched 10 row(s)
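The difference between the two LIKE patterns above comes down to anchoring: LIKE has to match the whole string, so '%.wikipedia.org' only matches values that end in ".wikipedia.org". A minimal Python sketch of that behaviour follows. The `sql_like` helper is a simplified stand-in written for this illustration (it handles % and _ only, with no escape character), and the referer values are assumed examples of full page URLs.

```python
import re


def sql_like(value: str, pattern: str) -> bool:
    """Simplified SQL LIKE: % matches any run of characters, _ matches one character."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    # LIKE must match the entire value, hence fullmatch rather than search.
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


# Assumed referer values: full URLs, so something always follows the hostname.
referers = [
    "https://de.wikipedia.org/",
    "https://de.wikipedia.org/wiki/Frankreich",
]

for referer in referers:
    print(
        referer,
        sql_like(referer, "%.wikipedia.org"),
        sql_like(referer, "%.wikipedia.org%"),
    )
```

With full URLs as referers, only the pattern with the trailing % matches, which fits the first query returning a handful of single hits and the second returning thousands.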

I picked a random path that was hit, looked up the IP, and then looked at the requests from that same IP just before and after. I picked an IPv6 address to reduce the chance of overlap.

The only request to the actual wikis at that time was to a single page, followed by requests to the REST API endpoints linked from that page, for example https://de.wikipedia.org/api/rest_v1/page/summary/Frankreich, and then the loading of the image at 480px (https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/480px-Flag_of_France.svg.png). The problem is that the endpoint is not providing that size (it provides https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/langde-330px-Flag_of_France.svg.png). Maybe caches expired? I checked for this month and we still get a lot of those (top hits are mostly flags now):

spark-sql (default)> select uri_path, count(*) as hits from wmf.webrequest where webrequest_source='upload' and year=2025 and month=11 and day=24 and hour=10 and http_status='200' and uri_path like '/wikipedia/%/thumb/%' and regexp_extract(uri_path, '([0-9]+)px[^/]+$')='480' AND referer like '%.wikipedia.org%' group by uri_path order by hits desc LIMIT 10;
uri_path	hits
/wikipedia/commons/thumb/9/96/Flag_of_the_United_States_%28DDD-F-416E_specifications%29.svg/480px-Flag_of_the_United_States_%28DDD-F-416E_specifications%29.svg.png	2276
/wikipedia/commons/thumb/e/e4/Status_iucn3.1_LC_ru.svg/480px-Status_iucn3.1_LC_ru.svg.png	1970
/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/480px-Flag_of_France.svg.png	1710
/wikipedia/commons/thumb/b/ba/Flag_of_Germany.svg/480px-Flag_of_Germany.svg.png	1685
/wikipedia/commons/thumb/a/a5/Flag_of_the_United_Kingdom_%281-2%29.svg/480px-Flag_of_the_United_Kingdom_%281-2%29.svg.png	1666
/wikipedia/commons/thumb/f/f3/Flag_of_Russia.svg/480px-Flag_of_Russia.svg.png	1635
/wikipedia/commons/thumb/a/a9/Flag_of_the_Soviet_Union.svg/480px-Flag_of_the_Soviet_Union.svg.png	1466
/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/480px-Flag_of_the_People%27s_Republic_of_China.svg.png	1436
/wikipedia/commons/thumb/c/cf/GraphyArchy_-_Wikipedia_00185-cropped.jpg/480px-GraphyArchy_-_Wikipedia_00185-cropped.jpg	1369
/wikipedia/commons/thumb/e/e2/Flag_of_the_United_States_%28Pantone%29.svg/480px-Flag_of_the_United_States_%28Pantone%29.svg.png	1343

Okay, I checked several more cases and they all seem to be coming from the REST endpoint for page summary. For example, there is another one that hits https://en.wikipedia.org/api/rest_v1/page/summary/World_War_II and immediately afterwards https://upload.wikimedia.org/wikipedia/commons/thumb/1/10/Bundesarchiv_Bild_101I-646-5188-17%2C_Flugzeuge_Junkers_Ju_87.jpg/480px-Bundesarchiv_Bild_101I-646-5188-17%2C_Flugzeuge_Junkers_Ju_87.jpg, which is the same image provided in the thumbnail attribute of the page summary endpoint, but when I open the link the thumbnail is a different size o.O
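The mismatch described above (the summary advertises one thumbnail size while a different size is requested) can be expressed as a small Python check. This is only a sketch: it works on an already-parsed summary response, the `thumbnail` / `source` key layout is an assumption about the page summary payload, the helper names are made up, and the sample dictionary is hand-built from the two Flag_of_France URLs quoted earlier in this thread rather than fetched live.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

THUMBSIZE_RE = re.compile(r"([0-9]+)px[^/]+$")


def thumb_width(url: str) -> Optional[int]:
    """Extract the NNN from an .../NNNpx-Name thumbnail URL, if present."""
    match = THUMBSIZE_RE.search(urlsplit(url).path)
    return int(match.group(1)) if match else None


def summary_thumb_width(summary: dict) -> Optional[int]:
    """Width advertised by a parsed page-summary response (assumed key layout)."""
    source = summary.get("thumbnail", {}).get("source")
    return thumb_width(source) if source else None


# Hand-built sample mirroring the URLs quoted in the thread (not a live response).
sample_summary = {
    "thumbnail": {
        "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/langde-330px-Flag_of_France.svg.png"
    }
}
requested_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/480px-Flag_of_France.svg.png"

advertised = summary_thumb_width(sample_summary)
requested = thumb_width(requested_url)
print(advertised, requested, advertised == requested)
```

Running a check like this over pairs of summary responses and the image requests that follow them would be one way to confirm whether the 480px requests are derived from the summary URL by the client.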

Change #1211107 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/Popups@master] constants: Force 480px to 500px

https://gerrit.wikimedia.org/r/1211107

Change #1211107 abandoned by Ladsgroup:

[mediawiki/extensions/Popups@master] constants: Force 480px to 500px

Reason:

Done in a different way

https://gerrit.wikimedia.org/r/1211107