
Automatically clean up unused thumbnails in Swift
Open, MediumPublic

Assigned To
None
Authored By
Gilles
Dec 11 2018, 9:26 AM
Referenced Files
F37156669: cumulative.png
Jul 31 2023, 7:52 AM
F37156666: ages.png
Jul 31 2023, 7:52 AM
F37153629: cumulative.png
Jul 28 2023, 3:57 PM
F37153627: ages.png
Jul 28 2023, 3:57 PM

Description

In order to make room for enabling WebP thumbnail generation for all images, we need to have a continuous (or regular) cleanup mechanism for little-accessed thumbnails on Swift.

This job would delete thumbnails that meet the following criteria:

  • hasn't been accessed in 30 days
  • still has an original

The continuous/regular nature of the job is useful to allow us to introduce new thumbnail formats in the future. For example, serving thumbnails of lower quality to users with Save-Data enabled, or new formats on the horizon like AVIF (AV1's equivalent of WebP, which is said to vastly outperform WebP).

Related Objects

Event Timeline


It looks like the maximum rate at which swift-object-expirer will issue deletes is configurable via tasks_per_second, which defaults to 50. (I think the rate-limit is per thread, so the effective rate limit is concurrency * tasks_per_second).

We can set this to a very low value for production, and then ratchet it up gradually.

Unfortunately tasks_per_second was only added in 2.27, and we're running 2.10.

Unfortunately tasks_per_second was only added in 2.27, and we're running 2.10.

Not for long! There are 23 remaining backends running Stretch/2.10 which are on their way to being decommissioned, and the two remaining frontends will also be updated once swift-repl is sorted out. Should be complete by the end of this quarter.

What @MoritzMuehlenhoff said (though we'll be upgrading to 2.26). At any rate, object-expirer removes the actual objects from disk and from listings, and at expiration time the objects will start 404'ing (independent of whether the expirer process is running), so I think we'd be okay in terms of semantics even if it falls behind. If the expirer has too much work to do, that'd be fine too I think, and we can limit its concurrency until tasks_per_second is available. What do you think?

The reason ratelimiting via tasks_per_second was introduced (per the bug) was to prevent background daemons from hammering the disk at the expense of client requests, but it looks like we can (and do) control that with ionice configs.

If disk I/O saturation is a concern, we could try setting the ionice_class for object-expirer to IOPRIO_CLASS_IDLE, which will mean object-expirer only gets I/O time when no one else needs the disk. Currently it's set to IOPRIO_CLASS_BE (the default), same as object-server. However, if the disk is constantly accessed, then object-expirer will not make any progress.

The reason ratelimiting via tasks_per_second was introduced (per the bug) was to prevent background daemons from hammering the disk at the expense of client requests, but it looks like we can (and do) control that with ionice configs.

If disk I/O saturation is a concern, we could try setting the ionice_class for object-expirer to IOPRIO_CLASS_IDLE, which will mean object-expirer only gets I/O time when no one else needs the disk. Currently it's set to IOPRIO_CLASS_BE (the default), same as object-server. However, if the disk is constantly accessed, then object-expirer will not make any progress.

Agreed, that's a big if! It's a bit counter-intuitive, but object-expirer doesn't generate any disk I/O per se (AFAICT); rather, it issues DELETEs internally to the whole cluster. At any rate, we have various ways to limit the expirer (including figuring something out to make tasks_per_second available, if everything else fails).

In terms of "de-risking" the if above I think we can employ different strategies, from teaching Thumbor to expire only a sample of thumbs to a completely external process that goes in and updates thumbs with expiration dates, with tunable rates of course!

cc @MatthewVernon since this will be of interest too

I see! Do we need to employ any of these strategies, then? What (if anything) should be done before we flip this on for prod?

I see! Do we need to employ any of these strategies, then? What (if anything) should be done before we flip this on for prod?

We 100% need a point person (> 1 ideally) for this work, to make sure all the bits and pieces are in place (e.g. my comment re: statsd for object-expirer https://phabricator.wikimedia.org/T211661#6580669). I have other commitments (i.e. o11y, and no longer swift maintainership) but I'm happy to assist/provide feedback as needed!

On the technical side, IMHO once we have basic visibility into the expirer's status we should turn on object-expirer + thumb expiration in thumbor for a little while, and assess impact.

@fgiunchedi and I spoke about this today. Some notes:

Work queue

When Swift receives an object with an expiration, the expiration is recorded in a .expiring_objects account, which functions as a work queue for the object-expirer. Accounts are implemented as SQLite databases. If the object-expirer is deleting items at a lower rate than expirations are set, the queue would keep growing. There's probably a large margin here (i.e. the queue can grow very large without issue) but we likely need some monitoring around this.
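For context, here is a minimal sketch of how an object gets onto that queue, assuming python-swiftclient; the auth endpoint, credentials, container and object names are placeholders, not the real values:

# Sketch only: storing an object with an expiration via python-swiftclient.
# The auth URL, credentials, container and object names are placeholders.
import swiftclient

conn = swiftclient.client.Connection(
    authurl='https://swift.example.org/auth/v1.0',   # hypothetical endpoint
    user='thumbor:rw',                               # hypothetical credentials
    key='secret',
)

# X-Delete-After is relative (seconds); Swift converts it to an absolute
# X-Delete-At timestamp and records the pending deletion in the
# .expiring_objects account, which object-expirer later works through.
conn.put_object(
    'wikipedia-commons-local-thumb.8e',              # example container
    'some/thumb/path.jpg',                           # example object name
    contents=b'...thumbnail bytes...',
    content_type='image/jpeg',
    headers={'X-Delete-After': str(30 * 24 * 3600)}, # 30 days
)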

Monitoring

The object expirer has some built-in metrics reporting but we need to make sure those get collected by Prometheus by configuring them in the role::prometheus::statsd_exporter::mappings config.

Preventing stampedes

AIUI the current implementation on the Beta Cluster only sets object expiry when the thumbnail is initially created. This can lead to stampedes. To mitigate this risk, we should randomize the TTL (add jitter). thumb.php does this but rewrite.py does not, AFAICT.
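For illustration, a minimal sketch of TTL jitter (the helper and the spread value are hypothetical, not the thumb.php implementation):

# Illustrative only: randomized TTL so thumbnails generated in the same burst
# don't all expire at the same moment. Names and the spread are assumptions.
import random

BASE_TTL = 30 * 24 * 3600   # 30 days, per the criteria in the description
SPREAD = 0.25               # assumption: up to +/-25% jitter

def jittered_ttl(base=BASE_TTL, spread=SPREAD):
    # Pick a TTL uniformly in [base * (1 - spread), base * (1 + spread)].
    return int(base * random.uniform(1 - spread, 1 + spread))

# e.g. headers={'X-Delete-After': str(jittered_ttl())}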

Do we need this at all?

Lastly, there's the question of whether we want to do this at all. The argument: we've managed so far; storage has gotten cheaper; and thumbnails represent a small portion of the overall media storage footprint, with originals making up the bulk. If we do decide not to go ahead with thumbnail expiration at all, though, it should be an explicit decision, and we should then unblock T27611 (enabling WebP for all images).

Change 818145 had a related patch set uploaded (by Ori; author: Ori):

[operations/puppet@production] Randomize thumbnail TTL to prevent stampedes

https://gerrit.wikimedia.org/r/818145

We need to decide if we want to make this change, taking into consideration the fact that the resource savings (in dollar terms) are relatively small.

Napkin-math:

  • Thumbnails take up 500 TB × 3 replicas × 2 data centers ≈ 3 PB.
  • We don't know (haven't analyzed) what portion of the thumbnail data set is cold and would expire under the new regime. Let's guess one-half, or 1.5 PB.
  • The hardware costs of storing this much data are in the order of $25k / year.

$25k is equivalent to 6 to 12 weeks of Wikimedia SRE pay (based on Glassdoor; I have no privileged information). I think there's probably two weeks' worth of work for an SRE to enable this, and maybe a few days each year thereafter in maintenance. So from that perspective it is worth doing. However, this does not take into account opportunity cost (loss of gains from other things folks could be working on). My hunch is that there are probably lower-hanging fruit, but I don't know what they are off the top of my head.
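The napkin math above, spelled out (the per-TB figure is backed out of the quoted $25k/year estimate, not an independently sourced number):

# Napkin math from the bullet list above; the cost per TB is implied by the
# quoted $25k/year figure, not an independent data point.
raw_thumbs_tb = 500                                # logical thumbnail data
replicas, datacenters = 3, 2
total_tb = raw_thumbs_tb * replicas * datacenters  # 3,000 TB ~= 3 PB
cold_fraction = 0.5                                # guess: half would expire
reclaimable_tb = total_tb * cold_fraction          # 1,500 TB ~= 1.5 PB
annual_cost = 25_000                               # dollars/year, as estimated
implied_cost_per_tb = annual_cost / reclaimable_tb # ~= $16.7 per TB per year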

To my mind, the decision to proceed or abandon this task should hinge on whether or not we see this as moving the thumbnail architecture forward, or holding it back by adding complexity. As several people have pointed out in conversation with me, the fact that we're storing derivative artifacts that can be re-generated relatively cheaply in 3 replicas, in 2 data centers in Swift and in Varnish in all data centers is dubious and merits a serious re-think. Adding object expiration to Swift is arguably turning it into an LRU cache, which is what Varnish / ATS should be. AIUI, the historical decision not to rely on the backend Varnish layer for thumbnail storage had to do with questions about durability or reliability and I don't know that those are still relevant.

As several people have pointed out in conversation with me, the fact that we're storing derivative artifacts that can be re-generated relatively cheaply in 3 replicas, in 2 data centers in Swift and in Varnish in all data centers is dubious and merits a serious re-think.

One important data point here: we have very limited capacity for rendering thumbnails. If we want to move forward with any change to swift data storage that would make our coverage of thumbnails smaller, we'd have to beef up our thumbor cluster significantly. I would also expect us to need to add better observability to thumbor itself. I think the cost in terms of additional hardware wouldn't be significantly different from adding more storage, and it would result in worse performance for the end user and an overall less stable system.

On the other hand, right now we store every thumbnail indefinitely. I would consider adding a TTL to thumbnails at non-standard sizes (so, typically, sizes *not* used on our sites and not pre-generated by us), and to very large objects, first.

However, I feel uncomfortable providing anything more than hints until we have some hard numbers on the distribution of requests to thumbnails.

We know that of all the requests to varnish-upload:

  • 78-80% get cached at the edge frontend (varnish)
  • of the remaining 20%, 55-60% get cached by trafficserver
  • of the remaining 8-10%, which is between 5k and 6k rps typically, less than 50 per second hit thumbor, but these numbers include everything.
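Rough arithmetic behind those percentages (the total request rate is backed out from the "5k and 6k rps" figure above, so treat it as approximate):

# Rough chain of the hit-rate figures above; total_rps is implied by
# "8-10% is between 5k and 6k rps", not independently measured.
total_rps = 60_000                      # ~5.5k rps being ~9% of traffic
varnish_hit = 0.79                      # 78-80% cached at the edge frontend
ats_hit_of_remainder = 0.575            # 55-60% of the rest cached by ATS
reaches_swift = total_rps * (1 - varnish_hit) * (1 - ats_hit_of_remainder)
# ~5.4k rps reach swift; of those, fewer than ~50/s end up hitting thumbor.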

I would love to see some numbers on how many thumbnail requests get a response from swift that doesn't require regenerating a thumbnail. If the number is low, the caching provided by swift isn't useful and the arguments made here would be even more compelling.

While Disk Is Cheap (TM), container listing is not and our thumbs containers are the largest in terms of number-of-objects; I'm not entirely relaxed about the idea of never expiring any thumbnails ever.

SQLite? I'm surprised it uses that; it may not be the most performant option for such a queue...

Removing inactive task assignee (please do so as part of offboarding processes).

We had a bit of a chat about this today, and thought it worth noting some of the reasons it would be good to actually get this change deployed:

  • The benefits are about limiting the number of objects in containers rather than simply reducing disk usage
  • There are different ways in which continuing to add items to containers ad infinitum can bite us:
    • Rclone (standard tool, planned to replace swiftrepl) likes to load entire container listings into memory, causing significant RAM use on big containers
    • Containers are stored on relatively small disks, so disk exhaustion is a possible issue (which we have seen on thanos)
    • Swift's fallocate behaviour when adding items to containers can cause outages, e.g. T306424, T307184
    • Ceph handles large numbers of objects in buckets reasonably well (though dynamic resharding is still not available in multi-site setups), but it is still slow to list very large buckets

Change 489022 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] Set expiry headers on thumbnails

https://gerrit.wikimedia.org/r/489022

I would love to see some numbers on how many thumbnail requests get a response from swift that doesn't require regenerating a thumbnail. If the number is low, the caching provided by swift isn't useful and the arguments made here would be even more compelling.

This should be doable, not super straightforward but doable. You can query webrequest_upload for a day or so for edge cache misses (more info) and compare the numbers.

Another way that might be much, much easier: delete old thumbnails in a small portion of swift and check how much the request rate to thumbor increases.

Trying to get some numbers on thumbnails and how many of them actually reach swift vs. how many need thumbnailing got a bit derailed, but I have some interesting (and slightly off-topic) notes on things that would reduce our swift storage issues.

So first, if the editor doesn't explicitly set the thumbnail size in the article, for most users it goes to 220px. That's the default thumbsize. But MediaWiki allows users to change that in their preferences. It allows these values: [ 120, 150, 180, 200, 220, 250, 300, 400 ] (except on a couple of wikis where the values are different, because why not).

I think the config itself is important and should stay, especially for accessibility reasons, but eight different options is way too many: it allows for a lot of fragmentation and extra storage usage, and most of the options are not even popular. One of them is enabled for only 800 users, while the most popular one is enabled for 9.2 million users and the second most popular for 31k. In other words, 98.9% of users are using the default and the second most popular one is 0.33%. The best part is that we can't really remove values from the list, because they are stored as index numbers for users ("2" being the default), and if we remove one, everyone's values will shift around.

The best part: We don't even pre-generate thumbnails for these values (at least I couldn't find it by reading the source code). The ThumbnailRender job only gets triggered for another set of thumbnail sizes. The ones you see when you check an image page (UploadThumbnailRenderMap): [ 320, 640, 800, 1024, 1280, 1920 ]. See LocalFile::prerenderThumbnails().

These are actually quite useless. I'll show you the numbers.

So you can query hadoop for thumb sizes and the request scale and hits, etc.:

select
  cache_status,
  split(split(uri_path, '/')[7], 'px-')[0] as thumbsize,
  count(*) as hitcount
from wmf.webrequest
where webrequest_source = 'upload'
  and year = 2022 and month = 11 and day = 6
  and http_status = 200
  and uri_path like '/wikipedia/%/thumb/%'
group by split(split(uri_path, '/')[7], 'px-')[0], cache_status
order by hitcount desc
limit 5000;

Here are the top hits:

cache_status	thumbsize	hitcount
hit-front	220	228563xxx
hit-front	40	118547xxx
hit-front	20	81024xxx
hit-front	23	72641xxx
hit-local	220	67274xxx
hit-front	160	60705xxx
hit-front	30	57538xxx
hit-front	45	51180xxx

220px is obviously the top. 40px and 20px are probably icons and flags; I wonder what 23px is, and whether it can be unified with 20px, etc.

I have all of the data, let me write down some stuff.

The pre-generated hit requests:

size	hits	percentage of total thumb requests
320px	25M	0.89%
640px	24M	0.85%
800px	19M	0.68%
1024px	7M	0.24%
1280px	24M	0.84%
1920px	5.3M	0.20%

OTOH:

size	hits	percentage of total thumb requests
120px	96M	3.4%
150px	51M	1.8%
180px	31M	1.1%
200px	62M	2.2%
220px	303M	11%
250px	58M	2.1%
300px	51M	1.8%
400px	15M	0.55%

FWIW, here are the cache hit ratio numbers at the edges:

cache_status	requests	percentage
hit-front	2120M	75.7%
hit-local	555M	19.8%
miss	124M	4.4%
pass	13k	0.00049%

So in a day, 124M requests hit swift, which is about 1,435/second. That's quite a bit higher than what thumbor gets (which seems to be around 25/sec for both DCs, and that probably includes pre-generations as well). This shows swift is actually useful in absorbing quite a lot of load. Checking the distribution of these requests shouldn't be hard, though: we can check swift logs and compare them with the container data (how old the objects are, what the distribution is, etc.), with sampling or something like that.

FWIW, the number of misses that use the standard thumb sizes (not the pre-generated ones) is around 39M (one fourth of all misses), and the pre-generated sizes account for 18M.

Last but not least, I realized new frontend features have been fragmenting thumbnail sizes and need to be fixed ASAP.

  • MediaWiki's default in thumbnails in galleries and other small sizes is 120px
  • Page previews uses 320px (or 640px @2dppx)
  • search in portals (www.wikipedia.org) uses 160px
  • search in new vector (which is due to be deployed further soon) uses 100px (or 200px @2dppx)
  • the recently deployed feature of thumbnails in Special:Search uses 150px.

These clearly need to be unified. I'll make tasks for them ASAP.

@Ladsgroup Quick reply to a few details, with Enwiki examples:

Some ideas on how to move forward here:

Change 912837 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Remove 1024px and 1920px from pre-gen thumbsizes

https://gerrit.wikimedia.org/r/912837

Change 912837 merged by jenkins-bot:

[operations/mediawiki-config@master] Remove 1024px and 1920px from pre-gen thumbsizes

https://gerrit.wikimedia.org/r/912837

Mentioned in SAL (#wikimedia-operations) [2023-05-02T09:12:55Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:912837|Remove 1024px and 1920px from pre-gen thumbsizes (T211661)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-02T09:51:34Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:912837|Remove 1024px and 1920px from pre-gen thumbsizes (T211661)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-02T09:53:22Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:912837|Remove 1024px and 1920px from pre-gen thumbsizes (T211661)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-05-02T10:00:15Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:912837|Remove 1024px and 1920px from pre-gen thumbsizes (T211661)]] (duration: 08m 40s)

Worth noting here that we are now expiring objects that have an expiry header set (work was T229584).

@MatthewVernon: my understanding is that rewrite.py is currently setting expiry headers for thumbnails on retrieval from Swift -- is that correct, and does that mean some thumbnails are already getting expired?

@MatthewVernon: my understanding is that rewrite.py is currently setting expiry headers for thumbnails on retrieval from Swift -- is that correct, and does that mean some thumbnails are already getting expired?

From what I'm seeing, the relevant check is if key == 'X-Delete-At' and self.thumbnail_update_expiry_headers:, which means the file needs to have the header in the first place; rewrite.py will then bump it when accessed. So we still need to tell thumbor to set the expiry header.

@MatthewVernon: my understanding is that rewrite.py is currently setting expiry headers for thumbnails on retrieval from Swift -- is that correct, and does that mean some thumbnails are already getting expired?

From what I'm seeing, the relevant check is if key == 'X-Delete-At' and self.thumbnail_update_expiry_headers:, which means the file needs to have the header in the first place; rewrite.py will then bump it when accessed. So we still need to tell thumbor to set the expiry header.

Yes, that's my understanding of the code, too.

If we want to move to expiring thumbnails, then it'd be worth thumbor setting an expiry header.
I think, though, we might want to remove the check for an existing expiry header in rewrite.py? That way we'll start adding expiry headers to older thumbs, too, and at some point down the line we could then consider deleting thumbs with no expiry header at all (on the basis they've not been accessed in "the recent past")?

Right. Now I remember. The initial expiration is indeed supposed to be set by Thumbor. The necessary functionality had some trouble landing in the Wikimedia Thumbor plugin repo, but it has since landed.

We can enable it by setting values for SWIFT_THUMBNAIL_EXPIRY_SECONDS and SWIFT_THUMBNAIL_EXPIRY_SAMPLING_FACTOR.
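Since Thumbor configuration is plain Python, enabling this might look roughly like the following; the values are placeholders for discussion, not recommendations, and the sampling-factor semantics are an assumption based on the setting's name:

# Hypothetical thumbor config snippet; values are placeholders.
# SWIFT_THUMBNAIL_EXPIRY_SECONDS: X-Delete-After value set on newly stored thumbs.
# SWIFT_THUMBNAIL_EXPIRY_SAMPLING_FACTOR: presumably only ~1 in N thumbnails
# gets the header, to limit how fast expirations (and re-renders) happen.
SWIFT_THUMBNAIL_EXPIRY_SECONDS = 30 * 24 * 3600    # 30 days
SWIFT_THUMBNAIL_EXPIRY_SAMPLING_FACTOR = 10        # 1 in 10 thumbnails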

The piece that seems a little shaky to me is the updating of expiration time by rewrite.py. This is done by firing off an additional request asynchronously (code). If the request fails because the authentication token expired, a new token is generated and the request is retried. So in the worst case we generate three additional requests (request with expired token, request to obtain new token, request with the new token). There is no test coverage for this code AFAICT. There are various things that could go wrong, but my main worry would be connections piling up on the Python side until Python runs out of memory or green threads.
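To make the failure mode concrete, here is a rough sketch of that "refresh expiry on read, retry once on a stale token" flow; this is not the actual rewrite.py code, just its shape, assuming python-swiftclient and eventlet green threads:

# Illustrative sketch only (not rewrite.py): worst case is POST with a stale
# token -> re-auth -> retry, i.e. up to three extra requests per read.
import eventlet
from swiftclient import client as swift

def bump_expiry(storage_url, token, container, obj, ttl, auth_url, user, key):
    headers = {'X-Delete-After': str(ttl)}
    try:
        swift.post_object(storage_url, token, container, obj, headers=headers)
    except swift.ClientException as e:
        if e.http_status != 401:
            raise
        # Token expired: fetch a new one and retry once.
        storage_url, token = swift.get_auth(auth_url, user, key)
        swift.post_object(storage_url, token, container, obj, headers=headers)

# Fired asynchronously for each thumbnail read; every green thread holds a
# connection until its POST completes, which is where the pile-up worry
# comes from. e.g.:
# eventlet.spawn_n(bump_expiry, storage_url, token, container, obj, ttl,
#                  auth_url, user, key)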

I also don't know how well Swift would handle ~1.5k QPS of object metadata updates (cf T211661#8377883)

Yeah, this is my concern, too - we used to spawn extra requests to copy new thumbnails to the other DC and that caused outages (cf T313102), so we stopped doing that. As @Eevans would put it - we're turning a lightweight read-only operation into a heavyweight read-write operation - I think he's been thinking in terms of some sort of sidecar process to do an LRU-type thing.

We might look at the distribution of ages of requested thumbs and pick an initial expiry age based on that (and then not update it)? Obviously then we're regenerating popular thumbs periodically, which is more load on thumbor...

It might sound a bit stupid, but why not just gradually, slowly, rolling-delete all thumbnails? If one is needed, it'll be regenerated.

If that's too much pressure on thumbor, we could set the header at random (let's say a 1/10th sample). That would take care of popular files, and the cost of deleting files that are accessed but never got the header set should be fine, I think (we could extend the thumb deletion age to 90 or even 180 days to make sure that's taken care of?).

I think the difficulty relates to the sheer number of thumbnails:

# Sum the object counts reported by `swift stat` across all the commons thumbnail containers:
x=0
for i in $(swift list --prefix wikipedia-commons-local-thumb) ; do
 (( x += $( swift stat "$i" | sed -ne '/Objects/s/^.*: //p' ) ))
done
echo "$x"

tells me we currently have 2,102,572,474 thumbnails for commons (across 256 containers). Out of interest, I went looking at how many of these we served on 24 July.
In eqiad, across all the frontends, we served 149,521,745 requests (of any type) for wikipedia-commons-local-thumb ("thumbs" hereafter, though obviously it's not all of them), out of a total of 188,576,951 requests (so 79% of requests were for thumbs).
Those 149,521,745 requests relate to approximately (I've not e.g. filtered out listing requests and so on) 95,289,363 distinct thumbnails; these are not evenly distributed: while each thumbnail is requested on average 1.6 times, the most common entry is requested 8,924 times.

Which means that if 24 July is a representative day, we actually request about 4.5% of our stored thumbnails.

I guess the obvious follow-on question is "how old are the thumbs we're requesting?", or put differently, "how many more thumbs would we have to generate if we deleted thumbnails older than X?" I think if I take my aggregated thumb lists and combine them with the output of swift list -l I can answer that, too, but it'll take a bit more time.

[back-of-the-envelope suggests a few days, so don't expect immediate answers!]
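If it helps, here is a rough sketch of that join (assuming `swift list -l` prints size, date, time and object name as whitespace-separated columns, which may need adjusting, and a file of requested object names extracted from the frontend logs for the day):

# Rough sketch of the "ages of stored vs. requested thumbs" join described
# above. Column layout of `swift list -l` is an assumption; adjust as needed.
import sys
from collections import Counter
from datetime import date

REFERENCE_DAY = date(2023, 7, 24)   # the day the request sample was taken

def load_ages(listing_path):
    ages = {}
    with open(listing_path) as f:
        for line in f:
            parts = line.rstrip('\n').split(None, 3)
            if len(parts) < 4:
                continue
            _size, day, _time, name = parts
            ages[name] = (REFERENCE_DAY - date.fromisoformat(day)).days
    return ages

ages = load_ages(sys.argv[1])                       # output of `swift list -l`
requested = [l.strip() for l in open(sys.argv[2])]  # requested object names

stored_hist = Counter(ages.values())
requested_hist = Counter(ages[n] for n in requested if n in ages)
# These two histograms are what the frequency / cumulative-frequency plots
# below are drawn from.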

[…] Out of interest, I went looking at how many of these we served on 24 July. In eqiad across all the frontends, we served 149,521,745 requests for [thumbs] […]

To clarify, this is Swift frontends. The term "frontend" had me thinking this was related to CDN frontends and thus that this represented a portion of edge traffic, which seemed rather low to me.

Instead, these are requests that missed the Varnish frontends, then missed the ATS backends, at POPs that map read requests to eqiad (Swift is multi-DC for reads; some POPs map to codfw), and were then routed to Swift. In other words, requests at the last layer before hitting Thumbor.

Random idea: the expiry header could be set either with sampling (let's say 1/10th) OR if the existing expiry is in the near future (let's say within a week). Just a random thought.
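As a sketch, that decision could look like this (names and thresholds are hypothetical):

# Sketch of the idea above: refresh the expiry header only for a random sample
# of reads, or when the existing expiry is imminent. Values are placeholders.
import random, time

SAMPLE_RATE = 0.1              # ~1/10th of reads
NEAR_FUTURE = 7 * 24 * 3600    # "within a week"

def should_refresh_expiry(existing_delete_at=None):
    if existing_delete_at is not None and existing_delete_at - time.time() < NEAR_FUTURE:
        return True            # expiry is imminent: always refresh
    return random.random() < SAMPLE_RATE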

Here are a couple of rough graphs - the frequency distribution of thumbnail ages (thumbs served by swift on 24 July, and all stored thumbs), with the y-axis restricted (since ~10% of all thumbs are in the oldest couple of days),

ages.png (960×1 px, 14 KB)
and the cumulative distribution.
cumulative.png (960×1 px, 11 KB)

I should tweak the output to include a couple more significant figures; and we should look at why we're serving some thumbnails over 8,000 times in a day from swift (rather than them being cached...).

Here are slightly nicer figures (more significant figures, which means the lines are rather more accurate) - the frequency distribution (again with the y-axis limited at 0.004):

ages.png (960×1 px, 19 KB)

And the cumulative frequency curve:
cumulative.png (960×1 px, 11 KB)

Note that requested thumbnails are a little younger than all thumbnails (the cumulative frequency curves cross at about day 1314), and that there's a large chunk of thumbnails at the oldest possible dates (6.5% of served thumbnails and 14% of stored thumbnails are from days 3545-3656, which is 2013-10-19 to 2013-11-08; I'm guessing that's from when we first started storing thumbnails).

The other thing I can't quite leave alone is - why are we being asked for some thumbnails so often? Shouldn't the CDN be caching thumbs? If we served each thumb only once in that 24 hour period, that would have saved about 54 million requests to swift (which is 29% of the requests swift served), which is non-trivial...

Commonest-served thumbs on that day (with request counts):

8924 wikipedia-commons-local-thumb.8e/8/8e/Edit_remove.svg/15px-Edit_remove.svg.png
8053 wikipedia-commons-local-thumb.2c/2/2c/Broom_icon.svg/22px-Broom_icon.svg.png
6268 wikipedia-commons-local-thumb.de/d/de/Wynn.svg/25px-Wynn.svg.png
6264 wikipedia-commons-local-thumb.33/3/33/Crystal_Clear_action_viewmag.png/22px-Crystal_Clear_action_viewmag.png
6258 wikipedia-commons-local-thumb.1e/1/1e/Font_Awesome_5_solid_arrow-down.svg/19px-Font_Awesome_5_solid_arrow-down.svg.png
6256 wikipedia-commons-local-thumb.b2/b/b2/Font_Awesome_5_solid_arrow-up.svg/19px-Font_Awesome_5_solid_arrow-up.svg.png
5706 wikipedia-commons-local-thumb.b3/b/b3/Broom_icon_ref.svg/22px-Broom_icon_ref.svg.png
4990 wikipedia-commons-local-thumb.33/3/33/Crystal_Clear_action_viewmag.png/21px-Crystal_Clear_action_viewmag.png

The other thing I can't quite leave alone is - why are we being asked for some thumbnails so often? Shouldn't the CDN be caching thumbs? If we served each thumb only once in that 24 hour period, that would have saved about 54 million requests to swift (which is 29% of the requests swift served), which is non-trivial...

Commonest-served thumbs on that day (with request counts):

8924 wikipedia-commons-local-thumb.8e/8/8e/Edit_remove.svg/15px-Edit_remove.svg.png
8053 wikipedia-commons-local-thumb.2c/2/2c/Broom_icon.svg/22px-Broom_icon.svg.png
6268 wikipedia-commons-local-thumb.de/d/de/Wynn.svg/25px-Wynn.svg.png
6264 wikipedia-commons-local-thumb.33/3/33/Crystal_Clear_action_viewmag.png/22px-Crystal_Clear_action_viewmag.png
6258 wikipedia-commons-local-thumb.1e/1/1e/Font_Awesome_5_solid_arrow-down.svg/19px-Font_Awesome_5_solid_arrow-down.svg.png
6256 wikipedia-commons-local-thumb.b2/b/b2/Font_Awesome_5_solid_arrow-up.svg/19px-Font_Awesome_5_solid_arrow-up.svg.png
5706 wikipedia-commons-local-thumb.b3/b/b3/Broom_icon_ref.svg/22px-Broom_icon_ref.svg.png
4990 wikipedia-commons-local-thumb.33/3/33/Crystal_Clear_action_viewmag.png/21px-Crystal_Clear_action_viewmag.png

I think it's because these images are loaded using uncacheable thumb.php URLs from popular user scripts and gadgets.
For example, MediaWiki:Gadget-hideSidebar.js on ukwiki loads https://commons.wikimedia.org/w/thumb.php?f=Edit%20remove.svg&w=15, which resolves to the top item on your list. thumb.php requests appear to be ineligible for caching (headers indicate cache status of 'pass').

Cumulatively, the eight commonest-served thumbs in your list make up just ~52k requests to Swift, so they probably don't matter as much as the long tail.

Changed hideSidebar upstream. You can update uk.wiki.
https://pl.wikipedia.org/wiki/Wikipedysta:Nux/hideSidebar.js

Those brooms are also mine. I will change that in pl:WP:SK, though I prefer the thumb.php URL as I don't have to know the "/8/8e/" thingy.

Change 818145 abandoned by Ori:

[operations/puppet@production] Randomize thumbnail TTL to prevent stampedes

Reason:

Setting TTLs in rewrite.py is the wrong approach for several reasons, discussed in the associated bug thread.

https://gerrit.wikimedia.org/r/818145

Change 947390 had a related patch set uploaded (by Ori; author: Ori):

[operations/puppet@production] Revert "Have the Swift rewrite proxy renew expiry headers"

https://gerrit.wikimedia.org/r/947390

Mentioned in SAL (#wikimedia-operations) [2023-08-10T13:47:26Z] <Emperor> depool and stop puppet on ms-fe2009 to test updated rewrite.py T211661

Mentioned in SAL (#wikimedia-operations) [2023-08-10T13:52:52Z] <Emperor> restart puppet and repool ms-fe2009 after testing T211661

The best part: We don't even pre-generate thumbnails for these values [referring to thumbnail sizes] (at least I couldn't find it by reading the source code). The ThumbnailRender job only gets triggered for another set of thumbnail sizes. The ones you see when you check an image page (UploadThumbnailRenderMap): [ 320, 640, 800, 1024, 1280, 1920 ]. See LocalFile::prerenderThumbnails().

BTW, the point of pre-rendered thumbnails is not how often they are accessed, but how quickly they are available when requested. You don't need it for thumbnails embedded in pages, because most thumbnails will render relatively quickly and async with the rest of the page, so there is not really a need to pre-render those for the first person to write and preview the article.

The pre-render sizes were introduced for two major use cases.
1: The links underneath the preview on the File page. People tended to expect these to respond quickly, but they could often be rather slow because the thumb is significantly larger.
2: Bucketing for MultimediaViewer, where, similarly, the responsiveness of showing a large thumbnail was important. https://github.com/wikimedia/mediawiki-extensions-MultimediaViewer/blob/9b1d3ca24d0ff0961b08a797d8028181f8c4d512/resources/mmv/mmv.ThumbnailWidthCalculator.js#L43

The wgUploadThumbnailRenderMap, wgImageLimits, and hardcoded MMV buckets were never really consolidated into one feature (and possibly that isn't even needed any longer), but I wanted to share that.

If we were to pre-render small thumbnails, it would probably be best to prerender the thumbnails for galleries and categories, because those run into T266155.

Is it possible to keep prerendered thumbnails indefinitely? I don't want to see search results full of broken thumbnails because these thumbnails are automatically cleared.

Is it possible to keep prerendered thumbnails indefinitely? I don't want to see search results full of broken thumbnails because these thumbnails are automatically cleared.

If a thumbnail is cleared, it will simply be regenerated upon access; it won't be lost.

Yes, the thumbnails can be regenerated. But with a page load time of 10+ seconds... this is unacceptable for hot traffic to search results, categories and galleries. What's worse is that when the user goes to the next page, few thumbnails can render because of T266155, as said above. I encounter this situation regularly. It's a very bad experience. But at least if I load a category successfully, it will continue to load for anyone, for now. The prerendered thumbnails are exactly meant to prevent this from happening.

I understand, and we won't delete everything all at once; it'll be at least a slow rolling deletion. I wanted to suggest keeping some pre-generated sizes, but for other reasons I don't think that's a good idea (it's a bit hard to explain here). But rest assured the user impact will be minimal.

Keep your word, don't delete everything at once. I can easily imagine revisiting a category 31 days later and finding all the thumbnails gone, because they were all generated 31 days earlier and so all expired at the same time, one day before my visit.

Thumbs that are being used get cached in the CDN in any case.