
Automatically clean up unused thumbnails in Swift
Open, NormalPublic

Description

In order to make room for enabling WebP thumbnail generation for all images, we need to have a continuous (or regular) cleanup mechanism for little-accessed thumbnails on Swift.

This job would delete thumbnails that meet the following criteria:

  • hasn't been accessed in 30 days
  • still has an original

The continuous/regular nature of the job is useful to allow us to introduce new thumbnail formats in the future. For example, serving thumbnails of lower quality to users with Save-Data enabled, or new formats on the horizon like AVIF (AV1's equivalent of WebP, which is said to vastly outperform WebP).

Event Timeline

Restricted Application added a project: Operations. · View Herald TranscriptDec 11 2018, 9:26 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Gilles triaged this task as Normal priority.Dec 11 2018, 9:27 AM

Thanks @Gilles for kickstarting this! For context these are the notes I took when we did the first round of cleanup a couple of years back: https://wikitech.wikimedia.org/wiki/Swift/Thumbnails_Cleanup

Some ideas IIRC came up at the time for an online deletion job:

  1. Use a bloom filter (e.g. via bloomd) to test for last-accessed; the filter can be populated from the varnish access logs and/or the swift access log firehoses. The false positive rate shouldn't be a problem: on a given run we'll simply skip a certain % of deletion candidates. Said false positives should be picked up on a later run, though, once the filter is populated from different data.
  2. To test for original existence we can ask swift directly; as long as the request rate is limited (e.g. < 500/s) it shouldn't have a significant impact.
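A minimal sketch of how those two ideas could fit together. The `BloomFilter` here is a plain-Python stand-in for bloomd, and `original_exists` is a caller-supplied callback standing in for a rate-limited HEAD against Swift; both names are hypothetical, not part of any existing tooling:

```python
import hashlib
import math


class BloomFilter:
    """Toy bloom filter; production would use bloomd as suggested above."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing formulas for an (n, p) bloom filter.
        self.size = int(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        self.hashes = max(1, int(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def deletion_candidates(thumb_urls, accessed_filter, original_exists):
    """Yield thumbnails that look unaccessed and whose original still exists.

    A false positive in the filter only means we skip a candidate on this
    run; a later run with a freshly populated filter can catch it.
    """
    for url in thumb_urls:
        if url in accessed_filter:
            continue  # seen recently, or a false positive: skip this run
        if original_exists(url):
            yield url
```

Note that a false positive can only delay a deletion, never cause a wrong one, which matches the argument in point 1.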

HTH!

ema moved this task from Triage to Watching on the Traffic board.Dec 12 2018, 2:22 PM

I've looked at bloomd's capabilities and it doesn't have a concept of things expiring, which makes sense given the nature of a bloom filter.

What if we had 30 instances of bloomd running at once, each started one day after the other? I.e. the first bloomd instance was started 30 days ago, the second one 29 days ago, etc. When thumbnail hits happen, we send updates to all 30 of them, but when we want to query the filter, we always look up the oldest bloomd instance. Every day we shut down and clean up the oldest instance and start a new one.

They can all persist on disk, which will make them resistant to reboots. This is very important to retain the 30 days of history.

The big unknown is how memory-intensive a given instance would be in practice, given the very long tail of content we have, which would have to be multiplied by approximately 30. I guess we can start with a single instance of bloomd on a codfw machine and see how large it grows over time?

To mitigate the duplicated memory use, we can also reduce the granularity to, say, 5 weekly instances: one started at most 35 days ago, the second one 28 days ago, etc. That means that when the 35-day-old instance gets nuked, we still know about the last 28 days of thumbnails, and we keep at most 35 days' worth of memory (in the logical sense) about thumbnail hits.
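The rolling-instance scheme above could be sketched roughly like this. `RollingBloomSet` and `filter_factory` are hypothetical names; in production each generation would be a persisted bloomd filter rather than an in-process object:

```python
from collections import deque


class RollingBloomSet:
    """N filters started a fixed interval apart. Writes go to every
    generation; reads consult the oldest, which always covers between
    (N-1) and N intervals of history. `filter_factory` can be anything
    set-like with add()/__contains__ (bloomd in production)."""

    def __init__(self, generations, filter_factory):
        self.factory = filter_factory
        self.filters = deque(filter_factory() for _ in range(generations))

    def record_access(self, key):
        # A thumbnail hit is fanned out to all generations.
        for f in self.filters:
            f.add(key)

    def accessed_recently(self, key):
        # The oldest generation has the longest view of history.
        return key in self.filters[0]

    def rotate(self):
        # Called once per interval (daily or weekly): drop the oldest
        # generation and start a fresh, empty one.
        self.filters.popleft()
        self.filters.append(self.factory())
```

With 5 weekly generations, an access recorded just before a rotation survives at least 4 more rotations, giving the 28-to-35-day history window described above.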

@fgiunchedi what do you think of this approach of rolling bloomd instances?

CDanis added a subscriber: CDanis.Feb 5 2019, 4:29 PM
ori added a subscriber: ori.Feb 5 2019, 6:19 PM

What orders of magnitude are we talking about with respect to: total number of thumbnails in Swift, and number of thumbnail requests per second?

ori added a comment.Feb 5 2019, 6:38 PM

It seems that Swift has built-in support for object expiration, which can be requested by setting a header (either X-Delete-After or X-Delete-At).
It also looks like the expiry can be re-set, either by first removing it via X-Remove-Delete-At, and then setting it anew, or by updating the metadata in-place.
Is there a reason why these mechanisms are not under consideration?

A straw-man proposal:

  • On thumbnail creation, set X-Delete-After: 2592000 (one month).
  • Each time a thumbnail is retrieved, there's a 0.01% chance we also use the opportunity to reset the expiry to one month.
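The straw-man could look roughly like this, where `post_metadata` is a stand-in for an asynchronous metadata POST to Swift (e.g. via python-swiftclient's `post_object`); the function names and the injected `rng` hook are illustrative only:

```python
import random

EXPIRY_SECONDS = 30 * 24 * 3600  # one month, i.e. 2592000
RENEW_PROBABILITY = 0.0001       # 0.01% of reads


def headers_for_new_thumbnail():
    """On creation, every thumbnail gets a relative expiry so Swift's
    object expirer will eventually vacuum it up."""
    return {"X-Delete-After": str(EXPIRY_SECONDS)}


def maybe_renew_on_read(post_metadata, rng=random.random):
    """On a sampled fraction of reads, re-POST the expiry header so a
    thumbnail that keeps being fetched never actually expires. Returns
    True when a renewal was issued."""
    if rng() < RENEW_PROBABILITY:
        post_metadata({"X-Delete-After": str(EXPIRY_SECONDS)})
        return True
    return False
```

Because the expiry is relative (X-Delete-After rather than an absolute X-Delete-At), every renewal pushes the deletion out a full month from the moment of the sampled read.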

This hadn't been considered because I didn't know that feature existed... Checking on the production Swift servers, swift-object-expirer isn't available, which is probably why I didn't know about it. This looks perfect for our needs. I need to figure out which swift version introduced this feature and/or whether it's an optional install of some kind on Debian.

For setting it on X% of accesses, preferably async, we'll just need a little logic to check that the thumbnail has a corresponding original. On thumbnail creation that's not needed, so I'll probably start by implementing that part.

ori added a comment.Feb 5 2019, 7:22 PM

(It might have to be X% of access on varnish, since I assume the most oft-requested thumbnails enjoy a very high varnish cache hit rate. You could do this asynchronously by having a daemon that samples thumbnail requests from the varnishkafka stream.)

We discussed this a bunch and we believe that basing this on Swift access alone with an expiry of a month is fine. The hottest things can't stay in Varnish for a month, even if you combine the longest lifetime possible at every Varnish layer. I can't remember what the combined Varnish expiry values were, but this is what @BBlack said.

ori added a comment.Feb 5 2019, 7:51 PM

The worry I had was this: a thumbnail that is requested once a minute on average probably has an approximately similar varnish cache hit rate to a thumbnail that is requested a hundred times per second. If that's true, then both thumbnails would be retrieved from Swift about as often, and they would be equally likely to have their expiry renewed before they're vacuumed up. This increases the risk of overload in case of a varnish cold restart or failure.

If you cranked the Swift retrieval sampling factor to a high enough value, this wouldn't matter, because both the once-per-minute and 100-per-second thumbnails would be virtually guaranteed to get refreshed before the 30 days are up. But then you're doing a lot more metadata updates in Swift than you really need to. You could probably get away with doing far fewer metadata updates if you sampled varnish requests instead.

There's also a fair bit of experience at the WMF in setting up and operating varnishlog-tailing daemons, of which this would be an instance. Is there a similarly robust way of tailing the swift req logs? (My memory is very rusty...)

The simplest architecture really is to touch swift objects on every retrieval, which can be done async, but the unknown is how much extra load having every read causing a write would cause. I think we can easily find out by applying this logic to only one Commons thumbnail shard, or containers of smaller wikis. Same thing for experimenting with a percentage of requests triggering touches and seeing how much extra thumbor load that might cause.

The issue of tailing logs, be it varnish or swift, is that we have a very long tail of URLs being hit, and you're pretty much back to bloom filters being the optimal solution, in which case we end up doing what I suggested earlier.

I think it's worth trying the "simple thing" first on a portion of the traffic, as we might worry too much about the effects of updating a header asynchronously on each read. Even though it is a scary-looking 100x increase in write operations.

IIRC object expiration was considered years ago (i.e. https://wikitech.wikimedia.org/wiki/Swift/ObjectExpiration) and at the time was considered too buggy for production use, though after 4+ years I'm sure the situation has changed, so it's definitely worth reconsidering as an option too.

To answer a few questions:

There's also a fair bit of experience at the WMF in setting up and operating varnishlog-tailing daemons, of which this would be an instance. Is there a similarly robust way of tailing the swift req logs? (My memory is very rusty...)

Swift access logs ATM are stored only on the swift proxies themselves, though it should be reasonably easy to tail/process them nowadays.

What orders of magnitude are we talking about with respect to: total number of thumbnails in Swift, and number of thumbnail requests per second?

ATM there are ~1.2B objects in thumbnail containers. Re: thumbnails requested per second, I don't have a breakdown handy, though swift sees a peak of 1.5k requests a second (2xx+3xx, that is), of which I'm assuming thumbnails are the majority. Writes are about 100x less than that (i.e. ~10/s): https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&from=now-7d&to=now-1m

And indeed I share the concerns already mentioned, namely making sure we're able to have a bound on writes (deletes and/or metadata updates) as we're expiring thumbnails.

ori added a comment.Feb 6 2019, 3:35 PM

And indeed I share the concerns already mentioned, namely making sure we're able to have a bound on writes (deletes and/or metadata updates) as we're expiring thumbnails.

Hi Filippo! :) Thanks for humoring my kibitzing.

You could update metadata on 0.01% of requests, but cap that to a fixed maximum number of metadata updates per minute. And if your expiry times are relative (using X-Delete-After), the limit on metadata updates also gives you an upper bound on the rate of deletes. (But it also means that any time you tweak the rate of metadata updates, there is a delay of $TTL seconds before you see the effect on deletions.) I figure you'd start with a very conservative cap and then bump it up over time.
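A sketch of that sample-and-cap idea, assuming a simple fixed one-minute window; `CappedRenewer` is a hypothetical name, and the `roll` and `clock` parameters are injected only to make the behavior deterministic for testing (in production `roll` would be `random.random()`):

```python
import time


class CappedRenewer:
    """Renew on a small sampled fraction of requests, but never more than
    `max_per_minute` renewals. With relative expiries (X-Delete-After),
    bounding the renewal rate also bounds the later deletion rate."""

    def __init__(self, probability, max_per_minute, clock=time.monotonic):
        self.probability = probability
        self.max_per_minute = max_per_minute
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def should_renew(self, roll):
        now = self.clock()
        if now - self.window_start >= 60:
            # New one-minute window: reset the budget.
            self.window_start, self.count = now, 0
        if roll >= self.probability or self.count >= self.max_per_minute:
            return False
        self.count += 1
        return True
```

Starting with a conservative `max_per_minute` and raising it over time matches the ramp-up suggested above.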

jijiki added a subscriber: jijiki.Feb 6 2019, 4:13 PM

It's true that even if we only clean up a portion of thumbnails, we're already in a good place. The operational goal is to free up space at a rate fast enough to allow introducing new variants (eg. WebP). Ideally we can crank this up high enough that total storage use goes down over time, even slightly.

Gilles added a comment.EditedFeb 6 2019, 4:49 PM

What we can do is: if we see that the thumbnail already has an X-Delete-After header on get, we update the header. If it doesn't have the header yet, we roll the dice. This avoids the potential issue of a touched thumbnail appearing to be unused just because it didn't happen to get (randomly) touched on subsequent accesses.
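That rule (always renew thumbnails already enrolled; enroll the rest probabilistically) could be sketched as follows, with hypothetical names throughout:

```python
import random

EXPIRY = "2592000"  # 30 days in seconds


def renewal_headers(response_headers, probability=0.0001, rng=random.random):
    """Decide what, if anything, to POST back to Swift after a GET.

    Once a thumbnail is in the experiment (its response already carries
    X-Delete-After), always renew it; otherwise enroll it with the given
    probability. Returns the headers to POST, or None to do nothing.
    """
    if "X-Delete-After" in response_headers:
        return {"X-Delete-After": EXPIRY}  # enrolled: must keep renewing
    if rng() < probability:
        return {"X-Delete-After": EXPIRY}  # dice roll: enroll now
    return None
```

The key property is that enrollment is a one-way door: after the first dice roll succeeds, the thumbnail's fate depends only on whether it keeps being fetched, not on further random draws.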

jbond added a subscriber: jbond.Feb 6 2019, 5:28 PM

And indeed I share the concerns already mentioned, namely making sure we're able to have a bound on writes (deletes and/or metadata updates) as we're expiring thumbnails.

Hi Filippo! :) Thanks for humoring my kibitzing.
You could update metadata on 0.01% of requests, but cap that to a fixed maximum number of metadata updates per minute. And if your expiry times are relative (using X-Delete-After), the limit on metadata updates also gives you an upper bound on the rate of deletes. (But it also means that any time you tweak the rate of metadata updates, there is a delay of $TTL seconds before you see the effect on deletions.) I figure you'd start with a very conservative cap and then bump it up over time.

Hi Ori :) No problem -- thanks for chiming in, appreciated!

Agreed, starting with a very small percentage of requests (as seen by swift, AIUI) to set deletion on is a good knob to have.

What we can do is: if we see that the thumbnail already has an X-Delete-After header on get, we update the header. If it doesn't have the header yet, we roll the dice. This avoids the potential issue of a touched thumbnail appearing to be unused just because it didn't happen to get (randomly) touched on subsequent accesses.

Sounds like a good optimization going forward, though how would we limit the header updates once a large number of the thumbnails being fetched already have X-Delete-After set?

Also AIUI we'd do this work in swift proxy not varnish?

I thought we would start with a very low percentage and ramp it up gradually. And yes, I thought our beloved swift proxy is where it would live.

Is it possible to hold this a bit until after we upgrade all Thumbor servers to stretch? Two birds with one stone :)

I'd argue that we don't want both changes to happen around the same time. And this is probably less prone to emergency bugfixes than the Stretch upgrade would be. I would rather see this deployed first, as the Stretch upgrade will probably create a trickle of bugs to fix after it's deployed (potential SVG/PDF rendering regressions, etc.).

I'd argue that we don't want both changes to happen around the same time. And this is probably less prone to emergency bugfixes than the Stretch upgrade would be. I would rather see this deployed first, as the Stretch upgrade will probably create a trickle of bugs to fix after it's deployed (potential SVG/PDF rendering regressions, etc.).

You have a point, +1.

ori added a comment.EditedFeb 7 2019, 4:07 PM

I don't understand the preference for sampling Swift requests rather than Varnish requests. You'd have greater resilience to overload (for the reasons I cited above), and you'd have looser coupling and better fault tolerance by building on top of kafka.

I'm also not sure I understand the point of checking whether the image already has an expiry header or not. Is the idea that if an image doesn't have the header, then we don't have to worry about it expiring? That's true, but:

  • Initially, no thumbnail will have the expiry header, so without some additional criterion for when you should set the header, you'll never expire anything.
  • Eventually, every thumbnail will have the expiry header, so once the initial ramp-up is done, this is no longer a relevant signal at all.

So you're getting very little value out of this approach IMO, and it forces you to read response headers from Swift. By looking at responses coming from Swift rather than requests coming in to Varnish, you don't have a good signal for what the hottest thumbnails are, since somewhat-popular thumbnails will be retrieved from Swift about as often as extra-super-popular thumbnails.

Gilles added a comment.EditedFeb 7 2019, 8:37 PM

The "initial ramp up" might never be done: if we reach a point where the writes and deletes introduced are creating too much overhead, we'll settle for only cleaning up some of the thumbnails, which is absolutely fine. If we run into that situation, we may stay in it for a while, but it's already an improvement compared to the status quo. Maybe we will increase the Swift capacity to allow us to clean up everything, maybe we won't. The goal is freeing up space/reducing storage growth to introduce new features.

Without knowing the effects any cleanup strategy will have ahead of time, only doing a partial ongoing cleanup is the most likely scenario, until it's demonstrated that it's fine to do fully for 100% of thumbnails.

If we try this out on a portion of the traffic with headers, once we put a thumbnail in the experiment by setting a header, it needs to stay in it. Otherwise it might look expired when it isn't, if you only touch it randomly.

Being blind to Varnish is absolutely inconsequential when the hottest thumbnails will get renewed several times in the 30-day window, because they're guaranteed to fall out of Varnish multiple times, regardless of how hot they are. Any individual image can't stick around in the Varnish layers for more than a few days, even if hit constantly. By the time the object expirer gets to an object that hasn't been touched in 30 days, it's guaranteed not to be a hot thumbnail from Varnish's perspective. So you're getting that guarantee indirectly, without any need to record or parse any traffic stats, just by virtue of the relationship between the different expiry durations for the Varnish and Swift layers.

What matters is not cleaning up a thumbnail that's hot at the edge, and you get that by waiting long enough that it would have been fetched from Swift multiple times if that was the case. Once you reach 30 days of being untouched, you know for a fact that nobody has accessed it during that period minus the few days something can stay in Varnish.

I already have the whole Swift-based setup working in Vagrant as of earlier today, it'll be up for review soon, once I've cleaned up the hardcoded stuff. It's very simple, which is why I think it's the way to go for an initial attempt. Since we have no idea what the impact will be, it's also possible that Swift won't break a sweat with the extra header writes and deletes. I don't see the point of building a more complicated solution based on "what ifs". If this approach doesn't work out, then it makes sense to consider a more complex setup with more services involved.

Change 489021 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Make Swift renew X-Delete headers when encountered

https://gerrit.wikimedia.org/r/489021

Change 489022 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/software/thumbor-plugins@master] Set expiry headers on thumbnails

https://gerrit.wikimedia.org/r/489022

Peter added a subscriber: Peter.Feb 11 2019, 9:15 PM
aaron added a subscriber: aaron.Feb 16 2019, 8:49 AM

It seems that Swift has built-in support for object expiration, which can be requested by setting a header (either X-Delete-After or X-Delete-At).
It also looks like the expiry can be re-set, either by first removing it via X-Remove-Delete-At, and then setting it anew, or by updating the metadata in-place.
Is there a reason why these mechanisms are not under consideration?
A straw-man proposal:

  • On thumbnail creation, set X-Delete-After: 2592000 (one month).
  • Each time a thumbnail is retrieved, there's a 0.01% chance we also use the opportunity to reset the expiry to one month.

The simplest architecture really is to touch swift objects on every retrieval, which can be done async, but the unknown is how much extra load having every read causing a write would cause.

Heh, this sounded like déjà vu. Then I realized I had an old etherpad (https://etherpad.wikimedia.org/p/swift-thumbnails/timeslider#309) linked from T77697. Looking at that now, I notice that the "K hours" bits in that etherpad should really read "K days", since some non-trivial CDN/Swift TTL delta is just the bare minimum to be able to guess "hotness" in common cases... there is of course no reason for the delta to merely be hours given the backup/cold-cache use case. Also missing from that etherpad is handling for requests coming in *sooner* than the CDN expiry (e.g. what if the CDN was randomly cleared or several CDN nodes failed over?). I guess in that case the POST could be skipped and the same X-Object-Meta-CDN-Expires-At used for the Expires header. I suppose Varnish "collapsed forwarding"/"busy hashes" would avoid Swift POST stampedes for single URLs after CDN expiry. The added POST writes due to the size of the long tail of cache misses for thumbnails that *are* in Swift might be tolerable.

Anyway, for any non-percentage-based scheme to POST new Swift TTLs (X-Delete-At), I'm wondering about malicious cases of someone spamming requests with cache-busting URL parameters to millions of different thumbnails that already exist in Swift (thus they would not trigger thumbor/thumb.php throttling). Crawling some "most used files" list and hitting the standard sizes for thumbnails might be used for such malice. That could still cause unthrottled POST spam to Swift unless some rate-limiting is added.

Gilles added a comment.EditedFeb 16 2019, 4:11 PM

Good point. I think we just ought to close that loophole in Varnish, which is already removing the most obvious cache-busting attempts (GET parameters). If Varnish normalized the last part of the thumbnail URL, there wouldn't be a way to spam Swift for existing objects.
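As an illustration of what such normalization might look like, expressed in Python rather than the VCL that would actually run in Varnish; the URL pattern and function name are assumptions, not the deployed rule:

```python
import re

# Hypothetical pattern for Wikimedia-style thumbnail paths, e.g.
# /wikipedia/commons/thumb/a/ab/Foo.jpg/320px-Foo.jpg
# where the last segment must be exactly "<width>px-<original name>".
THUMB_RE = re.compile(
    r"^(?P<prefix>.*/thumb/.+/(?P<name>[^/]+))/(?P<width>\d+)px-[^/]+$"
)


def normalize_thumb_url(path):
    """Rewrite the last path segment to the canonical '<width>px-<name>'
    form, discarding any cache-busting suffix, so every variant of the
    same thumbnail request maps to a single Swift object."""
    m = THUMB_RE.match(path)
    if not m:
        return path  # not a thumbnail path: leave untouched
    return f"{m.group('prefix')}/{m.group('width')}px-{m.group('name')}"
```

With the last segment rebuilt from the original file name and requested width alone, appending junk to the thumbnail name no longer produces distinct backend requests.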

Change 489021 merged by jenkins-bot:
[mediawiki/vagrant@master] Make Swift renew X-Delete headers when encountered

https://gerrit.wikimedia.org/r/489021

Gilles changed the task status from Open to Stalled.Mar 21 2019, 9:49 AM
Gilles changed the task status from Stalled to Open.Jun 21 2019, 8:22 AM

@aaron your concern has been addressed now, the Varnish-level thumbnail URL normalization is live. We can now proceed with the header-based expiry plan.

@Gilles can it wait till next week when we will all be back, unless it is urgent, in which case we will figure it out

It can wait. Basically I want to figure out where we're at in regards to that patch, what's actually deployed and running.

Nevermind, this was the Vagrant patch... I'm going to make the production one now

Change 518226 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Have the Swift rewrite proxy renew expiry headers

https://gerrit.wikimedia.org/r/518226

Change 518226 merged by Effie Mouzeli:
[operations/puppet@production] Have the Swift rewrite proxy renew expiry headers

https://gerrit.wikimedia.org/r/518226

Change 519374 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Only apply expiry logic to "thumb" zone

https://gerrit.wikimedia.org/r/519374