
Investigate Swift replacement for thumbnails
Closed, Resolved · Public

Description

Swift appears to be the main cause of sub-par thumbnail performance for images that fall out of the varnish cache. In addition, we currently store 3 replicas of each Swift file, which is a considerable waste of storage.

Ideally a replacement for thumbnail storage should have these properties:

  • Its number one quality should be speed when pulling a file. That's all that matters: serving a file to varnish as fast as possible.
  • It doesn't matter if it loses content, since we can always re-generate thumbnails. No replication is required.

Contrary to what we've said before on this topic, I don't think that items should expire, because the performance issue we most often blame Swift for involves files that are infrequently accessed. If we applied an expiry to our thumbnail store, it would render the store useless: by the time a thumbnail falls out of varnish, it would also be gone from thumbnail storage. The whole point of the thumbnail store should be to keep copies of infrequently accessed thumbnails that would be costly to regenerate (it could even be smart and skip storing things that never fall out of varnish because they're accessed so much).

Another naive view could be that we might as well always render thumbnails on the fly and have varnish be our only caching layer. We could do that, but the original still has to be pulled from Swift (we need replication to avoid losing originals), which for large originals can be pretty slow. In that situation too, it seems like more time is usually spent pulling the original from Swift and shipping it across the network than actually doing the image processing to generate the thumbnail (this would have to be verified, but I recall surprisingly poor performance for pulling large files from Swift).

Details

Related Gerrit Patches:
operations/puppet (production): Set thumbnails varnish TTL to 90 days

Event Timeline

MingleTerminator raised the priority of this task to High. Dec 8 2014, 6:24 PM
MingleTerminator added a project: Multimedia.
Gilles claimed this task. Jun 17 2015, 8:23 AM
Restricted Application added a subscriber: Matanya. Jun 17 2015, 8:23 AM
Gilles renamed this task from "Replace SWIFT for thumbnails" to "Investigate SWIFT replacement for thumbnails". Jun 17 2015, 8:24 AM
Gilles edited projects, added Performance-Team; removed Multimedia.
Gilles set Security to None.
Restricted Application added a project: Multimedia. Jun 17 2015, 8:24 AM
Gilles renamed this task from "Investigate SWIFT replacement for thumbnails" to "Investigate Swift replacement for thumbnails". Jun 17 2015, 8:28 AM
Gilles updated the task description. Jun 17 2015, 8:34 AM
Gilles removed a project: Multimedia.
Restricted Application added a project: Multimedia. Jun 17 2015, 8:34 AM
Gilles updated the task description. Jun 17 2015, 8:35 AM
Gilles added a subscriber: faidon. Jun 17 2015, 8:55 AM

Some useful information: varnish misses currently account for 36% of image hits in Media Viewer. Of these, at least 85% had already been generated and are merely pulled from Swift.

@faidon how expensive would it be to throw more RAM at the thumbnail varnishes and/or increase the expiry duration (I'm not sure exactly how infrequently accessed thumbnails end up evicted from varnish)? If we somehow managed to bring the varnish miss figure down considerably, to 1% for example (considering that the status quo is 0.6% of image hits having to be generated from the original on the fly and stored in Swift), then we could consider the seemingly silly idea of not having a thumbnail store at all. Thumbnails would either be in varnish or generated on the fly.

Its number one quality should be speed when pulling a file. That's all that matters: serving a file to varnish as fast as possible.

Honestly, I'm not sure I agree with that. If this task is of an exploratory nature, I think a solution that does not involve Varnish at all should also be considered. Serving files off a disk is simple and can be fast enough that adding the complexity of a couple of layers of Varnish might not be worth it; the fast disk storage Varnish uses could be used by such a system instead (e.g. with a fast webserver such as nginx). Thumbs can also be treated as ephemeral cache material and be distributed from caching PoPs, despite the legal and hardware restrictions that these sites usually have.

Note that I'm not actually saying that I have this all figured out or that I know this is a better solution; I'm just saying that it's worth not being constrained by a requirement that may, in the end, be arbitrary :)

As a side note, I've looked at clustered file systems and quite expectedly they all care deeply about replication and making sure that they don't lose data, which in our case would certainly be overhead.

Change 218858 had a related patch set uploaded (by Gilles):
Set thumbnails varnish TTL to 90 days

https://gerrit.wikimedia.org/r/218858

BBlack added a subscriber: BBlack. Jun 17 2015, 11:48 AM

Moving the conversation back here, since gerrit-review on the above patch isn't the place for it:

Set thumbnails varnish TTL to 90 days
In order to verify whether or not we could have a single caching layer
for thumbnails, we can use the existing setup with Varnish to see if
a longer TTL for thumbnails could achieve an acceptable hit ratio.
At the moment varnish thumbs only have a hit ratio of 64% (across all DCs). I suspect that this is mainly due to the TTL and the fact that many of our thumbs are infrequently accessed (with some sizes seen by users less often than every 30 days). Bumping the TTL would allow us to verify that theory, or show us that it's a capacity problem and LRU eviction is responsible, which will inform our decision about the capacity required for a new and improved thumbnail cache/store.

I don't really follow the logic of how TTL relates to the number of caching layers we have, or how anything about one would say much definitively about the other. I don't tend to think tripling the Varnish-level TTL is going to have any pragmatic effect on the caches' performance one way or another. Could you expound a bit here on what stats you're looking at (for 64% on thumbs), and whether that's differential to other objects cached in the upload cluster? Also, what's the connection between increasing the maximum TTL and divining whether there's a true capacity "problem" at the varnish layers? Our total capacity in the varnish-be layers is on the order of ~12TB of varnish cache storage.

The point of our varnish layers isn't to be a primary store, or to paper over gross performance or reliability issues in the underlying source. It's just a cache, and IMHO it's doing a fine job at what looks to me like a ~98% global hitrate for requests. These are captures from ganglia of the client request rate to all global frontends, vs the request rate coming out the back side of the eqiad backends to fetch from beneath varnish, showing respective daily peaks on the order of 80K vs 1K. I just don't think whatever problem we're looking for lies here.

{F180222} {F180224}

When users request thumbnails, they get a hit in varnish 64% of the time; 36% of the time, they don't. The backend then looks in Swift first, and 85% of the time the thumbnail is there. Only for the remaining 15% does the thumbnail have to be generated by the image scalers (pulling from Swift is pretty slow; it wasn't designed to be fast). Applying those percentages to all image requests, here's the breakdown:

64% found in varnish
31% found in swift
5% generated by image scalers (which pulls the original from swift + stores the resulting thumb in swift in addition to serving the thumb)

What would happen if we got rid of Swift for storing thumbnails right now? Image scalers would get about 7 times more thumbnail requests (going from handling 5% of thumbnail requests to 36%). The bottleneck is most likely going to be network requests to Swift to fetch originals, not CPU. If you think the image scalers can handle 7 times the traffic to Swift network-wise, then great, let's stop storing thumbnails in Swift right now. Given that last year we ran into situations where batches of large originals were choking the pipe, I figured that wasn't an option, but maybe I'm wrong.

Now, something quite important in the figures above is that 95% of the time thumbnails are either found in varnish or swift. If they're found in swift, they've been in varnish before, they've just been evicted since. I suspect that this is because some thumbnails, especially at particular sizes, are visited just infrequently enough (less than every 30 days on average, I guess) that they keep falling out and having to be picked up from swift. That's inevitable in the grand scheme of things, there are always going to be infrequently visited thumbnails, but this happening for 31% of image requests seems like a bit much.

Another angle we could look into, which I was considering back in my multimedia days, is to reduce the number of thumbnail size buckets used by Media Viewer. Some of the smallest and highest resolutions are probably amplifying the phenomenon on images that see little traffic. I'll get fresh figures on that.

Overall, I'm interested in whatever can move the needle on that figure of pure varnish hits that don't need to go to swift, as it would allow us to get rid of swift thumbnail storage in a very simple manner. And once all we have is Varnish and image scalers behind it, we can also consider using another cache than Varnish for thumbnails if another technology is more appropriate, and we'll know which TTL value to use to make that cache effective.

The experiment of raising the TTL for thumbnails on varnish will show us whether we could simply have Varnish with image scalers behind it, with little risk of a vast increase in requests to the image scalers when making that switch. I wouldn't consider Varnish to become the primary store in that case. It is just the cache; the primary store is re-rendering the thumbnails when we need them. We'd just be getting rid of the secondary store, Swift.

Some stats on performance, calculated as the geometric mean of event_total in the latest MultimediaViewerNetworkPerformance EL table. These figures are measured on the users' client side and include the speed of their own connection.

Thumbnails found in varnish are served to users in 276ms on average
Thumbnails found in swift in 765ms
Thumbnails (re-)generated by image scalers in 2504ms

It's clear that it's a bad idea performance-wise to increase the 5% share going to the image scalers. If we get rid of Swift storage for thumbnails, though, that last figure of 2504ms would go down a little, since those requests wouldn't need to spend time storing the resulting thumb in Swift anymore.
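For reference, those averages can be derived from the same EventLogging data with a query along these lines (a sketch only: MySQL has no built-in geometric mean, so EXP(AVG(LN(...))) stands in for it, and since the varnish hit counters alone can't tell Swift hits apart from scaler renders, this only splits varnish hits from misses):

-- Sketch: geometric mean load time (ms) for varnish hits vs. misses
SELECT
  (event_varnish1hits > 0 OR event_varnish2hits > 0 OR event_varnish3hits > 0) AS varnish_hit,
  EXP(AVG(LN(event_total))) AS geo_mean_ms,
  COUNT(*) AS sample_size
FROM MultimediaViewerNetworkPerformance_11030254
WHERE event_type = 'image'
  AND event_total > 0 -- LN() is undefined at 0
GROUP BY varnish_hit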

Tgr added a subscriber: Tgr. Jun 17 2015, 8:05 PM

It would be interesting to know how the size distribution differs, i.e. is the scaler overhead similarly large on smaller images? For a large image we generate a chain of standard sizes (I think... or was that disabled eventually?); keeping those in Swift but regenerating the final thumbnail on the fly might get rid of most of that overhead.

Gilles added a comment (edited). Jun 17 2015, 8:32 PM

We couldn't do chaining in the end because it messed with Commons' sacred JPG sharpening. It would have to be revisited to emulate the same sharpening at given sizes.

The scaler overhead varies wildly between bucket sizes, but I think there's a simple explanation: huge images make it possible to generate the largest buckets, while many originals actually have a width that falls somewhere inside the bucket range.

Maybe we could revisit chaining and just use it to generate a single reference image for thumbnails, say 4096 pixels wide, when the original is simply gigantic and would otherwise disproportionately use scaler resources to generate the thumbs.

When users request thumbnails, 64% of the time they get a hit in varnish

Where does this stat come from? It doesn't jibe with the overall hitrate I mentioned earlier. If it's because thumbnails and other things served by the upload varnishes (primary images?) have different rates, we can talk about the possibility of isolating them so that one doesn't stomp the other out of the caches.

Good thing you asked, since that made me scrutinize my queries a bit more, and I found a mistake. But as you'll see, the figure is still quite different from the global hit rate.

My data comes from the varnish headers measured directly on the clients, sent through EventLogging to the MultimediaViewerNetworkPerformance_11030254 table on the analytics DB. It's sampled, but the sample is large, so the ratios shouldn't be affected.

It is possible that thumbnails have different hit rates than the rest. How difficult would it be to isolate thumbnails in their own cache?

I re-ran the queries and realized that I hadn't accounted for NULL entries (oops!) when the headers are absent or can't be read.

SELECT COUNT(*) FROM MultimediaViewerNetworkPerformance_11030254
WHERE event_type = 'image'
  AND event_varnish1hits IS NOT NULL
  AND event_varnish2hits IS NOT NULL
  AND event_varnish3hits IS NOT NULL

402408

SELECT COUNT(*) FROM MultimediaViewerNetworkPerformance_11030254
WHERE event_type = 'image'
  AND event_varnish1hits = 0
  AND event_varnish2hits = 0
  AND event_varnish3hits = 0

67437

(67437 / 402408) = 0.167

Meaning that 16.7% of image requests don't find a hit in varnish.
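Equivalently, the two counts can be folded into a single ratio query against the same table (a sketch, with the same filters as above):

-- Sketch: varnish miss ratio among requests with readable hit counters
SELECT
  SUM(event_varnish1hits = 0 AND event_varnish2hits = 0 AND event_varnish3hits = 0) / COUNT(*) AS miss_ratio
FROM MultimediaViewerNetworkPerformance_11030254
WHERE event_type = 'image'
  AND event_varnish1hits IS NOT NULL
  AND event_varnish2hits IS NOT NULL
  AND event_varnish3hits IS NOT NULL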

Correcting the figures I calculated earlier, that means thumbnails are found:

83.24% of the time in Varnish, served in 276ms on average
16.62% of the time in Swift, served in 765ms on average
0.14% of the time rendered on image scalers, served in 2507ms on average

This makes more sense, since we're looking at Media Viewer figures and the thumbnail sizes used by Media Viewer should be prerendered at upload time.

If isolating thumbnails in their own Varnish cache could bring the hit rate to 98%, it would be an instant performance win and put us in a position where we can gradually stop looking in Swift for thumbnails and see if the image scalers handle the extra load.

I don't know if that would be the case (98% for thumbnails). But if thumbnails are only at 83%, then the primary images must be at 99%+, and it's possible that the storage/hotness of the primaries has an impact on the cache hitrate of the thumbnails.

Just to make sure we're on the same page: of the varnish misses (~16%), you're saying that since most of them exist in Swift, they have definitely been hit through varnish before in the past, or they wouldn't have been generated and stored to Swift?

(I should have added above): Or are some of them these pre-renders, where we pre-rendered to Swift before the first user ever tried to hit it?

Yes, exactly. Thumbnail requests always go through Varnish first, including in the pre-rendering case. If they're in Swift, they're guaranteed to have been in Varnish at least once before.

Gilles added a comment (edited). Jul 9 2015, 11:40 AM

@BBlack are you ok with isolating thumbnails into their own varnish cache as a first step, to see if the hit rate improves? Should I make a separate task for that?

Not at this time, no. I think that "experiment" would be premature. Splitting thumbnails to their own varnish-level caching would be a pretty massive project with public-facing effects, starting with the splitting of public URLs into e.g. upload.wm.o for primaries and thumbs.wm.o for thumbs, and then rippling on downwards into new LVS addresses, new varnish clusters, and probably new hardware if we don't want to lose the hitrate we have on the primaries during the experiment.

I think an initial step would be to even prove that this problem would be helped by such a move. Try to figure out what percentage of the cache space is being used by primaries vs thumbs. If thumbs are already the majority of the space (rather than the hits), I don't know that splitting them and keeping everything else the same is really going to change things much.

Mostly, I think @faidon's musings earlier in the ticket are the right path to pursue: fix the way we generate/store/deliver thumbs at the beneath-varnish layers (swift/scalers) first. Ultimately that could be done much better with a simpler model for storage and serving on a filesystem. If we had a good design for this, we might find that varnish hitrate isn't as critical as it was, and/or isn't necessary at all (in which case we could perhaps repurpose the hardware to the new solution).

aaron added a subscriber: aaron (edited). Jul 17 2015, 8:48 PM

Any replacement should focus both on getting a high CDN hit rate *and* on making purging work properly. A few proposals discussed IRL that avoid or reduce the use of Swift:

a) Do not use Swift for thumbnails at all; track generated thumbnails in some other DB and use that to get the list of CDN URLs to purge (a hypothetical sketch follows this list).
b) Only use Swift for reference-size thumbnails (in case of CDN data loss), and scale non-standard sizes up on demand with a very low CDN TTL. Purges only need to go to the a priori reference URLs, since nothing else is cached for any real length of time (e.g. seconds), so no thumbnail lists are needed. Non-standard sizes would not be used by MW itself and would be severely discouraged, having to hit thumb.php with minimal CDN caching (seconds).
c) A version of (b) with one difference: the scaling would be done edge-side, with a layer running at each CDN site that tries to get the reference thumbnails from the local CDN (falling back to MW/Swift) and scales them to the requested size. This whole layer would need a fair amount of coding.
d) Use something like Varnish hashtwo or vcl_hash to bucket thumbnails in the CDN to allow "wildcard" purging of all of a file's thumbnails at once
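To make (a) more concrete, here is a hypothetical sketch of such a tracking table; all names are illustrative, this is not an existing schema:

-- Hypothetical: one row per generated thumbnail, so that all of a file's
-- CDN URLs can be listed and purged together when the file changes
CREATE TABLE thumbnail_tracking (
  tt_file  VARBINARY(255)  NOT NULL, -- source file name, e.g. 'Foo.jpg'
  tt_width INT UNSIGNED    NOT NULL, -- rendered thumbnail width in pixels
  tt_url   VARBINARY(1024) NOT NULL, -- full CDN URL to purge
  PRIMARY KEY (tt_file, tt_width)
);

-- On file update/delete: list the file's CDN URLs, send the purges, drop the rows
SELECT tt_url FROM thumbnail_tracking WHERE tt_file = 'Foo.jpg';
DELETE FROM thumbnail_tracking WHERE tt_file = 'Foo.jpg';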

Tgr added a comment. Jul 17 2015, 9:18 PM

d) Use something like Varnish hashtwo or vcl_hash to bucket thumbnails in the CDN to allow "wildcard" purging of all of a file's thumbnails at once

That would be extremely valuable for API cache invalidation as well, which is not really possible ATM.

Restricted Application added a subscriber: Steinsplitter. Aug 5 2015, 8:07 AM
Gilles added a comment. Aug 6 2015, 7:18 PM

@aaron see presentation draft above

Gilles added a comment. Sep 1 2015, 2:23 PM

See T110858 for what is the most promising idea to pursue, imho.

Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board. Sep 4 2015, 6:45 PM
Gilles closed this task as Resolved. Sep 7 2015, 7:45 PM

Following the very useful brown bag, I've decided to pursue this issue further with a specific architecture that seems to solve all of our problems, by implementing it on a VM as a working prototype to push towards a production solution: T111718: Service-based thumbnailing re-architecture on Vagrant

Change 218858 abandoned by Yuvipanda:
Set thumbnails varnish TTL to 90 days

Reason:
Abandoning as per my reading of the ticket. Do restore if that is wrong. Thanks.

https://gerrit.wikimedia.org/r/218858