
Cache thumbs in our caching infrastructure (e.g. ATS)
Open, Needs Triage, Public

Assigned To
None
Authored By
MatthewVernon
Aug 31 2023, 10:22 AM
Referenced Files
F49726172: top thumb hitters log-log full.png
May 2 2024, 1:43 AM
F43060390: long_tail_Theil_Sen.png
Mar 22 2024, 5:07 PM
F43060313: long_tail_extrapolation.png
Mar 22 2024, 5:07 PM
F43060232: long_tail_nonsense.png
Mar 22 2024, 5:07 PM
F43051129: predictions.png
Mar 22 2024, 3:14 PM
F43050901: test.png
Mar 22 2024, 3:14 PM

Description

To reduce the load on thumbor, we store generated thumbnails in swift. This is in theory a cache, but in fact there is no expiry of thumbnails, meaning that now about ⅓ of our swift capacity is spent on storing thumbnails - about 540TB across both data centres. As well as being an inefficient use of swift capacity, the very large number of objects involved (nearly 4 billion) means that the on-disk databases that correspond to these swift containers are large and unwieldy. A further problem is that we are storing thumbnails that relate to original objects that have been deleted (for copyright or other legal reasons), which we should not be keeping.

These are largely architectural consequences of using swift for a purpose to which it is not really suited - we are attempting to use a persistent storage system for caching. Instead, we propose to use our existing caching infrastructure to cache thumbnails; the eventual outcome being to remove swift from the thumbnailing process entirely. Instead, ATS will cache thumbnails, and get thumbor to (re-)generate thumbnails that are cache misses.

To achieve this, we propose that ATS gradually start caching thumbnails for longer; increased storage use (and reduced request rate to swift-and-thumbor) would be monitored until we reach the point where swift storage of thumbnails is no longer necessary, at which point ATS would talk to thumbor directly when it needed a thumbnail not already in cache. The end result would be simpler infrastructure, and more efficient use of swift capacity.

This is a proposal that arose out of discussion regarding T211661 (and the deficiencies of an approach based on simply turning on object expiry for thumbnails, or of updating an expiry date when a thumb is re-accessed); while we could try and write a new LRU-based expiry process as some sort of sidecar, it seemed better to see if we could use our existing caching infrastructure to cache enough thumbnails that we wouldn't need swift for this use any more (and potentially could take swift out of the thumbnail request path entirely).

Event Timeline

[I spoke to @KOfori about this, and they suggested opening a phab task tagged traffic was the best next step]

Vgutierrez subscribed.

Happy to provide assistance and guidance if needed but caching is technically controlled by the backend services and not by the CDN.
The CDN imposes some limits on what's cacheable and for how long (it will cap the TTL to 24h and flag it as uncacheable if it's bigger than 1Gb for example) but cacheability itself is managed by the Cache-Control header set by swift (the upload cluster doesn't talk to Thumbor directly).

Right now (just picked a thumbnail that's currently being rendered on /wiki/Main_Page for en.w.o):

vgutierrez@cp6001:~$ curl --connect-to upload.wikimedia.org:443:swift.discovery.wmnet https://upload.wikimedia.org/wikipedia/commons/thumb/0/04/General_Grant_National_Memorial_New_York_November_2016_003.jpg/162px-General_Grant_National_Memorial_New_York_November_2016_003.jpg -o /dev/null -v -s 2>&1 |grep -i cache-control
< Cache-Control: public, max-age=86400

swift is returning a Cache-Control header signaling to the CDN that the thumbnail can be cached for 24 hours. Requesting the same object from the CDN shows that the response is successfully being served from the CDN without impacting swift:

vgutierrez@carrot:~$ curl https://upload.wikimedia.org/wikipedia/commons/thumb/0/04/General_Grant_National_Memorial_New_York_November_2016_003.jpg/162px-General_Grant_National_Memorial_New_York_November_2016_003.jpg -v -o /dev/null -s 2>&1 |egrep "x-cache|age:"
< age: 39570
< x-cache: cp6003 miss, cp6003 hit/1840
< x-cache-status: hit-front
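
As a rough illustration of the TTL cap described in this comment, here is a minimal Python sketch. It is not the CDN's actual logic: the function name `effective_ttl` is hypothetical, only `max-age` is parsed, and treating a missing `max-age` as "do not cache" is an assumption made for the example.

```python
import re

TTL_CAP_SECONDS = 24 * 60 * 60  # the 24h cap mentioned in the comment above


def effective_ttl(cache_control: str, cap: int = TTL_CAP_SECONDS) -> int:
    """Return the TTL in seconds a capping cache would apply to a response.

    Hypothetical helper: takes the origin's max-age and clamps it to `cap`.
    """
    match = re.search(r"max-age=(\d+)", cache_control, re.IGNORECASE)
    if match is None:
        # Assumption for this sketch only: no max-age means "do not cache".
        return 0
    return min(int(match.group(1)), cap)


if __name__ == "__main__":
    # The header swift returned in the curl output above.
    print(effective_ttl("public, max-age=86400"))
    # A longer origin TTL would still be clamped to the cap.
    print(effective_ttl("public, max-age=2592000"))
```

Under this model, raising the cap is the only way for a longer origin `max-age` to have any effect.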

Thanks. Is there any way that TTL cap could be raised for thumbnails?

Cache revalidation can further extend this period. After the initial 24-hour limit has passed, ATS will issue a conditional request to the backend. If the backend supports it, a 304 response should be returned, eliminating the need to resend the object.

Right, and swift does say 304 quite a lot; but that isn't very helpful for thumbs - swift can only say 304 because it's storing all these thumbs indefinitely, and that's what we're trying to get away from. Likewise, thumbor itself couldn't say 304 without regenerating the thumbnail, rather defeating the object of the exercise. Hence my wondering if we could arrange for the CDN to allow a higher TTL cap for thumbs.

Okay, I might not know what's going on so my apologies if I'm misunderstanding something.

The front of the CDN is not the target here. We were told that the CDN backend (ATS? not the service backend like thumbor or swift) is persisted on SSD and can last longer. If that's the case, the plan would be something along the lines of caching in the backend (ATS, not thumbor) for a longer time than in the frontend. Let's say 1 month or 3 months, and then the front could cache for a day or a week.

That would require you to actually increase the size of the cluster but we probably can hand over a couple of swift hosts (as this is 1/3rd of the storage) and cut swift from the middle.

If that's not possible, we can start taking a look at building a caching cluster similar to ATS and use that instead of swift. Any ideas are more than welcome.

This topic probably deserves a ~hour meeting w/ Traffic to hash out some of the potential solutions and tradeoffs, but I'm gonna try to bullet-point my way through a few points for now anyways to seed further discussion:

  1. I don't think it would make much sense to try to add old Swift hardware into our ATS clusters. There's a lot of considerations there about the hardware config (what kinds of CPU, mem, and disks we use in the edge clusters), and also we'd have to do it at all 6 (soon to be 7) sites with equal storage in each (multiplying the needs + costs) and then we start also talking about rack space and power limits at the sites, our hardware refresh cycles, etc. I'm not opposed to expanding our ATS storage in the future if it makes sense in light of all tradeoffs, but it would probably have to happen holistically in terms of our overall edge design, and then gradually roll in over ~5 years of hardware refresh cycles. The nodes we're using now are maxed out in terms of off-the-shelf very-high-speed storage. We'd have to use slower storage to get more of it, at least (or move to a whole different kind of server node).
  1. The topic of expanding our ATS TTL caps beyond 24h is an intriguing one in general. It may statistically help with some of the relevant numbers here, but it's also mostly an orthogonal topic that we should be pursuing for all content (and has been a topic recently in our chat). TL;DR - Historically we had cache TTL caps that were up to a month long, but they were reduced years ago, mostly to avoid issues around the unreliability of content purging. We could probably bring them back up to at least a week or two, given that content purging is now more reliable (kafka + purged vs multicast + vhtcpd), and thus further reduce some of our 304 traffic. We can/should pursue this in parallel regardless. For reference, recently it takes ~2-3 weeks just to fill our ATS backend caches from an empty, cold state. Also keep in mind that our caches are distributed and separated - it's not one shared global edge cache, it's 48 of them (some sites use hashing to split, some are single-backend and do not). If we fast-forward a year or two (most sites in single-backend config, and our 7th site is online), we'd have 56 separate edge caches in the globe for this content. If a previously unknown/uncached thumb URL suddenly becomes hot (and is cacheable!), we'd expect all 56 of them to independently request it of Swift (or Thumbor).
  1. Our edge caches will always "just" be caches. The primary aim of this infrastructure is to reduce latency to the user and improve their experience, not to reduce load on interior clusters. Load reduction is a happy side effect we've come to rely on, perhaps to an unwise degree (because it can never be guaranteed that it will work all the time (attacks, odd patterns of abuse, etc)). The edge caches will also never capture all of the long tail, which is quite long in the case of commons media + thumbs.

In the general spirit of "we're always operating on the engines of the airplane while it's in flight", I'd also generally recommend against making a big sweeping change with unknown implications. We should probably attack this in a measured, gradual way so that we can have time to evaluate the impact of changes (including during rare attack / misuse events). The primary tradeoff here is basically going to be with Thumbor load. The fewer thumbs we have stored in Swift, the more thumbor load will increase, according to some curve that's difficult to predict. Perhaps we could hash the thumbnail URL misses in ATS and decide to send X% (starting very small) of them directly to thumbor instead of swift, as an experiment to see the load increases there? Even then, though, I'd be worried about cold caches and spikes melting thumbor in odd cases, at least until we have an easy way to operationally ramp in cold caches more slowly than we do today.
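
The "hash the URL and send X% to thumbor" experiment could look roughly like the Python sketch below. It only illustrates the idea and is not how ATS would be configured; the function name `route_miss`, the choice of SHA-256 and the 10,000-bucket granularity are all assumptions.

```python
import hashlib


def route_miss(url: str, thumbor_percent: float) -> str:
    """Pick a backend for a cache miss, deterministically per URL.

    Hypothetical helper: hashes the URL into one of 10,000 buckets and sends
    the lowest `thumbor_percent` percent of buckets to thumbor.
    """
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return "thumbor" if bucket < thumbor_percent * 100 else "swift"


if __name__ == "__main__":
    url = "/wikipedia/commons/thumb/0/04/Example.jpg/162px-Example.jpg"
    # The same URL always gets the same answer for a given percentage.
    print(route_miss(url, 1.0))
```

Because the decision depends only on the URL, raising the percentage keeps every URL already routed to thumbor on thumbor, so the experiment can be ramped up without reshuffling traffic.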

An interesting data point (that I didn't see directly in the other ticket, at least in a quick scan!) would be some idea of the curve of "it takes X amount of thumbnail storage to satisfy Y% of thumbnail requests from that storage" (which is applicable to any cache or media storage). It would give us an idea whether we're orders of magnitude off for any kind of scheme we might dream up here.
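
Given per-thumbnail request counts and sizes, the "X amount of storage satisfies Y% of requests" curve could be computed along the lines of this Python sketch. The function name `coverage_curve` and the input shape are assumptions, and ranking purely by hit count (rather than hits per byte) is a simplification.

```python
def coverage_curve(objects):
    """Build the storage-versus-requests-served curve.

    `objects` is an iterable of (hits, size_bytes) pairs, one per thumbnail.
    Returns a list of (cumulative_bytes, fraction_of_requests_served) points,
    assuming the most-requested thumbnails are stored first.
    """
    ranked = sorted(objects, key=lambda o: o[0], reverse=True)
    total_hits = sum(hits for hits, _ in ranked)
    curve = []
    stored = 0
    served = 0
    for hits, size in ranked:
        stored += size
        served += hits
        curve.append((stored, served / total_hits if total_hits else 0.0))
    return curve


if __name__ == "__main__":
    # Toy data only; real input would come from the request logs.
    sample = [(10, 100), (70, 100), (20, 100)]
    for stored_bytes, fraction in coverage_curve(sample):
        print(stored_bytes, fraction)
```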

Thanks for this; I agree that we should probably (virtually) sit down and talk about this; I wanted to try and make sure we had most of the obvious "we could answer this with data, but it'd take time" questions answered first in the hopes of making best use of everyone's time. A few brief follow-ups:

Apropos purging, swift has this problem writ large, in the sense that anything that isn't correctly purged will remain forever (and the idiosyncratic request path makes it hard to force a regeneration, since thumbor only gets involved via a custom 404 handler).

Definitely agree that gradual progress towards a new end-state rather than a flag day is the way to go (sorry if that wasn't clear enough in my initial writing).

Now thumbor is on k8s, we presumably have greater flexibility as to the amount of resource available to it (though obviously we still have finite computing resource, and we can't have "use k8s for spikes" as a general policy across services since those spikes are likely to be correlated).

I'm not quite sure I see how one might approach your final question statistically. I could tell you what proportion of requests (on our sample day) were 304s, which would give you some idea of the impact that increasing the TTL for thumbs has on requests to swift. I can tell you that on our sample day, we served 4.5% of our stored thumbnails; I could pick another sample day and repeat the analysis (and also see how much overlap there was), but I still don't think that's helpful in answering your question...

> An interesting data point (that I didn't see directly in the other ticket, at least in a quick scan!) would be some idea of the curve of "it takes X amount of thumbnail storage to satisfy Y% of thumbnail requests from that storage" (which is applicable to any cache or media storage). It would give us an idea whether we're orders of magnitude off for any kind of scheme we might dream up here.

@BBlack I don't know if you had any thoughts about how we might answer your question [for reasons I note in my previous comment, I'm not sure it's straightforward]? I wanted to try and arrange that relevant questions-we-could-answer-from-data were answered before arranging a meeting to talk about this (in the hopes of making that discussion as well-informed as possible).

Midleading changed the task status from Open to Stalled. Jan 19 2024, 4:31 AM
Midleading subscribed.

Thumbor is currently heavily overloaded (T337649). As a result, traffic to thumbor should be reduced as much as possible until things improve.

> Thumbor is currently heavily overloaded (T337649). As a result, traffic to thumbor should be reduced as much as possible until things improve.

Thumbor is not currently overloaded to the best of our knowledge. If you are seeing issues, please file a ticket with specific details.

taavi changed the task status from Stalled to Open. Jan 19 2024, 12:58 PM

I decided to take a look at numbers of top hitters up to the point it would fill ATS (I went to top 5m objects until it borked). Unfortunately it's so big that even big data hadoop is like "nope I can't do that" and I respect that.

If you want to try and break hadoop yourself, it'd be something like this:

select sum(hitcount), sum(rp_size) from (select cache_status, count(*) as hitcount, mean(response_size) as rp_size from wmf.webrequest where webrequest_source = 'upload' and year = 2024 and month = 2 and http_status = 200 and uri_path like '/wikipedia/%/thumb/%' and cache_status in ('hit-local', 'miss') group by uri_path, cache_status order by hitcount desc limit 50000);

Another way to do this is to get top 10,000 and extrapolate. Let me try that.

One thing that was discussed at the SRE meeting in Warsaw was looking at turnilo data (which IIRC is the last 90 days' requests) to effectively simulate a cache and ask questions about the relationship between cache size/age and hit/miss ratios and so on.
[which might be a useful KR for the forthcoming quarter]
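
Replaying a request log through a simulated cache, as suggested here, might look like the Python sketch below. It models a plain size-bounded LRU cache with no TTLs or purges, which is an assumption and much simpler than ATS; the function name `simulate_lru` is hypothetical.

```python
from collections import OrderedDict


def simulate_lru(requests, capacity_bytes):
    """Replay requests through a size-bounded LRU cache.

    `requests` is an iterable of (key, size_bytes) in arrival order.
    Returns (hits, misses). No TTL or purge handling in this sketch.
    """
    cache = OrderedDict()  # key -> size, least recently used first
    used = 0
    hits = 0
    misses = 0
    for key, size in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
            continue
        misses += 1
        if size > capacity_bytes:
            continue  # object can never fit, so it is not cached
        cache[key] = size
        used += size
        while used > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)  # evict the LRU entry
            used -= evicted_size
    return hits, misses


if __name__ == "__main__":
    log = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1)]
    for capacity in (2, 3):
        print(capacity, simulate_lru(log, capacity))
```

Running the same log at several capacities gives one point per capacity on the cache-size versus hit-ratio curve.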

So I looked at some numbers for February:

ladsgroup@stat1005:~$ spark3-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64 -e "select count(distinct uri_path) as hitcount, mean(response_size) as rp_size from wmf.webrequest where webrequest_source = 'upload' and year = 2024 and month = 2 and http_status = 200 and uri_path like '/wikipedia/%/thumb/%' and cache_status in ('hit-local', 'miss') group by cache_status order by hitcount desc limit 500;"

Yields:

hitcount	rp_size
588925412	72361.5575153386
196828030	54660.057726232684
Time taken: 2509.541 seconds, Fetched 2 row(s)

That basically means that if we want to absorb all swift hits for a month, we need roughly 10TB more storage for each ATS.
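
The "roughly 10TB" figure can be checked with a few lines of Python. This assumes the second output row is the `miss` row (the query output does not label the rows), and it multiplies the distinct-URL count by a mean response size that is averaged over requests rather than over distinct objects, so it is only an approximation.

```python
# Second row of the query output above; assumed to be cache_status = 'miss'.
distinct_thumbs = 196_828_030
mean_size_bytes = 54_660.057726232684

total_bytes = distinct_thumbs * mean_size_bytes
total_tb = total_bytes / 1e12     # decimal terabytes
total_tib = total_bytes / 2**40   # binary tebibytes

print(f"{total_tb:.2f} TB ({total_tib:.2f} TiB)")
```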

It's a bit hard to measure in terms of cutoffs since we would have to keep track of 600M image names and their hitcounts. I tried getting the top 50K and extrapolating the rest:

ladsgroup@stat1005:~$ spark3-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64 -e "select count(*) as hitcount, mean(response_size) as rp_size from wmf.webrequest where webrequest_source = 'upload' and year = 2024 and month = 2 and http_status = 200 and uri_path like '/wikipedia/%/thumb/%' and cache_status in ('hit-local', 'miss') group by uri_path, cache_status order by hitcount desc limit 50000;" > top_thumb_hitters_T345334

It looks like this:

test.png (480×640 px, 16 KB)

(the part after 50K is just me trying a linear extrapolation)

The problem is that it has such a long tail that any prediction goes to 0 hits after 1M images. If you give it the whole dataset (of 50K), the spline predictions become nonsensical; if you include only the 20,000th item onwards, it's better but still hits zero rather quickly:

predictions.png (969×960 px, 37 KB)

A linear regression of the 49,000th images hits zero after 1M images.

Let me re-do the calculation with only swift hits and see what would be the extra. That way calculations won't become so absurd.

So, for "miss" (= swift/thumbor hits): the top hitter gets 750 hits in the whole month, and it quickly settles to ~130 a month. This makes any spline extrapolation nonsensical as well:

long_tail_nonsense.png (969×1 px, 51 KB)

If we do the extrapolation after the 10,000th hit, the Theil-Sen extrapolation becomes more useful:

long_tail_extrapolation.png (969×1 px, 45 KB)

But that also reaches zero after 5M (while we know we have 200M different thumbnails to swift/thumbor):

long_tail_Theil_Sen.png (969×1 px, 34 KB)

So the simplest is to see it as a boundary problem: say it's 130 hits at zero and zero at 200M, and draw a line.
That means the first extra TB will provide storage for 15.8M more thumbs (the average size being 68KB) and absorb ~1.95B hits to swift monthly (780 req/s); that alone should halve the hits to swift. The second TB absorbs 1.6B hits, and so on.

From what I'm seeing, the long tail is too big to be solvable by 1 or 2TB of extra storage; even with 2TB extra, thumbor would still melt (it would require 6x more hw to be able to keep up) if we remove swift from the equation (and I'm not taking the refreshes of these caches into account).
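
The straight-line boundary model from this comment can be written out as a short Python sketch. The constants come from the thread; reading "TB" and "68KB" as binary units is an assumption, the function names are hypothetical, and the numbers this produces need not match the ones quoted above, which may rest on different rounding.

```python
TOP_HITS = 130.0             # monthly hits at rank 0 (from the thread)
TOTAL_THUMBS = 200e6         # rank at which hits reach zero (from the thread)
AVG_SIZE_BYTES = 68 * 1024   # 68KB average thumb; binary units are an assumption
TB_BYTES = 2**40             # one TB read as a tebibyte, also an assumption


def thumbs_per_tb():
    """Number of average-sized thumbs that fit in one TB."""
    return TB_BYTES / AVG_SIZE_BYTES


def absorbed_hits(first_rank, last_rank):
    """Monthly hits for ranks in [first_rank, last_rank).

    Integrates the line hits(r) = TOP_HITS * (1 - r / TOTAL_THUMBS).
    """
    def cumulative(rank):
        rank = min(rank, TOTAL_THUMBS)
        return TOP_HITS * (rank - rank * rank / (2 * TOTAL_THUMBS))
    return cumulative(last_rank) - cumulative(first_rank)


if __name__ == "__main__":
    n = thumbs_per_tb()
    seconds_in_month = 29 * 86400  # February 2024
    for tb in range(3):
        hits = absorbed_hits(tb * n, (tb + 1) * n)
        print(f"TB #{tb + 1}: {hits:.3g} hits/month, {hits / seconds_in_month:.0f} req/s")
```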

> If we do the extrapolation after the 10,000th hit, the Theil-Sen extrapolation becomes more useful:
> So the simplest is to see it as a boundary problem: say it's 130 hits at zero and zero at 200M, and draw a line.

If you sort images by popularity, and draw a chart with the rank of each image on the horizontal axis, and the hit rate of the image on the vertical axis, then a power law should be a good first approximation to the shape of the curve. If you take the logarithm of both axes, you should have a straight line. Then you can do your linear regression. It doesn't make sense to do a linear regression on plain hit rate.

You'll have something like y = c x^-k. Take the log of both sides and you have log y = log(c) - k log(x). That's a line where the slope is -k and the y-intercept is log(c).

If you want to know the miss rate if you limit thumbnails to say 10M images, as a first approximation you can just integrate the power law model from a rank of 10M to infinity. There's no need for discontinuities since the integral converges.

That'd work on overall hits, as you said "sort images by popularity". That's not the case here. Front caches absorb all of the hits and collapse the power law distribution. The highest number of hits to backend caches is quite small, and the distribution is more linear once you take out some images that for whatever reason were purged a lot or were not cached on the front caches (size, etc.)

> That'd work on overall hits, as you said "sort images by popularity". That's not the case here. Front caches absorb all of the hits and collapse the power law distribution.

In the model I'm proposing, the power law distribution applies to the frontend request stream. The caches absorb all hits with a rank less than the cache size, and thumbor sees the remaining requests, which is still a power law.

I tried making a log/log plot of your file top_thumb_hitters_T345334:

top thumb hitters log-log full.png (807×1 px, 31 KB)

but the slope parameter k is less than 1, so it's not converging. It needs to be greater than 1 to be integrable. The slope at log(rank) > 8, taking data from rank 3000 to 50000, is around -0.1, so it's a long way off.
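
For reference, the convergence condition can be written out explicitly. With the power law model y = c x^{-k} and a cache holding the top N ranks, the requests left over for thumbor are the tail integral (treating rank as continuous, which is an approximation):

```latex
\int_{N}^{\infty} c\,x^{-k}\,dx
  = \left[\frac{c\,x^{1-k}}{1-k}\right]_{N}^{\infty}
  = \frac{c\,N^{1-k}}{k-1}
  \qquad \text{for } k > 1 .
```

For k \le 1 the term x^{1-k} (or \ln x when k = 1) grows without bound, so the integral diverges, which is the situation described here for a fitted slope of around -0.1.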

Due to T266155, I have to keep refreshing the category page, about 5~10 times, until all 200 thumbnails are generated. Therefore some "cache hits" are actually caused by cache misses. This is especially true for newly uploaded files, or when thumbnails are deleted but Thumbor fails to regenerate them.