(Idea by @BBlack. I'm mostly documenting the discussions.)
Currently, 81% of requests to the upload cluster are for thumbnails. The proposal is to serve these requests from the text cluster instead, which changes the separation from "text" vs. "upload" to "main" vs. "aux".
Why?
- Having one large cluster gives us elasticity: better tolerance of depooled hosts and traffic spikes.
- Thumbs are by nature smaller, and they are hurt by originals and other large files displacing them from memory. Moving them to text, assuming the same capacity, should raise their cache hit ratio.
- This would increase our resilience against scrapers, which mostly hit originals. That load can take down the whole upload cluster and degrade the user experience.
- Thumbs are mostly human traffic, while originals are mostly scrapers (47% of requests to originals have a browser score of 80 or above; for thumbs that number is 60%). This allows us to rate limit originals more strictly.
- It would also allow us to set a lower network QoS for upload, to avoid overwhelming backhauls when originals are being scraped to death, while still giving human traffic a first-class experience.
Technical implementation:
- We should set up a service in text, called thumb.wikimedia.org, and point it to the same backend in swift.
- This basically means we will switch URLs to our thumbnails from https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png to https://thumb.wikimedia.org/wikipedia/commons/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png
- I don't know how hard it would be to get rid of /thumb/ in the config. If it's used by swift, we can keep it; otherwise shorter URLs are better.
- Make sure our rate limits and requestctl actions have been added to text too.
- Make the MediaWiki change to use the new URL scheme instead of the old one for a configured ratio of files (similar to how we gradually rolled out the bump in default thumbnail size).
- Test, make sure it works, and slowly ramp up the percentage. Once in a while, move some cp hosts from the upload cluster to the text cluster.
- Migrate tools and gadgets that hit the thumbnail URL directly (this thankfully can be automated).
- Once done, simplify requestctl rules.
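To make the URL change and the gradual rollout more concrete, here is a rough Python sketch. It is illustrative only: the real change would live in MediaWiki (PHP) and its config. The names `to_thumb_url` and `in_rollout` are hypothetical, and hashing the file name to pick the rollout bucket is an assumption, not necessarily how the thumbnail size rollout was done. The sketch also assumes /thumb/ is dropped from the path, which is still an open question above.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "upload.wikimedia.org"
NEW_HOST = "thumb.wikimedia.org"  # proposed hostname from this task, not live yet


def to_thumb_url(url: str, drop_thumb_segment: bool = True) -> str:
    """Rewrite an old-style thumbnail URL to the proposed scheme.

    URLs that are not thumbnails on the old host are returned unchanged.
    """
    parts = urlsplit(url)
    # Expected path: /<project>/<wiki>/thumb/<h>/<hh>/<file>/<size>-<file>
    # which splits into ["", project, wiki, "thumb", ...].
    segments = parts.path.split("/")
    if parts.netloc != OLD_HOST or len(segments) < 5 or segments[3] != "thumb":
        return url
    if drop_thumb_segment:
        del segments[3]  # assumption: swift does not need /thumb/ in the path
    return urlunsplit(
        (parts.scheme, NEW_HOST, "/".join(segments), parts.query, parts.fragment)
    )


def in_rollout(file_name: str, percent: float) -> bool:
    """Deterministically decide whether a file gets the new URL scheme.

    The same file always lands in the same bucket, so its thumbnail URL
    stays stable (and cacheable) while the percentage is ramped up.
    """
    digest = hashlib.sha256(file_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10000  # 0..9999
    return bucket < percent * 100


old = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/"
    "Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png"
)
if in_rollout("Semi-protection-shackle.svg", percent=100):
    print(to_thumb_url(old))
```

The same rewrite function could be reused for the automated migration of tools and gadgets, since it leaves originals and non-upload URLs untouched.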
Notes:
- Moving to thumb.wikimedia.org also allows us to move the caching to somewhere else easily if/when the need arises (for example a fully dedicated edge cluster, etc.)
- That leaves upload with these: originals, transcodes, maps
- We could eventually move MPEG-DASH files to text too; they are similar in nature. I have no knowledge of the maps infra, so I can't say anything about moving those.