Reduce amount of thumbnail buckets
Closed, Resolved, Public

Description

We put this off before because we were waiting to have time to work on rectangular buckets, but the current bucket list is having a performance impact right now that could easily be fixed by merging some of the buckets.

Because some of the buckets are infrequently accessed, their thumbnails are more likely to have to be pulled from Swift instead of being found in Varnish. Requests that pull from Swift are 3 times slower on average. This means that we can afford to serve slightly larger images, for example, if doing so greatly increases their likelihood of being served from Varnish.

Here are some stats about which bucket sizes are most encountered in Varnish misses that need to pull the thumbnail from Swift:

SELECT EXP(AVG(LOG(event_contentLength))) AS avgsize,
       COUNT(*) AS count,
       event_imageWidth
FROM MultimediaViewerNetworkPerformance_11030254
WHERE event_type = 'image'
  AND event_varnish1hits = 0
  AND event_varnish2hits = 0
  AND event_varnish3hits = 0
  AND event_timestamp - event_lastModified > 60
GROUP BY event_imageWidth
ORDER BY count DESC

avgsize count event_imageWidth
177045.0831227364 12005 1024
115380.94911617086 10328 640
127670.17162169864 9549 800
243925.1228200158 9076 1280
463829.1836617856 4880 1920
43604.204761961795 1325 320
679914.8144343277 973 2560
734863.5032724637 217 2880
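The `EXP(AVG(LOG(...)))` expression in the query is a geometric mean, which is far less sensitive to a few very large files than a plain average. A minimal Python sketch of the same computation follows; the sample sizes are hypothetical and chosen only to show the effect of one outlier.

```python
import math

def geometric_mean(values):
    """Same computation as EXP(AVG(LOG(x))) in the query."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical content lengths in bytes (NOT real data):
# three typical thumbnails and one very large outlier.
sizes = [100_000, 110_000, 120_000, 5_000_000]

arithmetic = sum(sizes) / len(sizes)
geometric = geometric_mean(sizes)

print(f"arithmetic mean: {arithmetic:.0f}")
print(f"geometric mean:  {geometric:.0f}")
```

With these made-up values the arithmetic mean is dragged far above the typical thumbnail size by the single outlier, while the geometric mean stays much closer to it.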

The least controversial merges are thus probably:

640 & 800 => would only serve images that are 11% larger on average for hits that would have gone to 640.

1024 & 1280 => images that would have hit the 1024 bucket will be 38% larger on average, but 1024 is the biggest offender among sizes hitting Swift rather than Varnish.
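The 11% and 38% figures can be reproduced from the avgsize column of the stats above. A short Python sketch, where the helper name `size_increase_percent` is just an illustration:

```python
# Geometric-mean content lengths in bytes, copied from the stats above.
avg_size = {
    640: 115380.94911617086,
    800: 127670.17162169864,
    1024: 177045.0831227364,
    1280: 243925.1228200158,
}

def size_increase_percent(from_width, to_width):
    """How much larger the average image gets when from_width is served as to_width."""
    return (avg_size[to_width] / avg_size[from_width] - 1) * 100

print(f"640 -> 800:   {size_increase_percent(640, 800):.0f}% larger")
print(f"1024 -> 1280: {size_increase_percent(1024, 1280):.0f}% larger")
```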

I propose to merge 640 & 800 first and to measure the exact gain. Then we'll be able to assess whether merging 1024 & 1280 will be a net gain or not (merging them doesn't mean that they will always hit Varnish, meaning that the 3-fold perf advantage will be lower in practice).
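To illustrate why the gain is smaller than 3-fold in practice, here is a rough Python sketch of the expected request time as a function of the Varnish hit rate. The hit rates are hypothetical, not measured, and the model is an assumption: it only uses the "3 times slower" figure and ignores the extra transfer time of larger images.

```python
def expected_relative_latency(hit_rate, miss_penalty=3.0):
    """Average request time relative to a Varnish hit (a hit counts as 1.0).

    miss_penalty=3.0 reflects the "3 times slower on average" figure for Swift.
    """
    return hit_rate * 1.0 + (1.0 - hit_rate) * miss_penalty

# Hypothetical hit rates before and after merging two buckets (NOT measured).
before = expected_relative_latency(0.50)
after = expected_relative_latency(0.75)

print(f"before merge: {before:.2f}x a Varnish hit")
print(f"after merge:  {after:.2f}x a Varnish hit")
print(f"speedup:      {before / after:.2f}x")
```

Under this model the full 3-fold advantage is only reached when a bucket goes from never hitting Varnish to always hitting it, so any realistic change in hit rate yields a smaller speedup.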

Gilles created this task. Jun 17 2015, 8:05 PM
Gilles updated the task description.
Gilles raised the priority of this task to High.
Gilles claimed this task.
Restricted Application added a project: Multimedia. Jun 17 2015, 8:05 PM
Restricted Application added a subscriber: Aklapper.
Gilles updated the task description.
Gilles set Security to None.
Gilles added a subscriber: Tgr.
Gilles added a subscriber: BBlack.

Change 219018 had a related patch set uploaded (by Gilles):
Remove the 640 bucket

https://gerrit.wikimedia.org/r/219018

Krinkle moved this task from Inbox to Backlog on the Performance-Team board. Jun 18 2015, 12:13 AM

Change 219018 merged by jenkins-bot:
Remove the 640 bucket

https://gerrit.wikimedia.org/r/219018

Gilles moved this task from Backlog to Doing on the Performance-Team board. Jun 18 2015, 7:19 PM
Restricted Application added a subscriber: Matanya. Jun 26 2015, 6:11 AM
Gilles added a comment. Jul 6 2015, 4:46 PM

Some initial results:

Before:

avgsize count event_imageWidth
171037.1442673806 557 1024
112707.74436101079 439 640
135837.50678229952 431 800
245228.33825195447 388 1280
477496.2129087295 200 1920
42981.1881311492 67 320
660968.2979245667 52 2560
99744.73812967404 11 600
91681.75026639248 8 480
705133.1205430947 8 1600
149019.41989533562 8 1000
569738.5291030498 7 2880

After:

avgsize count event_imageWidth
142400.77423277826 559 800
185618.5487605837 406 1024
249683.7574359855 341 1280
478955.0522166842 159 1920
893628.6477293095 32 2560
39496.08148952671 31 320
112874.28366477707 22 640
107556.38973233252 9 600
490427.7210649032 8 1600
598187.8054669301 8 2880

It's surprising that we're still seeing a few hits on 640 at all, but then again that table is full of unexplained hits at random sizes that don't correspond to the buckets.

1024 is slower for no apparent reason. Looking at 800 relative to 1024, before the change 800 was taking 79.4% of the time 1024 was taking, and after the change it's taking 76.7%. At first glance it looks like a small win or no change. I'll re-run the calculations in a few weeks when we have more data, as 1024's worsening is a little puzzling, considering that the other sizes (320, 1280, 1920, 2880) are very similar to what they used to be.
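The 79.4% and 76.7% figures are the ratio of the first-column values for 800 and 1024 in the Before and After results. A small Python check, with variable names chosen for illustration:

```python
# First-column values for the 800 and 1024 rows, copied from the results above.
before = {800: 135837.50678229952, 1024: 171037.1442673806}
after = {800: 142400.77423277826, 1024: 185618.5487605837}

def ratio_percent(table):
    """Value for 800 expressed as a percentage of the value for 1024."""
    return table[800] / table[1024] * 100

print(f"before: {ratio_percent(before):.1f}%")
print(f"after:  {ratio_percent(after):.1f}%")
```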

BBlack added a comment. Jul 6 2015, 5:03 PM

It may be that there are 640 links in cached HTML, which could take ~30 days to fall out of varnish completely.

> It may be that there are 640 links in cached HTML, which could take ~30 days to fall out of varnish completely.

I'm looking at requests coming from Media Viewer, which requests sizes through some JS logic unaffected by page caching. I've just realized why we're still seeing these random sizes and 640. They come from images whose maximum (original) width happens to be 640, displayed for users who have a resolution greater than 640. In that situation it would be silly to round down to the smaller bucket, which is why we display the original. Expected behavior, then.
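The behavior described here can be sketched in Python as follows. This is not Media Viewer's actual code: the function name, the bucket list (taken from the widths in the stats, minus 640), and the exact rounding rule are assumptions made for illustration.

```python
import bisect

# Assumed bucket widths after the 640 bucket was removed.
BUCKETS = [320, 800, 1024, 1280, 1920, 2560, 2880]

def pick_thumbnail_width(display_width, original_width):
    """Hypothetical helper: choose the width to request for a display area."""
    # Smallest bucket at least as wide as the display area,
    # falling back to the largest bucket for very wide displays.
    index = bisect.bisect_left(BUCKETS, display_width)
    bucket = BUCKETS[min(index, len(BUCKETS) - 1)]
    # Never request more pixels than the original has: use the original width.
    return min(bucket, original_width)

# An original that is only 640 wide, shown in a 700 wide display area.
print(pick_thumbnail_width(700, 640))
# A large original in the same display area gets the next bucket up.
print(pick_thumbnail_width(700, 4000))
```

Under these assumptions, any original narrower than the chosen bucket is requested at its own width, which would account for the stray rows at 640, 600, 480 and similar non-bucket sizes.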

I'm still seeing the same discrepancies for sizes that should be unaffected in the period just before the change and since the change. It's quite possible that there's too much noise and the sampling is too low for us to make proper before/after conclusions. I'll wait a few more weeks just in case.

Gilles moved this task from Doing to Backlog on the Performance-Team board. Jul 21 2015, 8:47 PM
Gilles closed this task as Resolved. Aug 5 2015, 8:07 AM

Results are still inconclusive with the sampled data, so I'm closing this. We're now tracking unsampled data in Graphite, so the effect of future tweaks to the bucket list should be clearer.