Delete non-used and/or non-requested thumbnail sizes periodically
Closed, Resolved (Public)

Description

As of April 2017, thumbnails stored in Swift are never deleted unless purged by MediaWiki. This is useful in the general case but wasteful long-term; there are likely a lot (TBD) of thumbnails we can delete according to various criteria. For example:

  • Sizes for which we only have a few thumbnails stored, e.g. fewer than 200 for a given size
  • Sizes that haven't been requested in the last 90 days
  • Obviously huge sizes, e.g. wider than 10^5 pixels

MediaWiki doesn't know anything about which sizes have been stored or requested, therefore we'll have to walk the Swift containers for deletion candidates. We can reuse thumb-stats from operations/software, which does exactly this: it walks all thumbnail containers and extracts statistics about file size, dimensions and so on.

note the "pixels" and "size" in this context refer to thumbnail width, i.e. the pixel count that shows up in thumbnail URL

Event Timeline

I started doing some analytics with Hive on webrequest data for upload; I'm reporting the queries here for reference. Note that running a query over a month of data took ~1h; writing the results into another table allows for faster querying/processing later.

create table filippo.thumb_pixels_201704  as select regexp_extract(webrequest.uri_path, '.*/.*?(\\d+)px-.*', 1) as pixels, response_size from wmf.webrequest where webrequest_source='upload' and year = 2017 and month = 4  and uri_path like '%/thumb/%px-%';

insert overwrite directory '/user/filippo/thumb_pixel' row format delimited fields terminated by ' ' select pixels, count(pixels) as count, avg(response_size) as avg_size from thumb_pixels_201704 group by pixels;
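
For reference, the same width extraction outside of Hive: a quick Python check of what that regexp_extract pulls out of a thumbnail URI path (the sample path is made up):

import re

# Same pattern as the regexp_extract above: capture the width preceding "px-".
PATTERN = re.compile(r'.*/.*?(\d+)px-.*')

# Made-up example path following the upload.wikimedia.org thumbnail layout.
uri_path = '/wikipedia/commons/thumb/a/ab/Example.jpg/320px-Example.jpg'

print(PATTERN.match(uri_path).group(1))  # -> '320'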

And a rough estimation of the long tail; note that ~60% of sizes have been requested fewer than 1000 times in April, and only 4% of sizes were requested more than once per second on average in April:

stat1004:/mnt/hdfs/user/filippo/thumb_pixel$ cat * | sort -k2 -rn > ~/pixel_count_avgsize
$ wc -l ~/pixel_count_avgsize 
10528 /home/filippo/pixel_count_avgsize
$ for i in 1000 10000 50000 150000 1100000 2200000 ; do
> echo -n "$i => " ; awk "\$2 > $i {print}" ~/pixel_count_avgsize | wc -l ; done
1000 => 3032
10000 => 1847
50000 => 1301
150000 => 984
1100000 => 572
2200000 => 432

Some more frequency distributions of width in pixels vs. number of requests to cache_upload during April, using bitly's data_hacks:

$ cat pixel_count_avgsize | grep -v '146.2784580498866' | awk '{print $1 " " $2}' | ~/.local/bin/histogram.py -A --max 5000 -p
# NumSamples = 82942062200; Min = 0.00; Max = 5000.00
# 830617 values outside of min/max
# Mean = inf; Variance = inf; SD = inf; Median 5227.000000
# each ∎ represents a count of 1054729691
    0.0000 -   500.0000 [79104726879]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (95.37%)
  500.0000 -  1000.0000 [3055758594]: ∎∎ (3.68%)
 1000.0000 -  1500.0000 [607802792]:  (0.73%)
 1500.0000 -  2000.0000 [151078118]:  (0.18%)
 2000.0000 -  2500.0000 [6672115]:  (0.01%)
 2500.0000 -  3000.0000 [12250857]:  (0.01%)
 3000.0000 -  3500.0000 [986989]:  (0.00%)
 3500.0000 -  4000.0000 [978102]:  (0.00%)
 4000.0000 -  4500.0000 [374602]:  (0.00%)
 4500.0000 -  5000.0000 [602535]:  (0.00%)

$ cat pixel_count_avgsize | grep -v '146.2784580498866' | awk '{print $1 " " $2}' | ~/.local/bin/histogram.py -A --max 1500 -p
# NumSamples = 82942062200; Min = 0.00; Max = 1500.00
# 173773935 values outside of min/max
# Mean = inf; Variance = inf; SD = inf; Median 5227.000000
# each ∎ represents a count of 608377288
    0.0000 -   150.0000 [45628296609]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (55.01%)
  150.0000 -   300.0000 [28113428738]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (33.90%)
  300.0000 -   450.0000 [4825666280]: ∎∎∎∎∎∎∎ (5.82%)
  450.0000 -   600.0000 [1110153755]: ∎ (1.34%)
  600.0000 -   750.0000 [1978740711]: ∎∎∎ (2.39%)
  750.0000 -   900.0000 [444815750]:  (0.54%)
  900.0000 -  1050.0000 [208827229]:  (0.25%)
 1050.0000 -  1200.0000 [236027095]:  (0.28%)
 1200.0000 -  1350.0000 [209377978]:  (0.25%)
 1350.0000 -  1500.0000 [12954120]:  (0.02%)

$ cat pixel_count_avgsize | grep -v '146.2784580498866' | awk '{print $1 " " $2}' | ~/.local/bin/histogram.py -A --max 750 -p
# NumSamples = 82942062200; Min = 0.00; Max = 750.00
# 1285776107 values outside of min/max
# Mean = inf; Variance = inf; SD = inf; Median 5227.000000
# each ∎ represents a count of 443747644
    0.0000 -    75.0000 [33281073339]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (40.13%)
   75.0000 -   150.0000 [12347223270]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (14.89%)
  150.0000 -   225.0000 [20496799287]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (24.71%)
  225.0000 -   300.0000 [7616629451]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (9.18%)
  300.0000 -   375.0000 [2831749660]: ∎∎∎∎∎∎ (3.41%)
  375.0000 -   450.0000 [1993916620]: ∎∎∎∎ (2.40%)
  450.0000 -   525.0000 [664318218]: ∎ (0.80%)
  525.0000 -   600.0000 [445835537]: ∎ (0.54%)
  600.0000 -   675.0000 [467549947]: ∎ (0.56%)
  675.0000 -   750.0000 [1511190764]: ∎∎∎ (1.82%)

Change 351793 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/software@master] thumbstats: add Hive export

https://gerrit.wikimedia.org/r/351793

Change 351793 merged by Filippo Giunchedi:
[operations/software@master] thumbstats: add Hive export

https://gerrit.wikimedia.org/r/351793

I've extracted some data from the list of thumbnails we are storing in Swift and processed it with Hive. Below is the distribution of size vs. number of thumbnails we store (more decent graphs coming):

$ sort -n pixel_count_size | grep -v -- '\N' | awk '{ print $1 " " $2 }' | ~/.local/bin/histogram.py -A --max 5000 -p
# NumSamples = 932405127; Min = 0.00; Max = 5000.00
# 370902 values outside of min/max
# Mean = 146.773189; Variance = 911349252814.819092; SD = 954646.140104; Median 7048.000000
# each ∎ represents a count of 8650312
    0.0000 -   500.0000 [648773431]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (69.58%)
  500.0000 -  1000.0000 [156884854]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (16.83%)
 1000.0000 -  1500.0000 [76993267]: ∎∎∎∎∎∎∎∎ (8.26%)
 1500.0000 -  2000.0000 [24060769]: ∎∎ (2.58%)
 2000.0000 -  2500.0000 [4634774]:  (0.50%)
 2500.0000 -  3000.0000 [18740364]: ∎∎ (2.01%)
 3000.0000 -  3500.0000 [842644]:  (0.09%)
 3500.0000 -  4000.0000 [357358]:  (0.04%)
 4000.0000 -  4500.0000 [459133]:  (0.05%)
 4500.0000 -  5000.0000 [287631]:  (0.03%)

And size vs bytes stored

$ sort -n pixel_count_size | grep -v -- '\N' | awk '{ print $1 " " $3 }' | ~/.local/bin/histogram.py -A --max 5000 -p
# NumSamples = 93158107977469; Min = 0.00; Max = 5000.00
# 1119622544767 values outside of min/max
# Mean = 0.001469; Variance = 9892379536331.593750; SD = 3145215.340216; Median 7048.000000
# each ∎ represents a count of 264775386184
    0.0000 -   500.0000 [12839588070271]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (13.78%)
  500.0000 -  1000.0000 [19220388721260]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (20.63%)
 1000.0000 -  1500.0000 [19858153963807]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (21.32%)
 1500.0000 -  2000.0000 [12166146568579]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (13.06%)
 2000.0000 -  2500.0000 [4265428797593]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (4.58%)
 2500.0000 -  3000.0000 [19725491234058]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (21.17%)
 3000.0000 -  3500.0000 [1441196798896]: ∎∎∎∎∎ (1.55%)
 3500.0000 -  4000.0000 [787038598971]: ∎∎ (0.84%)
 4000.0000 -  4500.0000 [920468982555]: ∎∎∎ (0.99%)
 4500.0000 -  5000.0000 [814583696712]: ∎∎∎ (0.87%)

What "size" are we talking about when you have a range like 0.0000 - 500.0000?

@Gilles the image width in pixels, i.e. the user-provided size in the URL

OK. I'm not sure it makes sense to group them in ranges; we should be looking at widths individually. E.g. 1024 is probably very prevalent while 1025 isn't, and if you group by ranges they'd end up in the same bucket.

@Gilles indeed for "number of thumbnails stored" I think it makes sense to extract a topN. Anyways I'm still thinking through the criterias that might make sense for deletion, if you have ideas and suggestions please let me know!

So far one thing that jumped out at me is that widths >= 2000 pixels are practically never requested, yet sizes >= 2000 pixels account for ~30% of the space used on Swift, so tackling those might be a good first candidate for space savings.

Indeed, sounds like an easy win

@Gilles indeed! I'll figure out the best way to get a list of objects to delete and batch-delete them.
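
For reference, a minimal sketch of how such a per-width top-N could be pulled out of a dump (assuming a whitespace-delimited file with columns width, object count, total bytes, similar to the Hive exports above; the file name is hypothetical):

# Rank widths by total bytes stored, from a "width count bytes" dump.
# The input file name and column layout are assumptions for illustration.
import heapq

def top_n_by_bytes(path, n=30):
    rows = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 3:
                continue
            try:
                width, count, total_bytes = int(fields[0]), int(fields[1]), int(float(fields[2]))
            except ValueError:  # header or NULL (\N) rows
                continue
            rows.append((width, count, total_bytes))
    return heapq.nlargest(n, rows, key=lambda r: r[2])

for width, count, total_bytes in top_n_by_bytes('pixel_count_size'):
    print(width, count, total_bytes)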

I took another look at the data from a "topN" perspective, comparing widths stored vs widths requested:

Stored in swift:

Width  Count     Total bytes
 2880  11964843  12831311775718
 1920  11261554   6425015169533
 1024  28957125   5976511708956
 1280  20785037   5873474216733
  800  31682681   4465070770146
 2560   5071516   4395627257010
  640  28997613   3734005332490
 1200  10402324   2729282111853
 1600   3503746   1382023698588
  600  11023267    977115815901
  320  33266285    945343491331
 2048   1183588    843496615380
  500  11243190    677898639126
  180  44676093    535302842424
 1000   2895047    517840202567
 1599   1329183    501813702161
  300  17296731    486523098940
  768   3044361    484272524988
  400  11014907    482515840480
  450   8236015    465052848325
 2000   1274958    457963673795
  240  27997194    456221478962
 5000    141739    436817216837
  720   3455190    411348308330
 3072    234411    329534967327
 1707    681331    326969562468
  440   7110046    324421309605
 2481    318200    314195236094
 1440    601393    303502728402
 4096    162046    301493984305
  512   4067290    300585251356

Widths requested on cache_upload during 201704

Width  Requests     Average size
  220  11606001844  18290.62334
   20   4136250783  603.4446222
   23   3705096914  350.4100487
  160   2911655164  10437.8238
  120   2818712767  7015.523756
   40   2569391790  1274.372662
  250   2434321520  23915.39365
  200   2321354331  16432.00934
   22   2213272469  393.5257919
   30   1959888544  1179.878229
  300   1804325358  29569.26509
  100   1658506555  6745.631721
   80   1620617095  4712.440638
   50   1578633560  2371.793851
  720   1418454605  82160.62129
  170   1262430851  15548.34542
  150   1217279056  11293.65764
   25   1166032448  864.6937873
  330   1120962933  33548.49958
  440   1100258823  49460.84618
   45   1054991445  1389.348252
   16   1043468178  620.1059713
   35    916438036  1161.290896
   15    896978994  642.609749
  180    805350434  13617.88147
   60    791369585  2933.196868
  240    775277138  20032.23698
   44    773328259  780.1695929
   24    766488859  994.1352453
   46    728860255  738.9990126
   32    699403879  1141.116107
  320    656195920  22901.20572
   90    558499230  5817.858014
   12    550757831  483.8175996
  280    485040041  27951.14572
   18    448330824  689.318727

Observations:

  1. We can stop pre-generating 2880px and 2560px I think; they take up a lot of space but have only been requested 1283304 and 7774542 times respectively (537th and 334th place)
  2. The second-biggest size (1920px) should stay; it was requested 63241046 times (119th place)

The data is also available on this Google spreadsheet

The top 100 most requested sizes represent 91.21% of all requests. The remaining long tail (any size not in that 100 whitelist) represents 68% of the storage size.

It would be interesting to know, for the long tail, how many requests are complete Varnish misses that go through to Swift. We would know exactly how much extra traffic would go to the image scalers if we only stored the 100 most requested sizes in Swift. Can you generate that data for all sizes? (# of full Varnish misses going to Swift for 201704)

Indeed, I'm creating another Hive table from webrequest including cache_status, which should be able to tell us what happened to a particular thumb:

create table filippo.webrequest_upload_pixel_response_cache_201704  as select regexp_extract(webrequest.uri_path, '.*/.*?(\\d+)px-.*', 1) as pixels, response_size, cache_status, uri_path from wmf.webrequest where webrequest_source='upload' and year = 2017 and month = 4  and uri_path like '%/thumb/%px-%';

@Gilles I've added two more sheets for hit and miss+pass from webrequest to the spreadsheet; it looks like in April there were ~400M misses for sizes outside the top 100.
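
A small sketch of the kind of roll-up behind that number (assuming CSV exports of the spreadsheet's per-width request and miss counts; the file names and column layout are hypothetical):

# Sum cache misses for widths outside the top-100 most requested sizes.
# Inputs are assumed to be CSV exports of the spreadsheet, "width,requests" and
# "width,misses" respectively; names and layout are hypothetical.
import csv

def read_counts(path):
    with open(path, newline='') as f:
        return {int(row[0]): int(row[1]) for row in csv.reader(f)}

requests = read_counts('requests_per_width.csv')
misses = read_counts('misses_per_width.csv')

top100 = set(sorted(requests, key=requests.get, reverse=True)[:100])
long_tail_misses = sum(count for width, count in misses.items() if width not in top100)

print('misses for non-top100 widths:', long_tail_misses)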

400M misses means ~154 requests per second (400,000,000 requests spread over 30 days is roughly 154/s). It would at least triple the load on Thumbor. It might be possible if/once we've repurposed all existing image scalers to Thumbor.

Change 353244 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/mediawiki-config@master] Stop prerendering thumbs at 2560/2880 pixels

https://gerrit.wikimedia.org/r/353244

Change 353244 merged by jenkins-bot:
[operations/mediawiki-config@master] Stop prerendering thumbs at 2560/2880 pixels

https://gerrit.wikimedia.org/r/353244

Mentioned in SAL (#wikimedia-operations) [2017-05-11T13:31:00Z] <addshore@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: T162796 [[gerrit:353244|Stop prerendering thumbs at 2560/2880 pixels]] (duration: 00m 41s)

List of candidates for deletion (note that some criteria might overlap):

Criteria                                                  Count     Bytes (GB)
Widths with less than 1000 thumbnails stored              7941491   999
Rarely requested but we used to prerender (2880/2560 px)  17036359  17226
0px width                                                 275112    168
Width greater than 5000px                                 512641    1556

Mentioned in SAL (#wikimedia-operations) [2017-05-25T15:43:56Z] <godog> delete thumbnails with > 2000px for wikivoyage / wikiversity / wikisource / wikiquote - T162796

Today I started deleting all thumbnails with widths > 2000px in small containers (i.e. non-commons).

Going one container at a time we're issuing around 20 deletes/s, meaning ~6h to delete all 460k objects (480GB) in non-commons containers.

For commons we're looking at 23M objects to delete (26TB); at 20 deletes/s that's ~13d when done serially per container, or less when done in parallel.
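
A minimal sketch of the kind of batch deletion involved (python-swiftclient again; the container name, regex and throttle are illustrative, not the actual tooling used):

# Delete thumbnails wider than 2000px from one container, throttled to ~20 deletes/s.
# python-swiftclient assumed; the container name is an example, not the real shard list.
import re
import time

from swiftclient.client import Connection
from swiftclient.exceptions import ClientException

WIDTH_RE = re.compile(r'/(\d+)px-[^/]+$')
MAX_WIDTH = 2000
DELETE_RATE = 20  # deletes per second

conn = Connection(authurl='http://swift.example.org/auth/v1.0',
                  user='EXAMPLE_USER', key='EXAMPLE_KEY')

def delete_wide_thumbs(container):
    _, objects = conn.get_container(container, full_listing=True)
    for obj in objects:
        m = WIDTH_RE.search(obj['name'])
        if not m or int(m.group(1)) <= MAX_WIDTH:
            continue
        try:
            conn.delete_object(container, obj['name'])
        except ClientException as exc:
            print('failed to delete', obj['name'], exc)
        time.sleep(1.0 / DELETE_RATE)  # crude client-side throttle

delete_wide_thumbs('wikiquote-en-local-thumb')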

I suggest using Parquet for analytics-usage tables:

create table filippo.webrequest_upload_pixel_response_cache_201704
STORED AS PARQUET
as select regexp_extract(webrequest.uri_path, '.*/.*?(\\d+)px-.*', 1) as pixels, response_size, cache_status, uri_path from wmf.webrequest where webrequest_source='upload' and year = 2017 and month = 4 and uri_path like '%/thumb/%px-%';

Thanks @JAllemandou! I've converted the tables to use Parquet and dropped the old plaintext tables.

This is completed as far as the cleanup is concerned; I've started https://wikitech.wikimedia.org/wiki/Swift/Thumbnails_Cleanup with a summary and context.