Delete non-used and/or non-requested thumbnail sizes periodically
Closed, Resolved (Public)

Authored By

	fgiunchedi
	Apr 12 2017, 11:33 AM

Description

As of April 2017, thumbnails stored in Swift are never deleted unless purged by MediaWiki. This is useful in the general case but wasteful in the long term: there are likely many thumbnails (exact number TBD) that we can delete according to various criteria. For example:

We store only a few thumbnails (e.g. fewer than 200) for a given size
The size hasn't been requested in the last 90 days
Obviously huge sizes, e.g. wider than 10^5 pixels

MediaWiki doesn't know anything about what sizes have been stored or requested, therefore we'll have to walk the Swift containers for deletion candidates. We can reuse thumb-stats from operations/software, which does exactly this: it walks all thumbnail containers and extracts statistics about file size, dimensions and so on.

Note that "pixels" and "size" in this context refer to thumbnail width, i.e. the pixel count that shows up in the thumbnail URL.
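
The example criteria above can be sketched as a filter over per-width statistics. This is only a minimal illustration, not the tooling used for the cleanup: the `WidthStats` record, its fields and the sample values are hypothetical, and the default thresholds are simply the example numbers from the description.

```python
from dataclasses import dataclass


@dataclass
class WidthStats:
    """Hypothetical per-width summary; not the thumb-stats output format."""
    width: int               # thumbnail width in pixels, as in the URL
    stored_count: int        # thumbnails stored in Swift at this width
    days_since_request: int  # days since this width was last requested


def deletion_reasons(stats, min_stored=200, max_idle_days=90, max_width=10**5):
    """Return every criterion a width matches (criteria may overlap)."""
    reasons = []
    if stats.stored_count < min_stored:
        reasons.append("few thumbnails stored")
    if stats.days_since_request > max_idle_days:
        reasons.append("not requested recently")
    if stats.width > max_width:
        reasons.append("implausibly wide")
    return reasons


# Made-up sample rows, for illustration only.
sample = [
    WidthStats(width=220, stored_count=5_000_000, days_since_request=0),
    WidthStats(width=1025, stored_count=12, days_since_request=200),
    WidthStats(width=200_000, stored_count=3, days_since_request=400),
]
candidates = {s.width: deletion_reasons(s) for s in sample if deletion_reasons(s)}
print(candidates)
```

Keeping the matched criteria per width, rather than a single yes/no flag, makes it easier to see where criteria overlap before anything is deleted.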

Details

	Subject	Repo	Branch	Lines +/-
	Stop prerendering thumbs at 2560/2880 pixels	operations/mediawiki-config	master	+2 -2
	thumbstats: add Hive export	operations/software	master	+44 -2

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		fgiunchedi	T162792 Reduce Swift technical debt
		Resolved		fgiunchedi	T162796 Delete non-used and/or non-requested thumbnail sizes periodically

Event Timeline

fgiunchedi created this task. Apr 12 2017, 11:33 AM

I started doing some analytics with Hive on webrequest data for upload; I'm reporting the queries here for reference. Note that running a query over a month of data took ~1h, so writing the query results into another table allows for faster querying/processing later.

create table filippo.thumb_pixels_201704  as select regexp_extract(webrequest.uri_path, '.*/.*?(\\d+)px-.*', 1) as pixels, response_size from wmf.webrequest where webrequest_source='upload' and year = 2017 and month = 4  and uri_path like '%/thumb/%px-%';

insert overwrite directory '/user/filippo/thumb_pixel' row format delimited fields terminated by ' ' select pixels, count(pixels) as count, avg(response_size) as avg_size from thumb_pixels_201704 group by pixels;
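
For readers without Hive access, the width extraction and per-width aggregation done by the two queries above can be approximated in plain Python. This is a sketch under assumptions: the regular expression is the one from the query, but `extract_width`, `summarize` and the sample paths are hypothetical, and the width is converted to an integer here whereas Hive keeps it as a string.

```python
import re
from collections import defaultdict

# Same pattern as the regexp_extract call in the Hive query.
WIDTH_RE = re.compile(r".*/.*?(\d+)px-.*")


def extract_width(uri_path):
    """Return the thumbnail width from a thumb URL path, or None if absent."""
    match = WIDTH_RE.search(uri_path)
    return int(match.group(1)) if match else None


def summarize(requests):
    """Aggregate (uri_path, response_size) pairs per width.

    Returns {width: (request_count, average_response_size)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for uri_path, response_size in requests:
        width = extract_width(uri_path)
        if width is None:
            continue  # not a thumbnail URL
        totals[width][0] += 1
        totals[width][1] += response_size
    return {w: (n, size / n) for w, (n, size) in totals.items()}


# Made-up paths and sizes, for illustration only.
sample_requests = [
    ("/wikipedia/commons/thumb/a/ab/Example.jpg/220px-Example.jpg", 100),
    ("/wikipedia/commons/thumb/c/cd/Other.png/220px-Other.png", 300),
    ("/wikipedia/commons/a/ab/Example.jpg", 999),
]
summary = summarize(sample_requests)
print(summary)
```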

And a rough estimation of the long tail: note that ~70% of sizes were requested 1000 times or fewer in April. Only ~4% of sizes were requested more than once per second (on average in April).

stat1004:/mnt/hdfs/user/filippo/thumb_pixel$ cat * | sort -k2 -rn > ~/pixel_count_avgsize
$ wc -l ~/pixel_count_avgsize 
10528 /home/filippo/pixel_count_avgsize
$ for i in 1000 10000 50000 150000 1100000 2200000 ; do
> echo -n "$i => " ; awk "\$2 > $i {print}" ~/pixel_count_avgsize | wc -l ; done
1000 => 3032
10000 => 1847
50000 => 1301
150000 => 984
1100000 => 572
2200000 => 432
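
The long-tail shares can be recomputed from the counts printed above. A minimal sketch: the numbers are copied from the shell output, while the helper names are hypothetical.

```python
# Number of distinct widths, from `wc -l` above.
total_widths = 10528

# Widths requested more than N times in April, from the loop output above.
widths_above = {
    1000: 3032,
    10000: 1847,
    50000: 1301,
    150000: 984,
    1100000: 572,
    2200000: 432,
}


def share_above(threshold):
    """Fraction of widths requested more than `threshold` times."""
    return widths_above[threshold] / total_widths


def share_at_most(threshold):
    """Fraction of widths requested `threshold` times or fewer."""
    return 1 - share_above(threshold)


for threshold in sorted(widths_above):
    print(f"> {threshold:>8} requests: {share_above(threshold):6.1%} of widths")
```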

ema subscribed. Apr 28 2017, 3:01 PM

Some more frequency distributions of width in pixels vs number of requests to cache_upload during April, using bitly's data_hacks:

P5347 Hive and data_hacks fun

$ cat pixel_count_avgsize | grep -v '146.2784580498866' | awk '{print $1 " " $2}' | ~/.local/bin/histogram.py -A --max 5000 -p
# NumSamples = 82942062200; Min = 0.00; Max = 5000.00
# 830617 values outside of min/max
# Mean = inf; Variance = inf; SD = inf; Median 5227.000000
# each ∎ represents a count of 1054729691
0.0000 - 500.0000 [79104726879]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (95.37%)
500.0000 - 1000.0000 [3055758594]: ∎∎ (3.68%)
1000.0000 - 1500.0000 [607802792]: (0.73%)
1500.0000 - 2000.0000 [151078118]: (0.18%)
2000.0000 - 2500.0000 [6672115]: (0.01%)
2500.0000 - 3000.0000 [12250857]: (0.01%)
3000.0000 - 3500.0000 [986989]: (0.00%)
3500.0000 - 4000.0000 [978102]: (0.00%)
4000.0000 - 4500.0000 [374602]: (0.00%)
4500.0000 - 5000.0000 [602535]: (0.00%)

$ cat pixel_count_avgsize | grep -v '146.2784580498866' | awk '{print $1 " " $2}' | ~/.local/bin/histogram.py -A --max 1500 -p
# NumSamples = 82942062200; Min = 0.00; Max = 1500.00
# 173773935 values outside of min/max
# Mean = inf; Variance = inf; SD = inf; Median 5227.000000
# each ∎ represents a count of 608377288
0.0000 - 150.0000 [45628296609]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (55.01%)
150.0000 - 300.0000 [28113428738]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (33.90%)
300.0000 - 450.0000 [4825666280]: ∎∎∎∎∎∎∎ (5.82%)
450.0000 - 600.0000 [1110153755]: ∎ (1.34%)
600.0000 - 750.0000 [1978740711]: ∎∎∎ (2.39%)
750.0000 - 900.0000 [444815750]: (0.54%)
900.0000 - 1050.0000 [208827229]: (0.25%)
1050.0000 - 1200.0000 [236027095]: (0.28%)
1200.0000 - 1350.0000 [209377978]: (0.25%)
1350.0000 - 1500.0000 [12954120]: (0.02%)

$ cat pixel_count_avgsize | grep -v '146.2784580498866' | awk '{print $1 " " $2}' | ~/.local/bin/histogram.py -A --max 750 -p
# NumSamples = 82942062200; Min = 0.00; Max = 750.00
# 1285776107 values outside of min/max
# Mean = inf; Variance = inf; SD = inf; Median 5227.000000
# each ∎ represents a count of 443747644
0.0000 - 75.0000 [33281073339]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (40.13%)
75.0000 - 150.0000 [12347223270]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (14.89%)
150.0000 - 225.0000 [20496799287]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (24.71%)
225.0000 - 300.0000 [7616629451]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (9.18%)
300.0000 - 375.0000 [2831749660]: ∎∎∎∎∎∎ (3.41%)
375.0000 - 450.0000 [1993916620]: ∎∎∎∎ (2.40%)
450.0000 - 525.0000 [664318218]: ∎ (0.80%)
525.0000 - 600.0000 [445835537]: ∎ (0.54%)
600.0000 - 675.0000 [467549947]: ∎ (0.56%)
675.0000 - 750.0000 [1511190764]: ∎∎∎ (1.82%)

Change 351793 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/software@master] thumbstats: add Hive export

https://gerrit.wikimedia.org/r/351793

gerritbot added a project: Patch-For-Review. May 4 2017, 9:12 AM

Change 351793 merged by Filippo Giunchedi:
[operations/software@master] thumbstats: add Hive export

https://gerrit.wikimedia.org/r/351793

I've extracted some data from the list of thumbnails we are storing in Swift and processed it with Hive. Here is the distribution of size vs number of thumbnails we store (more decent graphs coming):

$ sort -n pixel_count_size | grep -v -- '\N' | awk '{ print $1 " " $2 }' | ~/.local/bin/histogram.py -A --max 5000 -p
# NumSamples = 932405127; Min = 0.00; Max = 5000.00
# 370902 values outside of min/max
# Mean = 146.773189; Variance = 911349252814.819092; SD = 954646.140104; Median 7048.000000
# each ∎ represents a count of 8650312
    0.0000 -   500.0000 [648773431]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (69.58%)
  500.0000 -  1000.0000 [156884854]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (16.83%)
 1000.0000 -  1500.0000 [76993267]: ∎∎∎∎∎∎∎∎ (8.26%)
 1500.0000 -  2000.0000 [24060769]: ∎∎ (2.58%)
 2000.0000 -  2500.0000 [4634774]:  (0.50%)
 2500.0000 -  3000.0000 [18740364]: ∎∎ (2.01%)
 3000.0000 -  3500.0000 [842644]:  (0.09%)
 3500.0000 -  4000.0000 [357358]:  (0.04%)
 4000.0000 -  4500.0000 [459133]:  (0.05%)
 4500.0000 -  5000.0000 [287631]:  (0.03%)

And size vs bytes stored

$ sort -n pixel_count_size | grep -v -- '\N' | awk '{ print $1 " " $3 }' | ~/.local/bin/histogram.py -A --max 5000 -p
# NumSamples = 93158107977469; Min = 0.00; Max = 5000.00
# 1119622544767 values outside of min/max
# Mean = 0.001469; Variance = 9892379536331.593750; SD = 3145215.340216; Median 7048.000000
# each ∎ represents a count of 264775386184
    0.0000 -   500.0000 [12839588070271]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (13.78%)
  500.0000 -  1000.0000 [19220388721260]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (20.63%)
 1000.0000 -  1500.0000 [19858153963807]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (21.32%)
 1500.0000 -  2000.0000 [12166146568579]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (13.06%)
 2000.0000 -  2500.0000 [4265428797593]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (4.58%)
 2500.0000 -  3000.0000 [19725491234058]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (21.17%)
 3000.0000 -  3500.0000 [1441196798896]: ∎∎∎∎∎ (1.55%)
 3500.0000 -  4000.0000 [787038598971]: ∎∎ (0.84%)
 4000.0000 -  4500.0000 [920468982555]: ∎∎∎ (0.99%)
 4500.0000 -  5000.0000 [814583696712]: ∎∎∎ (0.87%)

What "size" are we talking about when you have a range like 0.0000 - 500.0000?

@Gilles the image width in pixels, i.e. the user-provided size in the url

fgiunchedi updated the task description. May 4 2017, 1:28 PM

OK. I'm not sure it makes sense to group them in ranges, we should be looking at widths individually. I.e. 1024 is probably very prevalent, while 1025 isn't, and if you're going by ranges they'd be in the same bucket.

@Gilles indeed, for "number of thumbnails stored" I think it makes sense to extract a top N. Anyway, I'm still thinking through the criteria that might make sense for deletion; if you have ideas and suggestions please let me know!

So far one thing that jumped out at me is that widths >= 2000 pixels are practically never requested, yet they account for ~30% of the space used on Swift, so tackling that might be a good first candidate for space savings.

Indeed, sounds like an easy win

@Gilles indeed! I'll figure out the best way to get a list of objects to delete and batch-delete them.

I took another look at the data from a "topN" perspective, comparing widths stored vs widths requested:

Stored in swift:

Width	Count	Total bytes
2880	11964843	12831311775718
1920	11261554	6425015169533
1024	28957125	5976511708956
1280	20785037	5873474216733
800	31682681	4465070770146
2560	5071516	4395627257010
640	28997613	2734005332490
1200	10402324	2729282111853
1600	3503746	1382023698588
600	11023267	977115815901
320	33266285	945343491331
2048	1183588	843496615380
500	11243190	677898639126
180	44676093	535302842424
1000	2895047	517840202567
1599	1329183	501813702161
300	17296731	486523098940
768	3044361	484272524988
400	11014907	482515840480
450	8236015	465052848325
2000	1274958	457963673795
240	27997194	456221478962
5000	141739	436817216837
720	3455190	411348308330
3072	234411	329534967327
1707	681331	326969562468
440	7110046	324421309605
2481	318200	314195236094
1440	601393	303502728402
4096	162046	301493984305
512	4067290	300585251356

Widths requested on cache_upload during 201704

Width	Requests	Average size
220	11606001844	18290.62334
20	4136250783	603.4446222
23	3705096914	350.4100487
160	2911655164	10437.8238
120	2818712767	7015.523756
40	2569391790	1274.372662
250	2434321520	23915.39365
200	2321354331	16432.00934
22	2213272469	393.5257919
30	1959888544	1179.878229
300	1804325358	29569.26509
100	1658506555	6745.631721
80	1620617095	4712.440638
50	1578633560	2371.793851
720	1418454605	82160.62129
170	1262430851	15548.34542
150	1217279056	11293.65764
25	1166032448	864.6937873
330	1120962933	33548.49958
440	1100258823	49460.84618
45	1054991445	1389.348252
16	1043468178	620.1059713
35	916438036	1161.290896
15	896978994	642.609749
180	805350434	13617.88147
60	791369585	2933.196868
240	775277138	20032.23698
44	773328259	780.1695929
24	766488859	994.1352453
46	728860255	738.9990126
32	699403879	1141.116107
320	656195920	22901.20572
90	558499230	5817.858014
12	550757831	483.8175996
280	485040041	27951.14572
18	448330824	689.318727

Observations:

I think we can stop pre-generating 2880px and 2560px: they take up a lot of space but were requested only 1283304 and 7774542 times respectively (537th and 334th place)
The second-biggest size (1920px) should stay, requested 63241046 times (119th place)

The data is also available on this google spreadsheet

The top 100 most requested sizes represent 91.21% of all requests. The remaining long tail (any size not in that 100 whitelist) represents 68% of the storage size.

It would be interesting to know for the long tail how many requests are complete varnish misses going to swift. We would know exactly how much extra traffic would go to the image scalers if we only stored the 100 most requested sizes in Swift. Can you generate that data for all sizes? (# of full Varnish misses going to Swift for 201704).
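
The two percentages quoted above (share of requests covered by the top N widths, and share of storage held by everything outside that whitelist) can be computed along these lines. This is a hedged sketch: the function names are hypothetical and the sample numbers are made up for illustration, not taken from the spreadsheet.

```python
def top_n_widths(requests_by_width, n):
    """Return the set of the n most requested widths."""
    ranked = sorted(requests_by_width, key=requests_by_width.get, reverse=True)
    return set(ranked[:n])


def top_n_request_share(requests_by_width, n):
    """Fraction of all requests that go to the n most requested widths."""
    top = top_n_widths(requests_by_width, n)
    covered = sum(requests_by_width[w] for w in top)
    return covered / sum(requests_by_width.values())


def long_tail_storage_share(requests_by_width, bytes_by_width, n):
    """Fraction of stored bytes held by widths outside the top-n whitelist."""
    top = top_n_widths(requests_by_width, n)
    tail = sum(size for w, size in bytes_by_width.items() if w not in top)
    return tail / sum(bytes_by_width.values())


# Made-up toy data: width -> request count, width -> bytes stored.
toy_requests = {220: 900, 20: 50, 1920: 49, 2880: 1}
toy_bytes = {220: 10, 20: 1, 1920: 29, 2880: 60}

request_share = top_n_request_share(toy_requests, 2)
storage_share = long_tail_storage_share(toy_requests, toy_bytes, 2)
print(f"top 2 widths: {request_share:.1%} of requests")
print(f"long tail: {storage_share:.1%} of stored bytes")
```

Widths that are stored but never requested count towards the long tail automatically, since they cannot appear in the whitelist.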

In T162796#3248654, @Gilles wrote:

The top 100 most requested sizes represent 91.21% of all requests. The remaining long tail (any size not in that 100 whitelist) represents 68% of the storage size.

It would be interesting to know for the long tail how many requests are complete varnish misses going to swift. We would know exactly how much extra traffic would go to the image scalers if we only stored the 100 most requested sizes in Swift. Can you generate that data for all sizes? (# of full Varnish misses going to Swift for 201704).

Indeed, I'm creating another Hive table from webrequest that includes cache_status, which should be able to tell us what happened to a particular thumb:

create table filippo.webrequest_upload_pixel_response_cache_201704  as select regexp_extract(webrequest.uri_path, '.*/.*?(\\d+)px-.*', 1) as pixels, response_size, cache_status, uri_path from wmf.webrequest where webrequest_source='upload' and year = 2017 and month = 4  and uri_path like '%/thumb/%px-%';

@Gilles I've added two more sheets to the spreadsheet, for hit and miss+pass from webrequest. It looks like in April there were ~400M misses for sizes outside the top 100.

400M misses over the month means 154 requests per second. It would at least triple the load on Thumbor. Might be possible if/once we've repurposed all existing image scalers to Thumbor.
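
The request rate above follows from dividing the monthly miss count by the number of seconds in April. A quick check, assuming the rounded ~400M figure and a 30-day month:

```python
SECONDS_IN_APRIL = 30 * 24 * 60 * 60  # 2,592,000 seconds

misses = 400_000_000  # rounded figure quoted in the thread
rate = misses / SECONDS_IN_APRIL

print(f"{rate:.0f} requests per second")
```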

Change 353244 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/mediawiki-config@master] Stop prerendering thumbs at 2560/2880 pixels

https://gerrit.wikimedia.org/r/353244

Change 353244 merged by jenkins-bot:
[operations/mediawiki-config@master] Stop prerendering thumbs at 2560/2880 pixels

https://gerrit.wikimedia.org/r/353244

Mentioned in SAL (#wikimedia-operations) [2017-05-11T13:31:00Z] <addshore@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: T162796 [[gerrit:353244|Stop prerendering thumbs at 2560/2880 pixels]] (duration: 00m 41s)

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. May 11 2017, 2:58 PM

List of candidates for deletion: (note some criteria might overlap)

Criteria	Count	Bytes (GB)
Widths with less than 1000 thumbnails stored	794149	1999
Rarely requested but we used to prerender (2880/2560 px)	17036359	17226
0px width	275112	168
Width greater than 5000px	512641	1556

Volans subscribed. May 16 2017, 9:42 PM

Gilles added a subscriber: Fjalapeno. May 19 2017, 3:55 PM

Gilles added a subscriber: GWicke. May 19 2017, 4:22 PM

Mentioned in SAL (#wikimedia-operations) [2017-05-25T15:43:56Z] <godog> delete thumbnails with > 2000px for wikivoyage / wikiversity / wikisource / wikiquote - T162796

Today I started deleting all thumbnails with widths > 2000px in small containers (i.e. non-commons).

Going one container at a time, we're issuing around 20 deletes/s, meaning ~6h to delete all 460k objects (480GB) for non-commons containers.

For commons we're looking at 23M objects to delete (26TB), at 20 delete/s that's ~13d when done serially per-container, or less when done in parallel.
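
The duration estimates above are plain division of object count by delete rate. A small sketch, using the figures from the thread; the `parallel_workers` parameter and the assumption that throughput scales linearly with it are hypothetical and were not measured here.

```python
def deletion_time_seconds(objects, deletes_per_second=20, parallel_workers=1):
    """Estimated wall-clock time, assuming throughput scales linearly
    with the number of parallel workers (an untested assumption)."""
    return objects / (deletes_per_second * parallel_workers)


non_commons_hours = deletion_time_seconds(460_000) / 3600
commons_days = deletion_time_seconds(23_000_000) / 86_400
commons_days_4_workers = deletion_time_seconds(23_000_000, parallel_workers=4) / 86_400

print(f"non-commons: {non_commons_hours:.1f} h")
print(f"commons, serial: {commons_days:.1f} days")
print(f"commons, 4 workers: {commons_days_4_workers:.1f} days")
```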

I suggest using Parquet for tables used for analytics:

create table filippo.webrequest_upload_pixel_response_cache_201704 STORED AS PARQUET as select regexp_extract(webrequest.uri_path, '.*/.*?(\\d+)px-.*', 1) as pixels, response_size, cache_status, uri_path from wmf.webrequest where webrequest_source='upload' and year = 2017 and month = 4 and uri_path like '%/thumb/%px-%';

Thanks @JAllemandou! I've converted the tables to use Parquet and dropped the old plaintext tables.

This is completed as far as the cleanup is concerned, I've started https://wikitech.wikimedia.org/wiki/Swift/Thumbnails_Cleanup with a summary and context

EBernhardson subscribed. Dec 12 2019, 5:50 PM
