ms swift capacity for FY 26/27
Open, Needs Triage, Public

Description

We need to decide whether, and how much, extra capacity we want to budget for in FY 26/27. To do that, we need to estimate our capacity needs in 15 months' time (i.e. the end of 26/27 Q4, or June 2027). This is challenging because over the last 12 months (about as far back as Grafana will easily let us plot) growth has been distinctly non-linear:

[Graph: Swift-1773159355453.png. The green line is eqiad, the yellow codfw; note the non-zero Y axis.]

To put that in concrete terms, here are estimates of the annual growth rate based on usage 12, 6, and 2 months ago, with current usage at 1.23 PB in codfw and 1.26 PB in eqiad:

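| Months ago | codfw size (TB) | codfw growth since (TB) | codfw growth rate (TB/y) | eqiad size (TB) | eqiad growth since (TB) | eqiad growth rate (TB/y) |
| --- | --- | --- | --- | --- | --- | --- |
| 12 | 938 | 292 | 292 | 1017 | 243 | 243 |
| 6 | 992 | 238 | 476 | 1070 | 190 | 380 |
| 2 | 1120 | 110 | 660 | 1180 | 80 | 480 |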
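
The growth rates in the table can be reproduced with a few lines of Python. This is only a sketch, assuming simple linear extrapolation of each window to 12 months; the names `CURRENT_TB`, `PAST_TB` and `annual_growth_rate` are made up for illustration and are not part of any existing tooling.

```python
# Logical (pre-replication) usage in TB, copied from the table above.
CURRENT_TB = {"codfw": 1230, "eqiad": 1260}
PAST_TB = {
    12: {"codfw": 938, "eqiad": 1017},
    6: {"codfw": 992, "eqiad": 1070},
    2: {"codfw": 1120, "eqiad": 1180},
}


def annual_growth_rate(current_tb, past_tb, months_ago):
    """Scale the growth seen over `months_ago` months to a 12-month rate (TB/y).

    Assumes growth continues linearly at the pace of the chosen window.
    """
    return (current_tb - past_tb) * 12 / months_ago


# One rate per (window, datacentre) pair.
rates = {
    (months, dc): annual_growth_rate(CURRENT_TB[dc], past, months)
    for months, sizes in PAST_TB.items()
    for dc, past in sizes.items()
}

for (months, dc), rate in sorted(rates.items(), reverse=True):
    print(f"{dc}: {rate:.0f} TB/y based on the last {months} months")
```

The shorter the window, the higher the rate, which is the non-linearity visible in the graph.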

We use 3x replication, so these figures need multiplying by 3 to get raw capacity. Currently codfw has 33 storage hosts (6336 TB raw capacity) and eqiad has 34 (6528 TB), and capacity is about 73% used. We would like to plan not to exceed 85% of capacity.

The number of objects is declining because of ongoing work to delete thumbnails (T379942), but that capacity gain is more than offset by uploads of more and larger originals.

If we take the most pessimistic projection of 660 TB/year (the codfw rate over the last 2 months), then over 15 months that equates to 660*15*3/12 = 2475 TB of raw storage.
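
As a sanity check on that arithmetic, here is a minimal sketch of the conversion from a logical annual growth rate to raw growth over the planning horizon. The function name `raw_growth_tb` is hypothetical; the 3x replication factor and the 15-month horizon are the ones stated above.

```python
def raw_growth_tb(rate_tb_per_year, months, replication=3):
    """Raw storage growth (TB) over `months`, given a logical rate in TB/y.

    Assumes the rate stays constant over the whole period and that every
    logical byte is stored `replication` times.
    """
    return rate_tb_per_year * months / 12 * replication


worst_case = raw_growth_tb(660, 15)
print(f"Worst-case raw growth over 15 months: {worst_case:.0f} TB")
```

Passing one of the lower rates from the table (e.g. 476 or 292 for codfw) gives the less pessimistic scenarios.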

Current total raw storage use in eqiad (per swift-recon -d) is 4230 TB. Adding 2475 gives 6705 TB. If we want that to be 85% of total capacity, we need total capacity to be 7888 TB, which at 192 TB per host (the per-host figure implied by the current fleet) is just over 41 Config-J systems, 7 more than we currently have.

Current total raw storage use in codfw is 4254 TB. Adding 2475 gives 6729 TB. If we want that to be 85% of total capacity, we need total capacity to be 7916 TB, which is likewise just over 41 Config-J systems, 8 more than we currently have.
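
The two calculations above can be sketched as follows. The per-host capacity of 192 TB is an assumption derived from the current fleet (6336 TB across 33 hosts) and may not match the actual Config-J spec; all names are hypothetical.

```python
import math

TARGET_UTILISATION = 0.85
# Assumption: new hosts match the current per-host capacity (6336 TB / 33 hosts).
HOST_CAPACITY_TB = 6336 / 33
RAW_GROWTH_TB = 2475  # worst-case raw growth over 15 months, from above

# Raw usage per swift-recon -d and current host counts, as quoted above.
SITES = {
    "eqiad": {"raw_used_tb": 4230, "hosts": 34},
    "codfw": {"raw_used_tb": 4254, "hosts": 33},
}


def required_capacity_tb(raw_used_tb, raw_growth_tb, target=TARGET_UTILISATION):
    """Total raw capacity at which projected usage equals `target` utilisation."""
    return (raw_used_tb + raw_growth_tb) / target


plan = {}
for dc, site in SITES.items():
    capacity = required_capacity_tb(site["raw_used_tb"], RAW_GROWTH_TB)
    hosts = capacity / HOST_CAPACITY_TB  # fractional host count
    plan[dc] = {
        "capacity_tb": capacity,
        "hosts": hosts,
        # Rounded down, matching the "just over 41" reading in the text.
        "extra_hosts": math.floor(hosts) - site["hosts"],
    }
    print(f"{dc}: {capacity:.0f} TB needed, {hosts:.2f} hosts, "
          f"+{plan[dc]['extra_hosts']} hosts")
```

The sketch rounds the host count down to match the figures in the description; swapping `math.floor` for `math.ceil` would be the more conservative choice and adds one host per site.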

Event Timeline

I have sped up the deletion of thumbnails; maybe that will make a dent? Let's see.

Another thing: I'm planning to shut down transcoding of videos that are not used anywhere in the projects, which should reduce the size of the transcode buckets by roughly 90%. I don't know how big these containers are, though.

A quick back-of-the-envelope estimate is about 73 TB for the commons transcode buckets.

Thanks! Adding around 100 TB of thumbnail clean-up removals per DC, I think we can drop maybe one host per DC from the expansion (100 TB is a very conservative estimate; it will be around 200 TB).