
swift capacity planning
Open, Normal, Public

Description

swift in eqiad is now at 77% average utilization; we should look into adding more capacity.

note that even if we were to move thumbnails off swift, originals still dominate in terms of raw space used. when I last ran the numbers in july, originals were 50T and thumbs 25T.

the growth looks like this:

swift growth in eqiad

Details

Related Gerrit Patches:
operations/software/swift-ring : master : eqiad-prod: ms-be101[567] object weight to 1000
operations/puppet : production : swift: increase rsync server max_connections
operations/software/swift-ring : master : eqiad-prod: add ms-be101[678]

Event Timeline

fgiunchedi claimed this task.
fgiunchedi raised the priority of this task from to Needs Triage.
fgiunchedi updated the task description.
fgiunchedi added a project: acl*sre-team.
fgiunchedi changed Security from none to None.
fgiunchedi added subscribers: fgiunchedi, faidon.

per-day growth

updated stats from @faidon

11:19 <paravoid> 58030481580675 (58.0TB) 27723584 public
11:19 <paravoid> 32045513679204 (32.0TB) 428733494 thumb
11:19 <paravoid> 51652815751 (51.7GB) 9808 temp
11:19 <paravoid> 4759270036442 (4.8TB) 6032579 deleted
11:19 <paravoid> 2543957194018 (2.5TB) 234330 transcoded
11:19 <paravoid> 28964076056 (29.0GB) 9315391 render
fgiunchedi added a comment. Edited Dec 11 2014, 4:17 PM

updated stats:

growth by day

so it grows on average ~140GB/day (usable space, raw would be 3x that)

the last three machines we added lasted ~8months:

and with 3x replication that means that we essentially used a machine's capacity over eight months (3TB disks)

2800G * 12 disks = 33600G per machine; 33600G / (8 * 30 days) = 140GB/day

so the estimate seems to be correct.

Three additional machines will give us another 8 months at the current rate, so I think it makes sense to go ahead with an order for those.
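
The arithmetic above can be double-checked with a short sketch. The inputs (2800G usable per disk, 12 disks per machine, 3 machines, 8 months of 30 days, 3x replication) come from this comment; the function names are hypothetical and the model assumes growth is linear.

```python
def usable_growth_per_day(disk_gb, disks_per_machine, machines, replicas, days):
    """Usable (pre-replication) GB consumed per day, given the raw capacity
    of the machines that were filled up over `days`."""
    raw_gb = disk_gb * disks_per_machine * machines
    return raw_gb / replicas / days


def days_of_headroom(disk_gb, disks_per_machine, machines, replicas, growth_gb_per_day):
    """Days until `machines` new hosts are full at the given usable growth rate."""
    usable_gb = disk_gb * disks_per_machine * machines / replicas
    return usable_gb / growth_gb_per_day


# Figures quoted in the comment above.
rate = usable_growth_per_day(2800, 12, machines=3, replicas=3, days=8 * 30)
headroom = days_of_headroom(2800, 12, machines=3, replicas=3, growth_gb_per_day=rate)

print(f"{rate:.0f} GB/day usable, {rate * 3:.0f} GB/day raw")
print(f"{headroom / 30:.0f} months of headroom from three more machines")
```

With these inputs the usable rate comes out at 140GB/day and three more machines cover 8 months, matching the estimate above.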

fgiunchedi triaged this task as Normal priority. Dec 22 2014, 9:01 AM

Change 198256 had a related patch set uploaded (by Filippo Giunchedi):
eqiad-prod: add ms-be101[678]

https://gerrit.wikimedia.org/r/198256

Change 198256 merged by Filippo Giunchedi:
eqiad-prod: add ms-be101[678]

https://gerrit.wikimedia.org/r/198256

Change 198697 had a related patch set uploaded (by Filippo Giunchedi):
swift: increase rsync server max_connections

https://gerrit.wikimedia.org/r/198697

Change 198697 merged by Filippo Giunchedi:
swift: increase rsync server max_connections

https://gerrit.wikimedia.org/r/198697

Change 198756 had a related patch set uploaded (by Filippo Giunchedi):
eqiad-prod: ms-be101[567] object weight to 1000

https://gerrit.wikimedia.org/r/198756

Change 198756 merged by Filippo Giunchedi:
eqiad-prod: ms-be101[567] object weight to 1000

https://gerrit.wikimedia.org/r/198756

avg 77% used, new machines currently at weight 2000 and rebalancing

$ sudo swift-recon -d --human-readable
===============================================================================
--> Starting reconnaissance on 18 hosts
===============================================================================
[2015-04-07 10:28:24] Checking disk usage now
Distribution Graph:
  7%    2 ***
  9%    2 ***
 10%    3 ****
 12%    2 ***
 13%    2 ***
 14%    1 *
 16%    2 ***
 17%    1 *
 18%    3 ****
 19%    2 ***
 20%    2 ***
 21%    6 *********
 23%    3 ****
 24%    2 ***
 26%    1 *
 27%    1 *
 29%    1 *
 44%    1 *
 45%   29 *********************************************
 46%    6 *********
 82%    2 ***
 83%   13 ********************
 84%   14 *********************
 85%   17 **************************
 86%   41 ****************************************************************
 87%   44 *********************************************************************
 88%   32 **************************************************
 89%   16 *************************
Disk usage: space used: 392 TB of 506 TB
Disk usage: space free: 114 TB of 506 TB
Disk usage: lowest: 7.03%, highest: 89.98%, avg: 77.4533400292%
===============================================================================
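
For tracking this over time, the summary lines can be turned into a cluster-wide utilization figure. This is a minimal sketch that assumes the `--human-readable` summary is always reported in TB, as in the paste above; `parse_usage` is a hypothetical helper, not part of swift-recon.

```python
import re

# Summary lines copied from the swift-recon output above.
summary = """\
Disk usage: space used: 392 TB of 506 TB
Disk usage: space free: 114 TB of 506 TB
"""


def parse_usage(text):
    """Return (used, total) in TB from the 'space used' summary line."""
    match = re.search(r"space used: (\d+(?:\.\d+)?) TB of (\d+(?:\.\d+)?) TB", text)
    if match is None:
        raise ValueError("no 'space used' line found")
    return float(match.group(1)), float(match.group(2))


used_tb, total_tb = parse_usage(summary)
print(f"cluster-wide utilization: {used_tb / total_tb:.1%}")
```

This gives about 77.5%, slightly different from the `avg` line (77.45%), presumably because swift-recon averages the per-disk percentages rather than dividing total used by total capacity.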

new machines fully in service at weight 3000, old machines are still freeing up space

[2015-05-04 10:14:34] Checking disk usage now
Distribution Graph:
  7%    2 **
  8%    1 *
  9%    3 ****
 14%    1 *
 16%    3 ****
 17%    6 ********
 18%    3 ****
 19%    3 ****
 20%    4 *****
 21%    2 **
 22%    2 **
 23%    1 *
 24%    1 *
 25%    2 **
 26%    1 *
 27%    1 *
 71%   32 **********************************************
 72%    4 *****
 73%   16 ***********************
 74%   47 *********************************************************************
 75%   19 ***************************
 76%    2 **
 78%    8 ***********
 79%   34 *************************************************
 80%   37 ******************************************************
 81%   14 ********************
 82%    2 **
 83%    1 *
Disk usage: space used: 386 TB of 508 TB
Disk usage: space free: 122 TB of 508 TB
Disk usage: lowest: 7.0%, highest: 83.08%, avg: 75.9807406116%

daily growth for the last 4mo in eqiad, still hovering around 140GB/day

we're at an average 81% utilization in eqiad, in other words ~5% increase in 115 days
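
As a rough projection from these two data points, here is a sketch of how long it would take to hit a given utilization if nothing is added. The 90% ceiling is an assumption picked for illustration, not a figure from this task, and `days_until` is a hypothetical helper that assumes linear growth on fixed capacity.

```python
def days_until(threshold_pct, current_pct, delta_pct, delta_days):
    """Days until utilization reaches `threshold_pct`, extrapolating the
    observed increase of `delta_pct` points over `delta_days` days."""
    rate_per_day = delta_pct / delta_days
    return (threshold_pct - current_pct) / rate_per_day


# 81% now, ~5 points gained in 115 days (from the comment above).
# 90 is an assumed ceiling, chosen only for illustration.
remaining = days_until(90, current_pct=81, delta_pct=5, delta_days=115)
print(f"~{remaining:.0f} days until 90% at the current rate")
```

At the observed rate that is roughly 207 days, a bit under 7 months.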

Last order was 3x ms-be for eqiad only; IIRC we have budgeted for 5x for eqiad/codfw. Next order should fill the gap between eqiad and codfw, so we could do +6 in codfw and +3 in eqiad

Restricted Application added subscribers: Matanya, Aklapper. Aug 26 2015, 4:07 PM
brion added a subscriber: brion. Sep 1 2015, 4:21 PM
fgiunchedi renamed this task from swift eqiad capacity planning to swift capacity planning. Sep 21 2015, 3:17 PM

over the last year we're still averaging ~140GB/day or ~51TB/year (not including 3x replication) : media account bytes

each machine with 3TB disks has ~2.8T * 12 = 33.6T available.

Say we want to run that capacity at 75% utilization: that's ~25TB usable per machine, so covering one year takes 51/25 ≈ 2 machines, and with 3x replication that's 6 machines with 3TB disks per year (per datacenter).
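
The same sizing as a small sketch, so the inputs can be changed when disk sizes or the growth rate move. All figures come from the comments above; `machines_per_year` and its parameter names are hypothetical.

```python
def machines_per_year(growth_gb_per_day, disk_tb, disks, target_utilization, replicas):
    """Machines needed per year (per datacenter) to absorb usable growth,
    filling each machine only up to `target_utilization`."""
    growth_tb_per_year = growth_gb_per_day * 365 / 1000
    usable_tb_per_machine = disk_tb * disks * target_utilization
    one_copy = growth_tb_per_year / usable_tb_per_machine
    return growth_tb_per_year, usable_tb_per_machine, one_copy * replicas


growth_tb, per_machine_tb, total = machines_per_year(
    growth_gb_per_day=140,    # usable growth, not including replication
    disk_tb=2.8,              # ~2.8T usable per 3TB disk
    disks=12,
    target_utilization=0.75,
    replicas=3,
)
print(f"{growth_tb:.1f} TB/year, {per_machine_tb:.1f} TB per machine, "
      f"{total:.1f} machines/year")
```

Unrounded, this lands just above 6 machines per year, in line with the 6x figure above.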

another factor for swift capacity planning purposes is the space allocated to different container types, most importantly thumbs and originals (69T vs 89T)

swift eqiad production, last 15 weeks

Restricted Application added a subscriber: Steinsplitter. Jul 12 2016, 11:27 AM