
RESTBase storage capacity planning
Closed, ResolvedPublic

Description

While normalized storage utilization (at the time of this writing) is ~50%, we are over-utilized on device IO. This leaves us in a position where we are no longer able to support new storage use-cases until additional capacity is created.

The primary culprits here are the Samsung SSDs installed in 17 of our 24 hosts. These SSDs (purchased outside the usual channels due to cost concerns) exhibit significantly higher CPU iowait for a given number of IOPS than the Intel and HP disks the foundation typically purchases. This elevated iowait correlates directly with higher Cassandra read latencies; if it weren't for the 7 hosts equipped with well-performing disks, Cassandra's ability to route around poorly performing hosts, and speculative retries, the impact on end-users would be unacceptable.

IO Capacity (read: SSD Performance)

        | IOPS (r/w)   | Bandwidth (r/w)    | Latency (typical)
Samsung | 1763 / 197   | 7052KB / 788KB     | 1-40ms
Intel   | 27538 / 3077 | 110153KB / 12309KB | 200us

Storage Capacity

IO capacity notwithstanding, there is also a finite amount of utilizable storage space, and a number of (unplanned) use-cases have been proposed. Based on the number of hosts per rack and the runway needed to support organic growth, the upper bound on utilization (i.e. the point at which we commission no further storage use-cases) is 60% (see Cassandra/CapacityPlanning#Establishing_an_Upper_Bound for more on this).
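
For intuition only, here is a rough sketch of how a bound like 60% can fall out of failure headroom plus a growth reserve. This is not the derivation from the wiki page above; the hosts-per-rack and reserve values below are hypothetical.

Illustrative upper-bound calculation (sketch)
# Rough illustration only; NOT the actual derivation (see the wiki page linked above).
# Assumption: we keep one host's worth of capacity per rack free to absorb a
# single-host failure, plus an additional fraction reserved for organic growth.

def max_utilization(hosts_per_rack, growth_reserve):
    failure_headroom = 1.0 / hosts_per_rack  # one host's share of the rack, kept free
    return 1.0 - failure_headroom - growth_reserve

# Hypothetical inputs: 4 hosts per rack, 15% reserved for growth -> 0.6 (60%)
print(max_utilization(hosts_per_rack=4, growth_reserve=0.15))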

Proposal

It would seem the only remedy for the Samsung SSDs is to replace them. Simply replacing the affected SSDs will restore expected performance and give us some much-needed breathing room, but won't provide much in the way of additional storage capacity for unplanned use-cases. Therefore, we should capitalize on the effort spent here to increase storage capacity as well (at least by enough to get us through this fiscal year). In some cases it may make more sense to replace the entire host instead (for example, when a lease is about to expire or a warranty is about to end).

Affected hosts

  • {T205092}
  • restbase2007
  • restbase2008
  • restbase1007
  • restbase1008
  • restbase1009
  • restbase1010
  • restbase1011
  • restbase1012
  • restbase1013
  • restbase1014
  • restbase1015

Update

As an example of the concrete problems this creates, consider the failure of restbase1015 on 2018-10-14. After the recent data-center switchover, the async jobs were kept in eqiad (read: eqiad was handling both live requests and background processing). With this added load, latency became unacceptably high when restbase1015 went down; there wasn't enough headroom to weather the failure of one host (3 instances).

Screenshot from 2018-10-19 14-28-40.png (576×1 px, 84 KB)

Source: https://grafana.wikimedia.org/dashboard/snapshot/DjZYaOas2Rcp904crR4PXac5ZyC58JcQ?orgId=1


See also:

RESTBase storage capacity planning (Google Doc)

Event Timeline

Eevans updated the task description.

Idea:

We have a good understanding of the limits of the Samsung devices (since we routinely run them past those limits), less so for the Intels/HPs, because they are so under-utilized. Perhaps the easiest way to create a concrete comparison (and infer needed capacity) would be to shut down Cassandra and benchmark a quiescent storage device. We could do this for a Samsung-equipped host (say restbase2001) and an Intel-equipped one (say restbase2010); another Intel-equipped host not in use could work as well.

We can use fio to measure IOPS, for example:

IO benchmark
fio --randrepeat=1 \
    --ioengine=libaio \
    --direct=1 \
    --gtod_reduce=1 \
    --name=test \
    --filename=test \
    --bs=4k \
    --iodepth=64 \
    --size=4G \
    --readwrite=randrw \
    --rwmixread=90
NOTE: Our workload is an approximately 90/10 read/write split (a little less read-heavy in codfw, a little more in eqiad)

We can use ioping to measure latency, for example:

Latency
ioping -c 25 <path>

Comments, suggestions, etc?

@Eevans +1 to benchmark different devices and possibly varying the fio parameters too, since now we have a pretty good idea of what the workload looks like.

Do you have specific variations in mind?

Mostly playing with a lower iodepth, e.g. 4, as I don't think we see a lot of queueing during normal workloads?

OK; makes sense.
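
A hedged sketch of what that iodepth sweep could look like, assuming fio is installed and reusing the benchmark parameters above; the depth values and the per-depth log files are just examples, not an agreed procedure.

iodepth sweep (sketch)
import subprocess

IODEPTHS = [1, 4, 16, 64]  # hypothetical values to sweep

for depth in IODEPTHS:
    # One fio run per queue depth, otherwise identical to the benchmark above;
    # raw output goes to a per-depth log for manual comparison.
    with open(f"fio-iodepth-{depth}.log", "w") as out:
        subprocess.run(
            [
                "fio",
                "--randrepeat=1",
                "--ioengine=libaio",
                "--direct=1",
                "--gtod_reduce=1",
                "--name=test",
                "--filename=test",
                "--bs=4k",
                f"--iodepth={depth}",
                "--size=4G",
                "--readwrite=randrw",
                "--rwmixread=90",
            ],
            stdout=out,
            check=True,
        )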

Vvjjkkii renamed this task from RESTBase storage capacity planning to utaaaaaaaa. Jul 1 2018, 1:03 AM
Vvjjkkii removed Eevans as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
JJMC89 renamed this task from utaaaaaaaa to RESTBase storage capacity planning. Jul 1 2018, 1:55 AM
JJMC89 assigned this task to Eevans.
JJMC89 lowered the priority of this task from High to Medium.
JJMC89 updated the task description.
JJMC89 added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2018-07-02T19:25:08Z] <urandom> Bringing down Cassandra for hardware testing, restbase2001 - T197477

Mentioned in SAL (#wikimedia-operations) [2018-07-02T20:53:20Z] <urandom> Bringing down Cassandra for hardware testing, restbase2010 - T197477

Test command
fio --randrepeat=1 \
    --ioengine=libaio \
    --direct=1 \
    --gtod_reduce=1 \
    --name=test \
    --filename=/srv/sda4/fio.test \
    --bs=4k \
    --iodepth=4 \
    --size=4G \
    --readwrite=randrw \
    --rwmixread=90
restbase2001
test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=4
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [m(1)] [100.0% done] [9289KB/932KB/0KB /s] [2322/233/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=26514: Mon Jul  2 19:54:15 2018
  read : io=3684.4MB, bw=7052.7KB/s, iops=1763, runt=534935msec
  write: io=421580KB, bw=807010B/s, iops=197, runt=534935msec
  cpu          : usr=0.94%, sys=3.75%, ctx=717979, majf=0, minf=6
  IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=943181/w=105395/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
   READ: io=3684.4MB, aggrb=7052KB/s, minb=7052KB/s, maxb=7052KB/s, mint=534935msec, maxt=534935msec
  WRITE: io=421580KB, aggrb=788KB/s, minb=788KB/s, maxb=788KB/s, mint=534935msec, maxt=534935msec

Disk stats (read/write):
  sda: ios=943453/106393, merge=129/675, ticks=689976/1430568, in_queue=2120400, util=99.76%
restbase2010
test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=4
fio-2.1.11
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [m(1)] [100.0% done] [104.5MB/11948KB/0KB /s] [26.8K/2987/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=13120: Mon Jul  2 21:01:59 2018
  read : io=3684.4MB, bw=110153KB/s, iops=27538, runt= 34250msec
  write: io=421580KB, bw=12309KB/s, iops=3077, runt= 34250msec
  cpu          : usr=6.00%, sys=28.54%, ctx=673858, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=943181/w=105395/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
   READ: io=3684.4MB, aggrb=110152KB/s, minb=110152KB/s, maxb=110152KB/s, mint=34250msec, maxt=34250msec
   WRITE: io=421580KB, aggrb=12308KB/s, minb=12308KB/s, maxb=12308KB/s, mint=34250msec, maxt=34250msec

Disk stats (read/write):
  sda: ios=941238/105264, merge=0/53, ticks=122504/3672, in_queue=125996, util=99.75%
NOTE: Machines were tested after all Cassandra instances were brought down.
NOTE: Three invocations of fio were issued on each host, with no significant variation in the results (results from the second invocation of each are used above)
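
For a quick sense of scale, the runs above put the Intel device at roughly 15x the Samsung across the board; a trivial check using the numbers fio reported:

Ratio check (sketch)
# Values copied from the fio results above (restbase2001 = Samsung, restbase2010 = Intel).
samsung = {"read_iops": 1763, "write_iops": 197, "read_bw_kb": 7052, "write_bw_kb": 788}
intel = {"read_iops": 27538, "write_iops": 3077, "read_bw_kb": 110153, "write_bw_kb": 12309}

for metric in samsung:
    print(f"{metric}: {intel[metric] / samsung[metric]:.1f}x")
# Each metric comes out at ~15.6x in the Intel's favour.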

Change 462569 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] restbase: increase latency notification thresholds

https://gerrit.wikimedia.org/r/462569

Change 462569 abandoned by Eevans:
restbase: increase latency notification thresholds

Reason:
This changes a notification that isn't (any longer) what we thought it was (and afaict, not something we want to notify for). This is going to require something entirely different (TBD).

https://gerrit.wikimedia.org/r/462569

Eevans updated the task description.
Eevans updated the task description.

@Eevans I would guess this ticket is very much outdated?

I guess it was left open because it has sub-tickets still open (of which T222960: Fix restbase1017's physical rack is still unresolved).