While normalized storage utilization is ~50% at the time of this writing, we are over-utilized on device IO. As a result, we are unable to support new storage use-cases until additional capacity is created.
The primary culprits here are the Samsung SSDs installed in 17 of our 24 hosts. These SSDs (purchased outside the usual channels due to cost concerns) exhibit significantly higher CPU iowait for a given number of IOPS than the Intel and HP disks the foundation typically purchases. This elevated iowait correlates directly with higher Cassandra read latencies; if it weren't for the 7 hosts equipped with well-performing disks, Cassandra's ability to route around poorly performing hosts, and speculative retries, the impact on end-users would be unacceptable.
IO Capacity (read: SSD Performance)
| Device | IOPS (r/w) | Bandwidth (r/w) | Latency (typical) |
|---|---|---|---|
| Samsung | 1763 / 197 | 7052 KB/s / 788 KB/s | 1-40ms |
| Intel | 27538 / 3077 | 110153 KB/s / 12309 KB/s | 200µs |
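The numbers above resemble fio output, though the exact job used to collect them isn't recorded on this page. For reference, a comparable measurement could be re-run on a candidate host with something like the sketch below; the fio job parameters, test-file path, and read/write mix are assumptions rather than the original benchmark settings.

```python
import json
import subprocess

# Hypothetical fio job for comparing devices; all parameters below are
# assumptions, not the settings used to produce the table above.
FIO_CMD = [
    "fio",
    "--name=ssd-compare",
    "--filename=/srv/fio-testfile",  # assumed test path; never point at live data
    "--size=4G",
    "--rw=randrw",
    "--rwmixread=90",       # read-heavy mix, roughly matching the r/w ratios above
    "--bs=4k",
    "--iodepth=32",
    "--direct=1",
    "--ioengine=libaio",
    "--runtime=60",
    "--time_based",
    "--output-format=json",
]

result = subprocess.run(FIO_CMD, capture_output=True, check=True, text=True)
job = json.loads(result.stdout)["jobs"][0]

for direction in ("read", "write"):
    stats = job[direction]
    # Latency key names vary across fio versions (e.g. clat vs. clat_ns).
    lat_ns = stats.get("clat_ns", {}).get("mean")
    lat_ms = lat_ns / 1e6 if lat_ns else float("nan")
    print(f"{direction}: {stats['iops']:.0f} IOPS, {stats['bw']} KB/s, "
          f"mean completion latency {lat_ms:.2f} ms")
```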
Storage Capacity
IO capacity notwithstanding, there is also a finite amount of usable storage space, and a number of (unplanned) use-cases have been proposed. Based on the number of hosts per rack, and the runway needed to support organic growth, the upper bound on utilization (i.e. the point at which we commission no further storage use-cases) is 60% (see Cassandra/CapacityPlanning#Establishing_an_Upper_Bound for more on this).
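For intuition only, the shape of that calculation looks roughly like the sketch below; every input is a made-up assumption (rack layout, compaction headroom, growth runway), and the authoritative derivation remains the one at Cassandra/CapacityPlanning#Establishing_an_Upper_Bound.

```python
# Illustrative sketch only; all inputs are assumptions chosen to show the
# shape of the calculation, not the actual figures behind the 60% bound.
hosts_per_rack = 6                            # assumed rack layout
rebuild_headroom = 1 / (hosts_per_rack - 1)   # space to re-replicate one failed host within its rack
compaction_headroom = 0.10                    # assumed working space for compaction/repair
organic_growth_runway = 0.10                  # assumed growth buffer until new capacity lands

upper_bound = 1.0 - (rebuild_headroom + compaction_headroom + organic_growth_runway)
print(f"Utilization ceiling with these inputs: {upper_bound:.0%}")  # 60% with these made-up inputs
```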
Proposal
It would seem the only remedy for the Samsung SSDs is to replace them. Simply replacing the affected SSDs will restore expected performance and give us some much-needed breathing room, but it won't provide much additional storage capacity for unplanned use-cases. We should therefore capitalize on the effort spent here to increase storage capacity as well, at least by enough to get us through this fiscal year. In some cases it may make more sense to replace the entire host instead (for example, when a lease or warranty is about to expire).
Affected hosts
- {T205092}
- restbase2007
- restbase2008
- restbase1007
- restbase1008
- restbase1009
- restbase1010
- restbase1011
- restbase1012
- restbase1013
- restbase1014
- restbase1015
Update
As an example of the concrete problems this creates, consider the failure of restbase1015 on 2018-10-14. After the recent data-center switchover, the async jobs were kept in eqiad (read: eqiad was handling both live requests and background processing). With this added load, latency became unacceptably high when restbase1015 went down; there wasn't enough headroom to weather the failure of a single host (3 instances).
Source: https://grafana.wikimedia.org/dashboard/snapshot/DjZYaOas2Rcp904crR4PXac5ZyC58JcQ?orgId=1
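For a rough sense of the arithmetic behind that headroom statement, the sketch below models the incident; the per-host instance count comes from this page, while the eqiad host count and the doubled switchover load are illustrative assumptions.

```python
# Back-of-the-envelope view of the 2018-10-14 incident. "3 instances per
# host" comes from this page; the eqiad host count and the assumption that
# load roughly doubled during the switchover are illustrative, not measured.
eqiad_hosts = 12                  # assumed size of the eqiad Cassandra cluster
instances_per_host = 3            # from this page
normal_load = 1.0                 # eqiad's usual share of traffic
switchover_load = 2.0             # live requests + async jobs, roughly double

capacity_lost = 1 / eqiad_hosts                       # one host down removes ~1/N of IO capacity
per_host_load = switchover_load / (eqiad_hosts - 1)   # surviving hosts carry the doubled load
baseline = normal_load / eqiad_hosts
print(f"Capacity lost with one host down: {capacity_lost:.0%}")
print(f"Per-host load during the incident vs. normal: {per_host_load / baseline:.1f}x")
```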