While normalized storage utilization (at the time of this writing) is ~50%, we are over-utilized on device IO. This leaves us in a position where we are no longer able to support new storage use-cases until additional capacity is created.
The primary culprits here are the Samsung SSDs installed in 17 of our 24 hosts. These SSDs (purchased outside the usual channels due to cost concerns) exhibit significantly higher CPU iowait for a given number of IOPS when compared to the Intel and HP disks the foundation typically purchases. This elevated iowait directly correlates with higher Cassandra read latencies; if it weren't for the 7 hosts equipped with well-performing disks, Cassandra's ability to route around poorly performing hosts, and speculative retries, the impact on end-users would be unacceptable.
This ticket will serve to collect the information necessary to inform options moving forward.
When rebuilding/reshaping the cluster for the new storage strategy, we took advantage of Cassandra's improved support for JBOD, and did away with the single RAID-0 we had used previously. We did this to a) avoid the blast radius created by a single disk failure, and b) partition compaction (improved cardinality, concurrency). However, the drawbacks bear highlighting: it is obvious when looking at the [[ https://grafana.wikimedia.org/dashboard/db/cassandra-system | dashboards ]] (IOPS, throughput, and iowait) that spikes in iowait are almost exclusively the result of increased utilization of a single device.
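To make the single-device pattern easier to spot outside the dashboards, here is a minimal sketch of one way to flag a "hot" device on a JBOD host. It is illustrative only: the function name `find_hot_device`, the device names, the sample utilization figures, and the 2x threshold are all assumptions, not values taken from our hosts or tooling.

```python
from statistics import median


def find_hot_device(util_by_device, ratio=2.0):
    """Return the busiest device if its utilization exceeds `ratio` times
    the median utilization of the other devices on the host, else None.

    `util_by_device` maps a device name to percent utilization
    (for example, a %util figure sampled from the OS).
    """
    if len(util_by_device) < 2:
        # Nothing to compare against on a single-device host.
        return None
    hot, hot_util = max(util_by_device.items(), key=lambda kv: kv[1])
    others = [u for d, u in util_by_device.items() if d != hot]
    baseline = median(others)
    if hot_util > ratio * baseline:
        return hot
    return None


# Hypothetical sample: percent utilization per device on one JBOD host.
sample = {"sda": 22.0, "sdb": 25.0, "sdc": 91.0, "sdd": 24.0}
print(find_hot_device(sample))
```

With RAID-0 the same load would have been striped across all devices, so a check like this would rarely fire; with JBOD, a compaction or read hotspot confined to one data directory shows up as exactly this kind of skew.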
Average aggregate IOPS in codfw is ~6300 (all storage devices, all hosts).
Average aggregate IOPS in eqiad is ~4950 (all storage devices, all hosts).