While normalized storage utilization (at the time of this writing) is ~50%, we are over-utilized on device IO. This leaves us in a position where we are no longer able to support additional storage use-cases until additional capacity is created.
The primary culprits here are the Samsung SSDs installed in 17 of our 24 hosts. These SSDs (purchased outside the usual channels due to cost concerns) exhibit significantly higher CPU iowait for a given number of IOPS than the Intel and HP disks the foundation typically purchases. This elevated iowait correlates directly with higher Cassandra read latencies; if it weren't for the 7 hosts equipped with well-performing disks, Cassandra's ability to route around poorly performing hosts, and speculative retries, the impact on end-users would be unacceptable.
This ticket will serve to collect the information necessary to inform options moving forward.
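One way to make the "iowait for a given number of IOPS" comparison concrete while collecting data is to normalize each sample to iowait per 1000 IOPS and average by disk vendor. The following is a minimal sketch under assumptions: the function name `iowait_per_kiops`, the `(vendor, iops, iowait_pct)` sample shape, and the values in `example` are all hypothetical placeholders, not measurements from our hosts.

```python
from collections import defaultdict
from statistics import mean


def iowait_per_kiops(samples):
    """Return the mean iowait percentage per 1000 IOPS for each vendor.

    `samples` is an iterable of (vendor, iops, iowait_pct) tuples; this
    shape is an assumption made for the sketch.
    """
    ratios = defaultdict(list)
    for vendor, iops, iowait_pct in samples:
        if iops <= 0:
            continue  # skip idle samples; the ratio is undefined there
        ratios[vendor].append(iowait_pct / (iops / 1000.0))
    return {vendor: mean(values) for vendor, values in ratios.items()}


# Placeholder values for illustration only; these are NOT real measurements.
example = [
    ("samsung", 2000, 8.0),
    ("samsung", 4000, 16.0),
    ("intel", 2000, 2.0),
    ("intel", 4000, 4.0),
]

result = iowait_per_kiops(example)
```

Feeding real per-host samples (for instance, exported from whatever metrics system is in use) into the same function would give a single per-vendor figure that can be compared directly across the 24 hosts.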