As an example, a recent RAID rebuild elevated iowait on one node, and higher columnfamily read latencies ensued. That much is expected; however, the elevated read latency of this single node translated directly into higher p99 RESTBase latencies. Since two other replicas were serving at normal latency, the ideal behavior would have been to route queries around the latent node.
- Implement nodejs driver support for speculative retries
- Implement a latency aware (wrapper) policy to order TokenAware results by latency, ascending
- Implement a blacklist (wrapper) policy to allow reactively weeding latent nodes out of the pool
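A minimal sketch of how the latency-aware and blacklist wrappers might compose. The Host addresses and the child policy's newQueryPlan shape here are simplified stand-ins, not the actual nodejs driver's load-balancing policy interface (which deals in Host objects and iterators); this only illustrates the ordering and filtering logic.

```javascript
// Hypothetical wrapper policy: orders a child policy's candidate hosts
// by observed latency (ascending) and drops blacklisted hosts.
class LatencyAwareWrapper {
  constructor(childPolicy, blacklist = new Set()) {
    this.childPolicy = childPolicy; // e.g. a TokenAware policy (stand-in here)
    this.blacklist = blacklist;     // host addresses to reactively weed out
    this.latencies = new Map();     // address -> EWMA latency in ms
  }

  // Record an observed response time; an EWMA smooths transient spikes
  // so one slow reply does not immediately reorder the plan.
  recordLatency(address, ms, alpha = 0.2) {
    const prev = this.latencies.get(address);
    this.latencies.set(
      address,
      prev === undefined ? ms : alpha * ms + (1 - alpha) * prev
    );
  }

  // Take the child policy's candidates, filter the blacklist,
  // then sort by observed latency, ascending.
  newQueryPlan(keyspace, query) {
    return this.childPolicy
      .newQueryPlan(keyspace, query)
      .filter((addr) => !this.blacklist.has(addr))
      .sort((a, b) => (this.latencies.get(a) ?? 0) - (this.latencies.get(b) ?? 0));
  }
}

// Usage with a fake token-aware child policy and one blacklisted node:
const tokenAware = {
  newQueryPlan: () => ['10.0.0.1', '10.0.0.2', '10.0.0.3'],
};
const policy = new LatencyAwareWrapper(tokenAware, new Set(['10.0.0.3']));
policy.recordLatency('10.0.0.1', 40); // slow replica (e.g. RAID rebuild)
policy.recordLatency('10.0.0.2', 5);  // healthy replica
console.log(policy.newQueryPlan('ks', 'SELECT ...'));
// -> ['10.0.0.2', '10.0.0.1']
```

With this composition, the latent node either sinks to the back of the query plan or, once blacklisted, drops out of it entirely.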
Another quick-and-dirty reactive approach would be to shut down the CQL (native transport) port on the impacted node, forcing clients to fail over and route around it (it can be re-opened later with nodetool enablebinary).
$ nodetool disablebinary