There are a number of concerns regarding Cassandra cluster performance.
- Since they were provisioned, eqiad nodes 1007, 1008, and 1009 have been outliers on a number of key performance metrics.
- For most of this month, connection timeouts have been trending worse.
- More recently, 1007 has been exhibiting higher-than-usual client read latency
- ...and earlier this morning, 1007 died with an OOM exception.
Note: It is likely that one or more of these are unrelated, so for now, this issue is a catch-all for the general investigation. Additional tickets will be opened as needed.
See also: T116861: Investigate OOM and elevated read latencies on 1007