Investigate recent Cassandra cluster performance issues
Closed, Resolved, Public

Description

There are a number of concerns regarding Cassandra cluster performance.

  • Since provisioned, eqiad nodes 1007, 1008, and 1009 have been outliers on a number of key performance metrics.
  • For most of this month, connection timeouts have been trending worse.
  • More recently, 1007 has been exhibiting higher-than-usual client read latency.
  • Earlier this morning, 1007 died with an OOM exception.

Note: It is likely that one or more of these are unrelated, so for now, this issue is a catch-all for the general investigation. Additional tickets will be opened as needed.

See also: T116861: Investigate OOM and elevated read latencies on 1007

Event Timeline

Eevans claimed this task.
Eevans raised the priority of this task to High.
Eevans updated the task description.
Eevans added a project: RESTBase-Cassandra.
Eevans added a subscriber: Eevans.

T116861 is tracking the recent issues on 1007 in particular.

1007, 1008, and 1009 are outliers in the general sense because of the lower hardware specs on these machines (for example, disk read throughput is higher because there is less memory for page cache).

I believe the remaining issues cited were all addressed by T116861: Investigate OOM and elevated read latencies on 1007.

Closing.