The most frequent request type in the RESTBase Cassandra cluster is RangeSlice, a kind of query our application itself never issues. The query in question, SELECT key FROM system.local, is used as a heartbeat by the NodeJS driver.
This doesn't seem to be hurting anything, but it is quite excessive (it is the most commonly executed query in the cluster), and it obscures performance metrics.
By default, pooling.heartBeatInterval is set to 30000 ms, so any connection that sits idle for 30 seconds generates one of these queries. The driver documentation also states that the connection pool maintains only a single connection to each host in both the local and remote data centers.
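As a sketch of the first proposed solution, the interval can be raised via the client options passed to the Node.js cassandra-driver. The option name matches the driver docs; the 120000 ms value and the contact point are illustrative assumptions, not recommendations:

```javascript
// Hypothetical client options for the Node.js cassandra-driver.
// Idle connections emit `SELECT key FROM system.local` once per
// heartBeatInterval; raising it from the 30000 ms default reduces
// the heartbeat rate proportionally.
const clientOptions = {
  contactPoints: ['cassandra-host'], // placeholder host
  pooling: {
    heartBeatInterval: 120000, // example value: 2 minutes instead of 30 s
  },
};
```

The trade-off is slower detection of dead idle connections, since the heartbeat doubles as a liveness check.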
In RESTBase, we use num_workers: ncpu, for 816 CPU units (and therefore workers) across the cluster. There are 24 hosts running 3 Cassandra instances each, for a total of 72 instances. Since each worker maintains its own connection pool with one connection per instance, there are typically 816 * 72 = 58752 open connections. We see ~3000 client requests per second, and if roughly half of those are RangeSlice heartbeats, then our real client request load is closer to ~1500/s. 58752 connections is far in excess of what is needed to serve ~1500 qps, which explains why so many of them sit idle (and in turn generate the high rate of keep-alives).
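The arithmetic above can be checked with a quick back-of-the-envelope script (figures taken from this report):

```javascript
// Connection-count estimate for the RESTBase -> Cassandra topology.
const workers = 816;                 // num_workers: ncpu, summed cluster-wide
const cassandraInstances = 24 * 3;   // 24 hosts x 3 instances = 72
// One connection per worker per Cassandra instance:
const connections = workers * cassandraInstances; // 58752

// Implied per-connection load at ~1500 real client requests/second:
const qps = 1500;
const qpsPerConnection = qps / connections; // ~0.026 req/s per connection
```

At ~0.026 requests/second per connection, most connections are idle far longer than the 30-second heartbeat interval, which is consistent with the heartbeat dominating query volume.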
Possible Solutions:
- Increase pooling.heartBeatInterval
- Decrease the number of RESTBase workers
- Upgrade the driver to 3.5.0 (hides keep-alives from request metrics)
- ???