
Excessive number of idle Cassandra connections
Open, Low, Public

Description

The most frequent request in the RESTBase Cassandra cluster is RangeSlice, a type of query our application doesn't use. The query in question is a SELECT key FROM system.local, used as a heartbeat by the NodeJS driver.

Screenshot from 2018-03-16 10-47-30.png (811×1 px, 181 KB)

This doesn't seem to be hurting anything, but it's quite excessive (it is the most commonly executed query in the cluster), and obfuscates performance metrics.

By default, pooling.heartBeatInterval is set to 30000ms, so any connection that has been idle for 30 seconds results in one of these queries. The documentation also states that the connection pool maintains only a single connection to each host in the local and remote data-centers.
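For reference, a minimal sketch of how these defaults map onto the Node.js driver's client options; the contact point is a placeholder, and the values simply restate the defaults described above:

```js
const cassandra = require('cassandra-driver');
const { distance } = cassandra.types;

const client = new cassandra.Client({
  contactPoints: ['cassandra-host-1'],  // placeholder, not our actual configuration
  pooling: {
    heartBeatInterval: 30000,           // 30s default: idle connections send SELECT key FROM system.local
    coreConnectionsPerHost: {
      [distance.local]: 1,              // one connection per host in the local DC
      [distance.remote]: 1              // and one per host in remote DCs
    }
  }
});
```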

In RESTBase, we use num_workers: ncpu, and have 816 CPU units total in the cluster. There are 24 hosts running 3 Cassandra instances each, for a total of 72 instances. Since each worker has its own connection pool, there are typically 816 * 72 = 58752 open connections. We see ~3000 client requests per second, and if roughly half of those are RangeSlice heartbeats, our real client request load is closer to ~1500/s. 58752 connections is far in excess of what we need to handle ~1500 qps, which explains why so many connections sit idle (and, in turn, why they generate such a high rate of keep-alives).
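A quick back-of-the-envelope check of those numbers (figures taken from this task, not measured programmatically):

```js
const workers = 816;             // num_workers: ncpu, summed across the cluster
const cassandraInstances = 72;   // 24 hosts x 3 Cassandra instances each
const heartBeatIntervalMs = 30000;

const connections = workers * cassandraInstances;              // 58752
const maxHeartbeatsPerSec = connections * 1000 / heartBeatIntervalMs;

console.log(connections, maxHeartbeatsPerSec);
// 58752 1958.4 -- an upper bound if every connection stayed idle, which is in
// the same ballpark as the ~1500/s RangeSlice rate we actually observe
```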

Possible Solutions:

  • Increase pooling.heartBeatInterval
  • Decrease the number of RESTBase workers
  • Upgrade driver to 3.5.0 (hides keep alives from request metrics)
  • ???

Event Timeline

Eevans updated the task description.

> Upgrade driver to 3.5.0 (hides keep alives from request metrics)

I think it's time to attempt the driver upgrade anyway, but we can't do that without a working dev environment.

mobrovac subscribed.

Agreed that we have to upgrade the driver, but just hiding the metrics does not sound like a solution to me here. Given the number of connections kept open, I think we should explore two possibilities:

  • increase the heart-beat interval (perhaps something like 120s would be enough)
  • try to limit the number of connections kept by the driver: given Cassandra's distributed nature, I am having a hard time understanding why it needs to keep connections to every single instance open

> Agreed that we have to upgrade the driver, but just hiding the metrics does not sound like a solution to me here. Given the number of connections kept open, I think we should explore two possibilities:
>
>   • increase the heart-beat interval (perhaps something like 120s would be enough)

We can do this, provided we establish that a connection won't hang up within that interval. And setting it higher still may be an option.
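As a rough sketch of what that change could look like in the client options (illustrative values only; the socket-level keep-alive is an assumption about how we might guard against silently dropped connections, not a tested configuration):

```js
const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['cassandra-host-1'],  // placeholder
  pooling: {
    heartBeatInterval: 120000           // 120s instead of the 30s default
  },
  socketOptions: {
    keepAlive: true,                    // TCP keep-alive as a backstop for long-idle connections
    keepAliveDelay: 0                   // ms before the first keep-alive probe
  }
});
```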

>   • try to limit the number of connections kept by the driver: given Cassandra's distributed nature, I am having a hard time understanding why it needs to keep connections to every single instance open

Without spelunking too deeply into the NodeJS driver code, I'd say it is to a) intelligently route requests to a coordinator that has the data (and in doing so, eliminate a hop of routing stretch), b) load-balance among the subset of nodes selected in (a), and c) route around failures in doing so (these are the common conventions, anyway). This would all require the full complement of nodes to do correctly.
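For reference, this behaviour corresponds (to the best of my knowledge) to the driver's stock policy composition, sketched below with a placeholder data-center name:

```js
const cassandra = require('cassandra-driver');
const { TokenAwarePolicy, DCAwareRoundRobinPolicy } = cassandra.policies.loadBalancing;

const client = new cassandra.Client({
  contactPoints: ['cassandra-host-1'],  // placeholder
  policies: {
    // (a) prefer coordinators that are replicas for the partition key,
    // (b) round-robin among them, (c) fall back to other local-DC hosts on failure
    loadBalancing: new TokenAwarePolicy(new DCAwareRoundRobinPolicy('local-dc'))
  }
});
```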

I think one assumption these drivers make is that the connection pool is a singleton. Each of these connection objects can handle concurrency (async), and the pools are able to spawn more connection objects as needed. We're scaling the number of workers, and by extension the number of pools, by the number of processors (32-40) per logical server (to scale NodeJS), and then scaling the number of logical servers by the number of hosts in the cluster (which has historically been driven by storage, not request volume). In other words, we have lots of idle Cassandra connections, but I suspect we also have a lot of idle servers/workers as well.

So in the interest of playing Devil's Advocate I'd ask a similar question: Do we need so many servers and workers?

> Without spelunking too deeply into the NodeJS driver code, I'd say it is to a) intelligently route requests to a coordinator that has the data (and in doing so, eliminate a hop of routing stretch), b) load-balance among the subset of nodes selected in (a), and c) route around failures in doing so (these are the common conventions, anyway). This would all require the full complement of nodes to do correctly.

Right, I meant that I don't know why all of these connections are established from the get-go. A better approach IMHO would be to keep the number of connections minimal at start-up and then open new ones as needed. This would mean that the first queries to hit certain nodes would be slower, but I think that's a good trade-off. One could argue that in doing so all of the connections would be established eventually, but given how many connections currently sit idle, I don't think that would happen in practice in our case.

> So in the interest of playing Devil's Advocate I'd ask a similar question: Do we need so many servers and workers?

I looked at the usage on a couple of random nodes in each DC and, on average, workers are using ~3% CPU in eqiad and ~7% CPU in codfw, so we could also try to lower the number of workers to something like 24 per host, which means a grand total of 8 * 12 * 2 == 192 fewer workers overall. That said, we do need to take into account the fact that there are many operations/requests that RESTBase handles that are not Cassandra-related, so we need to be careful there.

> Without spelunking too deeply into the NodeJS driver code, I'd say it is to a) intelligently route requests to a coordinator that has the data (and in doing so, eliminate a hop of routing stretch), b) load-balance among the subset of nodes selected in (a), and c) route around failures in doing so (these are the common conventions, anyway). This would all require the full complement of nodes to do correctly.

> Right, I meant that I don't know why all of these connections are established from the get-go. A better approach IMHO would be to keep the number of connections minimal at start-up and then open new ones as needed. This would mean that the first queries to hit certain nodes would be slower, but I think that's a good trade-off. One could argue that in doing so all of the connections would be established eventually, but given how many connections currently sit idle, I don't think that would happen in practice in our case.

Assuming the load-balancing policies work as advertised, it would (should) establish connections with all of them at some point. We'd be regularly incurring the overhead of connection setup/teardown; I think this would ultimately defeat the optimizations that pooling is meant to provide.

Also, TTBMK, we'd need to create (and test, and maintain) our own, non-standard, policies to accomplish this. There is a cost to be paid in going this route.
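Purely to illustrate the kind of surface area we'd be signing up to own, here is a hypothetical policy (the class name and behaviour are made up, and the localDc property name is assumed from the parent constructor) that overrides getDistance() so the pool never opens connections to remote-DC hosts. This is not the connect-on-demand behaviour discussed above, just an example of what even a small non-standard policy looks like:

```js
const cassandra = require('cassandra-driver');
const { DCAwareRoundRobinPolicy } = cassandra.policies.loadBalancing;
const { distance } = cassandra.types;

// Hypothetical: never pool connections to hosts outside the local data-center.
class LocalOnlyPolicy extends DCAwareRoundRobinPolicy {
  getDistance(host) {
    if (host.datacenter !== this.localDc) {
      return distance.ignored;          // no pooled connections to this host
    }
    return super.getDistance(host);
  }
}

// Usage sketch:
// new cassandra.Client({ contactPoints: ['...'], policies: {
//   loadBalancing: new cassandra.policies.loadBalancing.TokenAwarePolicy(
//     new LocalOnlyPolicy('local-dc')) } });
```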

> So in the interest of playing Devil's Advocate I'd ask a similar question: Do we need so many servers and workers?

> I looked at the usage on a couple of random nodes in each DC and, on average, workers are using ~3% CPU in eqiad and ~7% CPU in codfw, so we could also try to lower the number of workers to something like 24 per host, which means a grand total of 8 * 12 * 2 == 192 fewer workers overall. That said, we do need to take into account the fact that there are many operations/requests that RESTBase handles that are not Cassandra-related, so we need to be careful there.

During the transition to the new strategy, we ended up with only a handful of cluster hosts running RESTBase at all; how did they fare? I assumed at the time that they were OK, but I confess I didn't pay much attention.