In testing, a quorum of Cassandra nodes going down during heavy load causes RESTBase workers to accumulate memory until they reach the configured heap limit and are restarted by the coordinator. Before the limit is reached, they tend to get fairly slow, as GC works hard to make do with the available memory.
We probably need to be more aggressive about timing out and freeing backend connections internally when the Cassandra table storage layer is down. We do currently limit the number of concurrent live connections per worker. This does not help during long backend outages, though: incoming HTTP connections time out, which frees their slots and allows other requests to come in.
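
One possible shape for this, as a rough sketch rather than existing RESTBase code: count in-flight backend calls instead of client connections, reject new calls as soon as that count hits a cap, and give every backend call its own deadline that releases the slot when it fires. The `BackendGate` class, its `run` method, and the limit values are hypothetical names made up for illustration. The sketch assumes a runtime with a global `AbortController` (Node.js 15 or later) and a backend client that actually drops its buffers when the abort signal fires; if the client ignores the signal, memory would still be held until the call settles.

```typescript
// Hypothetical helper, not part of RESTBase: caps in-flight backend calls
// and enforces a per-call deadline so slots are freed during an outage.
class BackendGate {
  private inFlight = 0;

  constructor(
    private readonly maxInFlight: number,
    private readonly timeoutMs: number,
  ) {}

  // Number of backend calls currently holding a slot.
  get pending(): number {
    return this.inFlight;
  }

  async run<T>(call: (signal: AbortSignal) => Promise<T>): Promise<T> {
    // Shed load immediately instead of queueing more work in memory.
    if (this.inFlight >= this.maxInFlight) {
      throw new Error('backend busy');
    }
    this.inFlight++;

    const controller = new AbortController();
    let timer: ReturnType<typeof setTimeout> | undefined;
    const deadline = new Promise<never>((_, reject) => {
      timer = setTimeout(() => {
        // Ask the backend call to give up and release its resources.
        controller.abort();
        reject(new Error('backend timeout'));
      }, this.timeoutMs);
    });

    try {
      // Whichever settles first wins: the backend call or the deadline.
      return await Promise.race([call(controller.signal), deadline]);
    } finally {
      // Always stop the timer and release the slot, on success or failure.
      clearTimeout(timer);
      this.inFlight--;
    }
  }
}
```

A caller would wrap each storage request, for example `gate.run((signal) => queryBackend(request, signal))`, where `queryBackend` stands in for whatever function issues the Cassandra query. The important difference from a per-connection limit is that the slot is tied to the backend call itself, so a client-side HTTP timeout alone does not make room for more work.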