Investigate reducing impact of single-node Cassandra latencies
Open, Low, Public

Description

As an example, a recent RAID rebuild elevated iowait, and higher column-family read latencies ensued. This is not unexpected. However, the increased read latency of this one node translated directly into higher 99p RESTBase latencies. Given that there are two other replicas with normal latency in this situation, the ideal behavior would be to route queries around the slow node.

Ideas:

Edit:

Another idea, for a quick-and-dirty reactive approach, would be to shut down the CQL port on the impacted node (forcing clients to fail over and route around it):

$ nodetool disablebinary
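Should we ever use this, the inverse operations are standard nodetool subcommands as well (shown here without options, as a sketch):

$ nodetool statusbinary    # report whether the native transport (CQL) is running
$ nodetool enablebinary    # re-enable the CQL port once the node has recovered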

Event Timeline

Eevans created this task. · Feb 19 2016, 4:42 PM
Restricted Application added subscribers: StudiesWorld, Aklapper. · Feb 19 2016, 4:42 PM
Eevans updated the task description. · Feb 26 2016, 4:08 PM

Speculative retries sound like the least complicated and most promising option of the bunch. Does this involve much more than a probabilistic timer triggering a retry?

GWicke added a comment (edited). · Mar 14 2016, 2:56 AM

Thinking about this some more, I'm actually wondering if we could implement this in restbase-mod-table-cassandra:

  • Set up a second Cassandra client instance, without the token-aware load balancing policy. We could even consider dropping DC awareness as well, to better take advantage of remote replicas in case local replicas are slow.
  • On each query, start a timer to fire after a few hundred ms (or something like the 95th percentile latency), initiating a second request using the second instance, and possibly a different consistency, such as QUORUM.
  • Cancel that timer & update local latency stats if the original query returns in time. (A rough sketch of this follows below.)
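A minimal sketch of that two-client idea, assuming the DataStax cassandra-driver npm package; the contact points, data centre name, delay, and the speculativeExecute() helper itself are illustrative only, not a concrete proposal for the module's API:

'use strict';

const cassandra = require('cassandra-driver');

// Primary client: token-aware + DC-aware, i.e. the usual configuration.
const primary = new cassandra.Client({
  contactPoints: ['cassandra.example.org'],   // placeholder
  policies: {
    loadBalancing: new cassandra.policies.loadBalancing.TokenAwarePolicy(
      new cassandra.policies.loadBalancing.DCAwareRoundRobinPolicy('local-dc'))
  }
});

// Secondary client: plain round-robin, so the speculative attempt is free to
// land on any node (and, if we drop DC awareness, on remote replicas too).
const secondary = new cassandra.Client({
  contactPoints: ['cassandra.example.org'],   // placeholder
  policies: {
    loadBalancing: new cassandra.policies.loadBalancing.RoundRobinPolicy()
  }
});

// Hypothetical helper: run the query on the primary client; if it has not
// answered within delayMs (e.g. roughly the observed p95), start a second
// attempt on the secondary client at QUORUM. Whichever settles first wins.
function speculativeExecute(query, params, delayMs) {
  const primaryAttempt = primary.execute(query, params, { prepare: true });

  const speculative = new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      secondary.execute(query, params, {
        prepare: true,
        consistency: cassandra.types.consistencies.quorum
      }).then(resolve, reject);
    }, delayMs);
    // If the primary answers (or fails) in time, cancel the speculative attempt.
    primaryAttempt.then(() => clearTimeout(timer), () => clearTimeout(timer));
  });

  return Promise.race([primaryAttempt, speculative]);
}

// Example usage (table and timeout are placeholders):
// speculativeExecute('SELECT v FROM test_ks.test_tbl WHERE k = ?', ['key'], 300)
//   .then(res => console.log(res.rows));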
Eevans updated the task description. · Mar 14 2016, 9:37 PM

> Thinking about this some more, I'm actually wondering if we could implement this in restbase-mod-table-cassandra:
>
>   • Set up a second cassandra client instance, without the token-aware load balancing policy. We could even consider dropping DC awareness as well, to better take advantage of remote replicas in case local replicas are slow.
>   • On each query, start a timer to fire after a few hundred ms (or something like the 95th percentile latency), initiating a second request using the second instance, and possibly a different consistency, such as QUORUM.
>   • Cancel that timer & update local latency stats if the original query returns in time.

I like the idea of speculative retries, but I don't think restbase-mod-table-cassandra is the place to implement it. It adds complexity to our code, complexity that would be best encapsulated in the driver. The driver is in a position to track per-node query latency (needed to implement retries based on a quantile), and already has access to a pool of connections to other nodes. If we do this, I think we should implement support in the driver and push it upstream.

That said, the raison d'être of this issue was a single node with aberrant latency, latency that translated directly into RESTBase's 99p (and this has happened to us before). Something like speculative retries could provide a lower and more stable 99p overall, but for these exceptional cases it probably makes sense to have in our tool set (if you will) the ability to fence off the affected node (which I think nodetool disablebinary would do).

Having looked at the driver code, it wasn't entirely clear to me where this kind of speculative retry logic would best go. In fact, it seems to conflict with existing load balancing and retry policies.

We probably want to do more than disabling token-awareness on retries, and leverage remote replicas more heavily / adjust the consistency level. This, again, might be harder to implement and sell in the driver itself, except if packaged as an optional policy.

> Having looked at the driver code, it wasn't entirely clear to me where this kind of speculative retry logic would best go. In fact, it seems to conflict with existing load balancing and retry policies.
> We probably want to do more than disabling token-awareness on retries, and leverage remote replicas more heavily / adjust the consistency level. This, again, might be harder to implement and sell in the driver itself, except if packaged as an optional policy.

Upstream has an open issue on this (also linked in the description). It's probably best to bring up the "how" there.

FWIW, speculative retries seem to have already been implemented in the Java driver.

Eevans moved this task from Backlog to Next on the Cassandra board. · Aug 15 2016, 8:19 PM
GWicke triaged this task as Normal priority. · Oct 12 2016, 5:58 PM
Eevans moved this task from Next to Backlog on the Cassandra board. · Nov 29 2016, 9:30 PM
Eevans lowered the priority of this task from Normal to Low. · Jul 3 2018, 10:22 PM
Eevans added subscribers: Pchelolo, mobrovac.

@Pchelolo It looks like speculative retries were added in 3.3.0 of the Node.js driver; can you comment on an expected timeline for moving to it (are we moving to it at all)?

This is good news! I guess we will need to investigate the upgrade together with T169009: Cassandra Node.JS driver v3.2.2 issues, as that issue persisted throughout the whole v3.2.x branch of the driver.
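For reference, once on 3.3.0+ the driver-level feature is configured as a client policy, and it only applies to queries explicitly marked idempotent. A minimal sketch, with illustrative delay, host, and table values:

'use strict';

const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['cassandra.example.org'],   // placeholder
  policies: {
    // Launch up to 2 additional executions if no response after 200 ms; both
    // numbers are illustrative and would presumably be tuned to a latency quantile.
    speculativeExecution:
      new cassandra.policies.speculativeExecution.ConstantSpeculativeExecutionPolicy(200, 2)
  }
});

// Speculative executions only kick in for queries marked as idempotent.
client.execute('SELECT v FROM test_ks.test_tbl WHERE k = ?', ['key'], {
  prepare: true,
  isIdempotent: true
}).then(result => console.log(result.rows));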

Pchelolo raised the priority of this task from Low to High. · Jul 4 2018, 7:55 AM
Pchelolo edited projects, added Services (doing); removed Services (later).

Seems like we just need to prioritize the upgrade. I think I will try manually upgrading the driver to the latest on one of the dev machines and running a dump.

> Seems like we just need to prioritize the upgrade. I think I will try manually upgrading the driver to the latest on one of the dev machines and running a dump.

Heads-up: last time I was investigating T169009, everything was fine locally, in beta, and on the dev cluster, but it failed miserably in production. The variable that made the difference at the time was the number of instances.