
Consider disabling read repairs in RESTBase Cassandra cluster
Closed, Resolved · Public

Description

Read repair is a mechanism in Cassandra that uses regular read requests as an opportunity to sync up replicas. When there is a disagreement, the nodes with the older (or missing) information receive a copy of the updated information.

Unfortunately, this has negative effects on lookup and compaction performance. Both lookups and compactions rule out matches in SSTables by looking at the minimum write timestamp of each SSTable. Read-repaired data keeps the original write's timestamp, which means that most new SSTables still contain a few writes with old timestamps. As a result, those SSTables are considered potentially overlapping with older SSTables, which means that

  1. They are consulted in exact-match queries, to rule out tombstones.
  2. Tombstones newer than this minimum timestamp cannot easily be collected (see T150811#2858783).

Disabling read repair would remove these out-of-order writes, which is expected to improve performance and tombstone collection.
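As a toy model of the overlap problem described above (this is illustrative Python, not Cassandra code, and the timestamp values are made up): each SSTable covers a [min, max] write-timestamp range, and a single read-repaired old write drags the newest table's minimum timestamp far into the past, making it overlap much older tables.

```python
def overlaps(a, b):
    """Two (min_ts, max_ts) timestamp ranges overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

# Three flushed SSTables, each covering a recent, disjoint timestamp window.
old = (1000, 1999)
mid = (2000, 2999)
new = (3000, 3999)
assert not overlaps(new, old)

# A read repair re-writes a value that originally carried timestamp 1500.
# The repaired copy keeps the ORIGINAL write timestamp, so the newest
# SSTable's minimum timestamp drops to 1500.
new_with_repair = (1500, 3999)

# Now the newest SSTable overlaps both older ones: exact-match queries must
# consult it to rule out tombstones, and tombstones newer than its minimum
# timestamp cannot easily be dropped during compaction.
assert overlaps(new_with_repair, old)
assert overlaps(new_with_repair, mid)
```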

Disadvantages

The main disadvantage of disabling read repair is that inconsistencies introduced during operational incidents that outlast the hint window (2 hours) are no longer gradually cleaned up as part of normal reads.

However, arguably relying on read repairs for actual sync-up is a poor choice in any case. After an outage longer than the hint window, only a full repair or bootstrap can reliably bring the entire dataset back in sync. That said, we have not been able to successfully complete a full Cassandra repair on the full RESTBase dataset, so read repairs can offer a small improvement over the status quo of no repairs at all.

Event Timeline

GWicke renamed this task from "Consider disabling read repairs" to "Consider disabling read repairs in RESTBase Cassandra cluster". Dec 13 2016, 8:47 PM
GWicke created this task.

> ...so read repairs can offer a small improvement over the status quo of no repairs at all.

How are you quantifying this? How small is small?

>> ...so read repairs can offer a small improvement over the status quo of no repairs at all.
> How are you quantifying this? How small is small?

As far as I know, we have metrics for the read repair rate. Do you mind checking those?

>>> ...so read repairs can offer a small improvement over the status quo of no repairs at all.
>> How are you quantifying this? How small is small?
> As far as I know, we have metrics for the read repair rate. Do you mind checking those?

I think you may have misunderstood. I meant: for any given value(s), where is the cutoff for small (or large, or whatever)? How are you quantifying "small"? I'm not saying you're wrong to categorize it as such; I'm just trying to understand.

I personally rate the benefit as "small" considering the limited coverage, but reasonable people can disagree on such ratings. I don't think anybody would consider read repairs to be as reliable and complete as regular full (or incremental) repairs. But at the same time, until we have those, there is clearly a non-zero benefit in read repairs.

Does this outweigh other potential benefits of disabling read repairs like less manual time spent on maintaining this, better performance, or lower disk usage? I personally doubt it, but this task is the place to debate this.

OK, so to be clear: it is technically not possible to disable read repair entirely. Read repair will run whenever digest mismatches occur, and there isn't any knob to turn this off. The read_repair_chance parameter only adds a probabilistic check of the digests of replicas not involved in achieving your consistency level.

You might be able to implicitly disable RR if you set read_repair_chance to zero and only ever queried at CL = ONE. This would probably run afoul of speculative retries, though, and require that they be disabled as well (more investigation would be needed).

In production, we query at localQuorum (except secondary index updates, which use localOne).

> OK, so to be clear. It is technically not possible to disable read repair; read repair will run whenever digest mismatches occur, and there isn't any knob to turn this off. The read_repair_chance parameter establishes a probabilistic check of the digests of replicas not involved in achieving your consistency level.
>
> You might be able to implicitly disable RR if you set read_repair_chance to zero, and only ever queried at CL = ONE. This would probably run afoul of speculative retries, though, and require that they be disabled as well (more investigation would be needed).
>
> In production, we query at localQuorum (except secondary index updates, which use localOne).

To add a little more information to this:

[Screenshot from 2016-12-14 10-36-28.png (379 KB): graph of blocking vs. background read repair rates]

The blocking metric counts all read repairs that occur because of a digest mismatch; for us that is about 2/s per node.

The background metric counts read repairs that were initiated by chance (read_repair_chance).

Changing the consistency level is not an option in general.
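For reference, the same blocking/background counters also surface per node in `nodetool netstats` output, under "Read Repair Statistics". A small sketch that pulls them out of captured output; the sample numbers below are made up for illustration, not real cluster data.

```python
import re

# Illustrative excerpt of `nodetool netstats` output (invented numbers).
sample = """\
Read Repair Statistics:
Attempted: 418292
Mismatch (Blocking): 17025
Mismatch (Background): 104
"""

def read_repair_stats(text):
    """Extract the read repair counters from netstats-style output."""
    stats = {}
    for key in ("Attempted", "Mismatch (Blocking)", "Mismatch (Background)"):
        m = re.search(re.escape(key) + r":\s*(\d+)", text)
        if m:
            stats[key] = int(m.group(1))
    return stats

stats = read_repair_stats(sample)
# Blocking repairs (digest mismatches on quorum reads) dominate here,
# matching the observation that chance-based background repairs are rare.
print(stats["Mismatch (Blocking)"])  # → 17025
```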

Since Cassandra does not currently allow disabling read repairs, it appears that the best we can do to prevent grossly out-of-order data from being mixed into SSTables on an ongoing basis is to get full repairs working.

In any case, it probably would not hurt to set read_repair_chance to zero, and see if it does make any difference to out-of-order writes. It does not seem likely, but it's easy enough to verify.
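For the read_repair_chance experiment, the change is a one-line schema alteration per table. A sketch only: `ks.tbl` below is a placeholder, not a real RESTBase table name.

```sql
-- ks.tbl is a placeholder table. Setting both chances to zero disables only
-- the *background* (chance-based) read repairs; blocking repairs triggered
-- by digest mismatches on quorum reads still occur.
ALTER TABLE ks.tbl
  WITH read_repair_chance = 0.0
   AND dclocal_read_repair_chance = 0.0;
```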

@Eevans: Could you determine whether there is something we can do about repairs, and then either close this task, or clearly spell out the actionable part?

> @Eevans: Could you determine whether there is something we can do about repairs, and then either close this task, or clearly spell out the actionable part?

With regard to regular repairs, I think T113805: Establish a strategy for regular anti-entropy repairs remains state-of-the-art. TL;DR: it will boil down to an issue of resources (CPU and disk IO); some amount of repair is possible, but it's dubious we'll have the resources for everything. That picture may change in the face of some of the proposed storage changes (T120171), but either way, we're going to need to actually try some of the more serious tables and go from there. Of course, that ticket is stalled while we've focused our attention elsewhere, and I suspect current priorities continue to support that, but let me know if you think I'm mistaken.

All of that said, the only significant source of read repairs (and hence, out-of-order writes) is blocking read repairs (i.e. those resulting from quorum reads), which cannot be eliminated, so I think this issue can be marked resolved.