Establish a strategy for regular anti-entropy repairs
Open, Medium, Public

Description

Hinted hand-off and read-repair are optimizations for converging on consistency, but cannot be considered a 100% solution, making regular anti-entropy repairs a necessity.

Establish a mechanism for regular automated invocations of repairs, as well as a documented process detailing when and how to manually execute repairs in the wake of outages.

NOTE: Previous attempts have made it fairly obvious that full repairs aren't tractable (take too long), so incremental repairs will almost certainly be needed.

local_group_wiktionary_T_parsoid_html test repairs

In the interest of testing, numerous incremental repairs have been performed over the past weeks against the local_group_wiktionary_T_parsoid_html.data table (both DC-local and cluster-wide) from a node in codfw. Repair intervals have ranged from continuous (one after another) to as much as 2 days apart (though most were performed on a daily cadence). Elapsed times for these repairs varied wildly, but the table below provides an approximation of what can be expected for this keyspace.

| Type | Timing | Comments |
| DC parallel | 10-21 hours | |
| Parallel | 3-8 hours | |
| Local | 15-90 minutes (?) | |
| Job threads = 2 | 1-2 hours | Noticeable impact (iowait, disk throughput, latency) |

NOTE: The wide variation in timings seems to hinge largely upon available compaction throughput. Validation and anti-compaction are both constrained by the number of compactors and the configured throughput limit, both of which are instance-wide resources.
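
For reference, a minimal sketch of how the repair types in the table might map onto `nodetool repair` invocations. The mapping itself is an assumption (the task does not record the exact commands used), as is the use of a Cassandra version (2.2 or later) where incremental, parallel repair is the default; which flag combinations are accepted depends on the version. `build_repair_command` is a hypothetical helper that only builds the argument list and executes nothing.

```python
def build_repair_command(keyspace, table, mode="parallel",
                         job_threads=1, primary_range=True):
    """Return an argv list for `nodetool repair`; nothing is executed."""
    # Assumed mapping from the table's "Type" column to nodetool flags.
    scope_flags = {
        "parallel": [],             # all replicas in all DCs at once
        "dc_parallel": ["-dcpar"],  # DCs in parallel, replicas within a DC in turn
        "local": ["-local"],        # replicas in the local data center only
    }
    if mode not in scope_flags:
        raise ValueError(f"unknown repair mode: {mode!r}")
    if not 1 <= job_threads <= 4:   # nodetool documents a maximum of 4
        raise ValueError("job_threads must be between 1 and 4")

    argv = ["nodetool", "repair"]
    if primary_range:
        argv.append("-pr")          # repair only this node's primary ranges
    argv += scope_flags[mode]
    if job_threads > 1:
        argv += ["-j", str(job_threads)]
    argv += [keyspace, table]
    return argv


print(" ".join(build_repair_command(
    "local_group_wiktionary_T_parsoid_html", "data",
    mode="local", job_threads=2)))
```

Keeping the command construction separate from execution makes it easy to log or review exactly what would be run on each node before anything touches the cluster.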

Observations

With incremental repairs, the relative size of a keyspace will matter less than the rate of change (i.e. the amount of unrepaired data that has accumulated since the last repair); local_group_wiktionary_T_parsoid_html is a small table (~1/10th the size of local_group_wikipedia_T_parsoid_html), but more importantly, it has a relatively low rate of change. This makes it difficult to reason about repair times for larger/busier tables without more testing (though it seems obvious that they will take considerably longer).

Given our 3-rack/3-replica per data-center symmetry, we could repair the entire cluster by repairing all of the nodes within a single rack. These repairs would have to be performed sequentially, as you cannot issue concurrent repairs of the same table. At present, we have 9 nodes to a rack, so working backward from there, we would need to complete each node in 2h40m (24 hours / 9 nodes) to achieve a daily cadence (2 hours once the upcoming expansion increases the per-rack node count to 12). This seems unlikely, even for our lower-traffic tables. Longer intervals may be possible, but would require testing for higher-traffic tables (since more writes will also increase the amount of repair work needed).
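
The per-node time budget follows from dividing the cadence evenly across the nodes of one rack. A small sketch of that arithmetic, assuming strictly sequential repairs with no gaps between nodes (`per_node_budget` and `fmt` are hypothetical helper names):

```python
from datetime import timedelta


def per_node_budget(nodes_per_rack, cadence=timedelta(days=1)):
    """Time available per node if one rack is repaired sequentially per cadence."""
    if nodes_per_rack < 1:
        raise ValueError("nodes_per_rack must be at least 1")
    return cadence / nodes_per_rack


def fmt(delta):
    """Format a timedelta as e.g. '2h40m'."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"


print(fmt(per_node_budget(9)))                       # 2h40m (current rack size)
print(fmt(per_node_budget(12)))                      # 2h00m (after the expansion)
print(fmt(per_node_budget(12, timedelta(weeks=1))))  # 14h00m (weekly cadence)
```

The weekly figure is included only to illustrate how much a longer interval relaxes the budget; it ignores the extra unrepaired data that accumulates over a longer interval.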

Given the constraints mentioned above, repairs would need to be carefully orchestrated; cron jobs would almost certainly be unsuitable for this.
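
To illustrate what "carefully orchestrated" might mean at a minimum, here is a sketch of a sequential runner: it repairs one node at a time, records how long each took, and stops at the first failure so that two repairs of the same table never overlap. `repair_rack_sequentially` and the `run_repair` callback are hypothetical; how a repair is actually triggered on a node (SSH, JMX, a tool such as reaper) is deliberately left out.

```python
import time


def repair_rack_sequentially(nodes, run_repair, clock=time.monotonic):
    """Call run_repair(node) for each node in order, one at a time.

    run_repair must block until the repair finishes and return True on
    success. Returns a list of (node, succeeded, elapsed_seconds) tuples
    and stops after the first failure.
    """
    results = []
    for node in nodes:
        started = clock()
        succeeded = run_repair(node)
        results.append((node, succeeded, clock() - started))
        if not succeeded:
            break  # do not move on while the state of this node is unknown
    return results


if __name__ == "__main__":
    # Stand-in callback; the second and third host names are made up.
    rack = ["restbase2001.codfw.wmnet", "node-b.example", "node-c.example"]
    for node, ok, elapsed in repair_rack_sequentially(rack, lambda n: True):
        print(f"{node}: {'ok' if ok else 'FAILED'} in {elapsed:.1f}s")
```

The recorded elapsed times are the data needed to judge whether a rack can be completed within the target interval.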

Moving forward

I propose the following:

  1. Cease repairs of wiktionary_T_parsoid_html, and mark SSTables as unrepaired
  2. Begin w/ incremental repairs of wikipedia_T_parsoid_html:
    1. Mark SSTables as repaired
    2. Begin w/ manually initiated, incremental, sequential, full-DC repairs from one rack
    3. Evaluate orchestration needs based on manually initiated repairs
    4. If repairs prove tractable (complete within a reasonable interval), orchestrate automatically initiated repairs
  3. Based on capacity, revisit #2 above with another table
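
The "mark SSTables as repaired/unrepaired" steps above could be carried out with Cassandra's `sstablerepairedset` tool, which is meant to be run while the node is stopped. A sketch that only builds the command line, assuming the table's data files sit directly in one directory and follow the usual `*-Data.db` naming (`build_repairedset_command` and the example path are hypothetical):

```python
from pathlib import Path


def build_repairedset_command(table_dir, repaired):
    """Return an argv list for sstablerepairedset; nothing is executed."""
    # Non-recursive glob, so snapshots/ and backups/ subdirectories are skipped.
    data_files = sorted(str(p) for p in Path(table_dir).glob("*-Data.db"))
    if not data_files:
        raise ValueError(f"no SSTable data files found in {table_dir}")
    flag = "--is-repaired" if repaired else "--is-unrepaired"
    return ["sstablerepairedset", "--really-set", flag, *data_files]


if __name__ == "__main__":
    # Hypothetical data directory; adjust to the instance's actual layout.
    example_dir = "/srv/cassandra/data/local_group_wiktionary_T_parsoid_html/data"
    try:
        print(" ".join(build_repairedset_command(example_dir, repaired=False)))
    except ValueError as err:
        print(err)
```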

Rationale

Assuming we can only repair a subset of all tables, wiktionary_T_parsoid_html isn't the lowest hanging fruit, and so we should not expend resources repairing it while we evaluate the feasibility of repairing higher traffic tables. If we are not to continue repairs, marking the files unrepaired will restore the table to a single compaction pool.

The proposal to begin repairs of wikipedia_T_parsoid_html by marking all of the SSTables as repaired is based on the belief that completing a full repair first may be intractable. The downside is that, in the absence of a full repair upfront, missing data will not be excluded, and read-repairs will still generate some out-of-order writes.

I suspect that it is unrealistic to think that we can orchestrate the sequencing of repairs with simple cron jobs, but it would not hurt to first see what we are dealing with before pulling something like reaper into the equation.


See also:
https://github.com/spotify/cassandra-reaper

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke added a project: RESTBase.
GWicke added a subscriber: GWicke.
Restricted Application added a subscriber: Aklapper. · Sep 25 2015, 9:28 PM
GWicke updated the task description. · Sep 25 2015, 9:29 PM
GWicke set Security to None.
GWicke updated the task description.
GWicke updated the task description. · Sep 25 2015, 9:32 PM
GWicke edited subscribers, added: Eevans; removed: Aklapper.
GWicke updated the task description. · Sep 28 2015, 3:39 PM
GWicke updated the task description. · Sep 28 2015, 3:43 PM
GWicke updated the task description.
Eevans triaged this task as Medium priority. · Sep 20 2016, 8:36 PM
Eevans renamed this task from Figure out a cross-DC repair strategy to Establish a strategy for regular anti-entropy repairs. · Sep 20 2016, 8:39 PM
Eevans edited projects, added Services; removed RESTBase.
Eevans updated the task description. · Sep 20 2016, 8:59 PM

What are the concrete steps to move towards regularly scheduled incremental repairs in prod?

  • What needs to be tested in staging?
  • Which infrastructure is needed to kick those repairs off at regular intervals?

GWicke added a comment. · Edited · Oct 7 2016, 7:02 PM

We discussed this on IRC, and agreed that the main question around repairs will be load / performance impact. If the impact of incremental repair runs is low & those repairs finish quickly, then independent cron jobs with randomized timing would be enough. If absolutely sequential execution is required, then we'll need a centrally coordinated setup. @Eevans is collecting data on this in staging, and as a next step we can do a similar test with actual data sizes in codfw once TWCS is rolled out.
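
As a rough illustration of the "independent cron jobs with randomized timing" option, the start time could be derived from a hash of the host name, so that each node gets a stable but effectively random slot within the day. This is only a sketch: `splay_minutes`, `cron_line` and the wrapper script path are hypothetical, and nothing here prevents two repairs from overlapping if one runs long, which is exactly the case that would call for central coordination.

```python
import hashlib


def splay_minutes(hostname, window_minutes=24 * 60):
    """Map a host name to a stable pseudo-random minute offset in the window."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_minutes


def cron_line(hostname, command):
    """Return a daily crontab entry (minute hour dom month dow command)."""
    offset = splay_minutes(hostname)
    return f"{offset % 60} {offset // 60} * * * {command}"


# The wrapper script path is made up for this example.
print(cron_line("restbase2001.codfw.wmnet", "/usr/local/bin/run-incremental-repair"))
```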

Eevans moved this task from Backlog to In-Progress on the Cassandra board. · Oct 20 2016, 5:41 PM
Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-operations) [2016-10-20T17:42:56Z] <urandom> T133395, T113805: Starting a primary-range, incremental repair of local_group_wiktionary_T_parsoid_html.data on restbase2001.codfw.wmnet

Eevans updated the task description. · Nov 7 2016, 11:11 PM
Eevans added a comment. · Nov 8 2016, 7:50 PM
This comment was removed by Eevans.
Eevans updated the task description. · Nov 10 2016, 3:57 PM
Eevans moved this task from In-Progress to Blocked on the Cassandra board. · Dec 21 2016, 5:33 PM
GWicke moved this task from Backlog to later on the Services board. · Jul 11 2017, 10:23 PM
GWicke edited projects, added Services (later); removed Services.