Hinted hand-off and read-repair are optimizations for converging on consistency, but cannot be considered a 100% solution, making regular anti-entropy repairs a necessity.
Establish a mechanism for regular automated invocations of repairs, as well as a documented process detailing when and how to manually execute repairs in the wake of outages.
local_group_wiktionary_T_parsoid_html test repairs
In the interest of testing, numerous incremental repairs have been performed in the past weeks against the local_group_wiktionary_T_parsoid_html.data table (both DC-local and cluster-wide) from a node in codfw. Repair intervals have ranged from continuous (one after another), to being spaced apart by as much as 2 days (though most where performed on a daily cadence). Elapsed times for these repairs varied wlldly, but the table below provides an approximation of what can be expected for this keyspace.
Type | Timing | Comments |
---|---|---|
DC parallel | 10-21 hours | |
Parallel | 3-8 hours | |
Local | 15-90 minutes (?) | |
Job threads = 2 | 1-2 hours | Noticeable impact (iowait, disk throughput, latency) |
Observations
With incremental repairs, the relative size of a keyspace will matter less than the rate of change (i.e. the amount of unrepaired data that has accumulated since the last repair); local_group_wiktionary_T_parsoid_html is a small table (~1/10th the size of local_group_wikipedia_T_parsoid_html), but more importantly, it has a relatively low rate of change. This makes it difficult to reason about repair times for a larger/busier tables without more testing (though it seems obvious that they will take considerably longer).
Given our 3-rack/3-replica per data-center symmetry, we could repair the entire cluster by repairing all of the nodes within a single rack. These repairs would have to be performed sequentially, as you can not issue concurrent repairs of the same table. At present, we have 9 nodes to a rack, so working backward from there, we would need to be able to complete each node in 2h20m to achieve a daily cadence, (2 hours after the upcoming expansion increases the per rack node count to 12). This seems unlikely, even for our lower traffic tables. Longer intervals may be possible, but would require testing for higher traffic tables (since more writes will also increase the amount of repair work needed).
Given the constraints mentioned above, repairs would need to be carefully orchestrated; Cron jobs almost certainly be unsuitable for this.
Moving forward
I propose the following:
- Cease repairs of wiktionary_T_parsoid_html, and mark SSTables as unrepaired
- Begin w/ incremental repairs of wikipedia_T_parsoid_html:
- Mark SSTables as repaired
- Begin w/ manually initiated, incremental, sequential, full-DC repairs from one rack
- Evaluate orchestration needs based on manually initiated repairs
- If repairs prove tractable (comple within a reasonable interval), orchestrate automatically initiated repairs
- Based on capacity, revisit #2 above with another table
Rationale
Assuming we can only repair a subset of all tables, wiktionary_T_parsoid_html isn't the lowest hanging fruit, and so we should not expend resources repairing it while we evaluate the feasibility of repairing higher traffic tables. If we are not to continue repairs, marking the files unrepaired will restore the table to a single compaction pool.
The proposal to begin repairs of wikipedia_T_parsoid_html by marking all of the tables as repaired, is based on the belief that completing a full repair first may be intractable. The downside to this, is that in the absence of a full repair upfront, missing data will not be excluded, and read-repairs will still generate some out-of-order writes.
I suspect that it is unrealistic to think that we can orchestrate the sequencing of repairs with simple cron jobs, but it would not hurt to first see what we are dealing with before pulling something like reaper into the equation.