Hinted handoff and read repair are optimizations that help replicas converge, but neither guarantees full consistency on its own, making regular anti-entropy repairs a necessity.
Establish a mechanism for regular automated invocations of repairs, as well as a documented process detailing when and how to manually execute repairs in the wake of outages.
NOTE: [[ https://phabricator.wikimedia.org/T108611 | Previous attempts ]] have made it fairly obvious that full repairs aren't tractable (take too long), so incremental repairs will almost certainly be needed.
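For reference, a minimal sketch of what a single automated invocation might look like, wrapped in Python so an external scheduler can drive it and capture failures. The `-inc`/`-par` nodetool flags reflect Cassandra 2.x tooling and, like the rest of this sketch, should be treated as assumptions rather than a finished design:

```lang=python
#!/usr/bin/env python
"""Minimal sketch: run one incremental repair of a single table on the local node.

Assumes Cassandra 2.x-era nodetool flags (-inc for incremental, -par for
parallel); adjust for the version actually deployed.
"""
import subprocess
import sys

KEYSPACE = 'local_group_wiktionary_T_parsoid_html'
TABLE = 'data'


def repair(keyspace, table):
    # check_call blocks until nodetool exits and raises CalledProcessError on a
    # non-zero exit, so a failed repair surfaces to whatever scheduled this run.
    subprocess.check_call(['nodetool', 'repair', '-inc', '-par', keyspace, table])


if __name__ == '__main__':
    repair(KEYSPACE, TABLE)
    sys.exit(0)
```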
== local_group_wiktionary_T_parsoid_html test repairs ==
In the interest of testing, numerous incremental repairs have been performed over the past several weeks against the local_group_wiktionary_T_parsoid_html.data table (both DC-local and cluster-wide) from a node in codfw. Repair intervals ranged from continuous (each started as soon as the previous finished) to as much as 2 days apart, though most were performed on a daily cadence. Elapsed times for these repairs varied //wildly//, but the table below provides an approximation of what can be expected for this keyspace.
| Repair type | Elapsed time | Comments |
| DC parallel | 10-21 hours | |
| Parallel | 3-8 hours | |
| Local | 15-90 minutes (?) | |
| Job threads = 2 | 1-2 hours | Noticeable impact (iowait, disk throughput, latency) |
NOTE: The wide variation in timings seems to hinge largely upon available compaction throughput. Validation and anti-compaction are both constrained by the number of compactors and the configured throughput limit, both of which are instance-wide resources.
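Since validation and anti-compaction compete for the same instance-wide throughput limit, it may be worth lifting the throttle for the duration of a repair window. A hedged sketch, assuming the standard `nodetool getcompactionthroughput` / `setcompactionthroughput` commands and that a value of `0` means unthrottled:

```lang=python
#!/usr/bin/env python
"""Sketch: lift the compaction throttle around a repair window, then restore it.

Assumes standard nodetool commands; 0 disables throttling, and 16 MB/s stands
in for whatever default cassandra.yaml actually configures.
"""
import subprocess

KEYSPACE = 'local_group_wiktionary_T_parsoid_html'
TABLE = 'data'


def set_throughput(mb_per_sec):
    subprocess.check_call(['nodetool', 'setcompactionthroughput', str(mb_per_sec)])


if __name__ == '__main__':
    print(subprocess.check_output(['nodetool', 'getcompactionthroughput']).decode().strip())
    set_throughput(0)   # assumption: 0 == unthrottled for the repair window
    try:
        subprocess.check_call(['nodetool', 'repair', '-inc', '-par', KEYSPACE, TABLE])
    finally:
        set_throughput(16)  # restore a value close to the configured default
```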
=== Observations ===
With incremental repairs, the relative size of a keyspace matters less than its rate of change (i.e. the amount of unrepaired data that has accumulated since the last repair); local_group_wiktionary_T_parsoid_html is a small table (~1/10th the size of local_group_wikipedia_T_parsoid_html), but more importantly, it has a relatively low rate of change. This makes it difficult to reason about repair times for larger/busier tables without more testing (though it seems obvious that they will take considerably longer).
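One way to get a feel for the rate of change ahead of wider testing would be to measure how much on-disk data is still marked unrepaired. A rough sketch, assuming direct access to a node's data directory and the `sstablemetadata` tool (the paths and directory layout below are assumptions):

```lang=python
#!/usr/bin/env python
"""Rough estimate of unrepaired data for one table on one node.

sstablemetadata reports "Repaired at: 0" for SSTables that no incremental
repair has yet marked repaired; summing their sizes approximates the backlog.
The data directory path and table directory glob are assumptions.
"""
import glob
import os
import subprocess

DATA_DIR = '/var/lib/cassandra/data'    # assumption: instance data directory
KEYSPACE = 'local_group_wiktionary_T_parsoid_html'
TABLE_GLOB = 'data-*'                   # assumption: table directory naming

unrepaired = total = 0
for sstable in glob.glob(os.path.join(DATA_DIR, KEYSPACE, TABLE_GLOB, '*-Data.db')):
    size = os.path.getsize(sstable)
    total += size
    out = subprocess.check_output(['sstablemetadata', sstable]).decode()
    for line in out.splitlines():
        if line.strip().startswith('Repaired at:'):
            if line.split(':', 1)[1].strip() == '0':
                unrepaired += size
            break

print('unrepaired: %.1f GiB of %.1f GiB' % (unrepaired / 2.0 ** 30, total / 2.0 ** 30))
```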
Given our 3-rack/3-replica per-data-center symmetry, we could repair the entire cluster by repairing all of the nodes within a single rack. These repairs would have to be performed sequentially, since concurrent repairs of the same table cannot be issued. At present we have 9 nodes to a rack, so working backward from a daily cadence, each node would need to complete in roughly 2 hours 40 minutes (2 hours once the upcoming expansion increases the per-rack node count to 12). This seems unlikely, even for our lower-traffic tables. Longer intervals may be possible, but would require testing against higher-traffic tables (since more writes also increase the amount of repair work needed).
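The per-node budget above is simply the cadence divided by the rack size; a quick sanity check of that arithmetic:

```lang=python
#!/usr/bin/env python
"""Back-of-the-envelope per-node repair budget for a rack-at-a-time schedule."""


def per_node_budget_hours(nodes_per_rack, cadence_hours=24):
    # One rack holds a replica of every range (3 racks, RF=3 per DC), and repairs
    # of the same table must run one node at a time, so the cadence is divided
    # evenly across the rack's nodes.
    return float(cadence_hours) / nodes_per_rack


for nodes in (9, 12):
    hours = per_node_budget_hours(nodes)
    print('%2d nodes/rack -> %dh%02dm per node' % (nodes, int(hours), round((hours % 1) * 60)))
# Output: 9 nodes/rack -> 2h40m per node; 12 nodes/rack -> 2h00m per node
```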
Given the constraints mentioned above, repairs would need to be carefully orchestrated; cron jobs would almost certainly be unsuitable for this.
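To make the orchestration requirement concrete, here is a hedged sketch of the sequencing such a tool would need (and which independent per-node cron entries cannot provide): strict one-node-at-a-time ordering within a rack, no overlapping runs, and a hard stop on failure. The host names, the use of ssh, and the nodetool flags are all illustrative assumptions:

```lang=python
#!/usr/bin/env python
"""Sketch of rack-at-a-time repair sequencing (hypothetical hosts and transport).

Repairs the nodes of one rack strictly in order, blocking on each, and aborts
on the first failure so that a human can intervene.
"""
import subprocess
import sys

# Hypothetical host list covering a single rack.
RACK_NODES = ['node-a.example.org', 'node-b.example.org', 'node-c.example.org']
KEYSPACE = 'local_group_wiktionary_T_parsoid_html'
TABLE = 'data'


def repair_node(host):
    # Run the repair on the remote node and block until nodetool exits; blocking
    # is what keeps repairs of this table strictly sequential.
    return subprocess.call(['ssh', host, 'nodetool', 'repair', '-inc', '-par', KEYSPACE, TABLE])


def main():
    for host in RACK_NODES:
        rc = repair_node(host)
        if rc != 0:
            # Stop rather than continue: a failed repair needs investigation, and
            # overlapping repairs of the same table are not allowed.
            sys.stderr.write('repair failed on %s (exit %d)\n' % (host, rc))
            return rc
    return 0


if __name__ == '__main__':
    sys.exit(main())
```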