Back in autumn 2015 we were struggling with compactions falling behind on large instances. At the time, we were using leveled compaction and a single instance per hardware node. To keep up, we raised both the compaction throughput limit and compaction parallelism above their default settings.
We have since made several changes that reduce the need for such high compaction throughput:
- We moved from LCS (leveled compaction) to DTCS (date-tiered compaction), reducing the overall compaction activity.
- We moved from one instance to 2-3 instances per hardware node.
We are also seeing relatively high bursts of IO activity, leading to high iowait. This might be contributing to relatively high p99 latencies (see T140286).
Proposal: Reduce compaction throughput limit in line with instance count & strategy change
Our goal is to keep the impact of background compaction activity to a minimum, while also ensuring timely compaction without backlogs. Our current compaction throughput is significantly higher than what should be needed in a multi-instance DTCS setup. To remedy this, I am proposing to gradually lower compaction throughput limits:
- Start with 1/3 of the current value, in line with instance count.
- From there, continue reducing the limit until the outstanding compactions metrics begin to show minor movement.
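
The two steps above can be sketched as a small loop. This is only an illustration of the proposed procedure, not a tool we have: the function name, the `step_factor` and `pending_tolerance` parameters, and the `pending_compactions` callback are all hypothetical, and the definition of "minor movement" is assumed here to be "the outstanding compactions reading exceeds a tolerance". In practice each limit would be applied per instance (for example with `nodetool setcompactionthroughput`) and observed for a while before the next step.

```python
from typing import Callable, List


def lower_throughput_limit(
    current_mb_per_sec: int,
    instances_per_node: int,
    pending_compactions: Callable[[int], float],
    pending_tolerance: float,
    step_factor: float = 0.8,
    floor_mb_per_sec: int = 1,
) -> List[int]:
    """Return the sequence of limits applied, in order.

    pending_compactions(limit) is a hypothetical callback that applies
    the limit, waits, and returns the observed outstanding compactions.
    """
    # Step 1: scale the current limit down by the instance count.
    limit = max(floor_mb_per_sec, current_mb_per_sec // instances_per_node)
    applied = [limit]

    # Step 2: keep lowering until the metric moves past the tolerance.
    while limit > floor_mb_per_sec and pending_compactions(limit) <= pending_tolerance:
        next_limit = max(floor_mb_per_sec, int(limit * step_factor))
        if next_limit == limit:
            break  # step_factor no longer reduces the limit
        limit = next_limit
        applied.append(limit)

    return applied
```

The last entry in the returned list is the first limit at which the metric moved (or the floor). Whether to stay there or step back up one notch is a judgment call based on the dashboards.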