
Throttle compaction throughput limit in line with instance count
Closed, Resolved · Public

Description

Back in autumn 2015 we were struggling with compactions in large instances falling behind. At the time, we were using leveled compaction & a single instance per hardware node. To keep up, we increased the compaction throughput limit as well as compaction parallelism from the default settings.

We have since made several changes that reduce the need for such high compaction throughput:

  • We moved from LCS to DTCS, reducing the overall compaction activity.
  • We moved from one instance to 2-3 instances per hardware node.

We are also seeing relatively high bursts of IO activity, leading to high iowait. This might be contributing to relatively high p99 latencies (see T140286).

Proposal: Reduce compaction throughput limit in line with instance count & strategy change

Our goal is to keep the impact of background compaction activity to a minimum, while also ensuring timely compaction without backlogs. Our current compaction throughput is significantly higher than what should be needed in a multi-instance DTCS setup. To remedy this, I am proposing to gradually lower compaction throughput limits:

  • Start with 1/3 the current value, in line with instance count.
  • From there, continue reducing the limit until the outstanding-compactions metric starts to show minor movement.
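
For reference, a minimal sketch of the mechanics this involves, assuming the standard nodetool subcommands and cassandra.yaml settings (the target value below is illustrative):

nodetool getcompactionthroughput
nodetool setcompactionthroughput 20

# cassandra.yaml, to persist the lower limit across restarts:
#   compaction_throughput_mb_per_sec: 20
# concurrent_compactors is the knob behind the parallelism increase mentioned above (left unchanged here)

The nodetool call takes effect immediately and without a restart, which is what makes the gradual, step-wise lowering possible; the cassandra.yaml change keeps the setting across the next restart.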

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · Jul 19 2016, 8:30 PM
GWicke triaged this task as Medium priority. · Jul 19 2016, 8:30 PM

Mentioned in SAL [2016-07-19T21:23:21Z] <gwicke> temporarily lowered compaction throughput on all 1012 instances from 60mb/s to 20mb/s via nodetool setcompactionthroughput 20 (T140825)

Mentioned in SAL [2016-07-19T21:43:43Z] <gwicke> temporarily lowering compaction throughput on all eqiad restbase cassandra instances from 60mb/s to 20mb/s via nodetool setcompactionthroughput 20 (T140825)

Effect on iowait, disk write throughput, and GC time: (graphs omitted).

Read latency has only shown moderate movement so far. It is possible that the effect is overshadowed by the ongoing bootstrap on restbase1013. We'll find out once that is done.

Eevans moved this task from Backlog to In-Progress on the Cassandra board. · Jul 20 2016, 1:02 PM

Change 300056 had a related patch set uploaded (by Eevans):
RESTBase Cassandra: Lower compaction throughput to 20MB/s

https://gerrit.wikimedia.org/r/300056

Longer-term effect on iowait: (graph omitted).

Overall read latency has dropped during the periods when no bootstraps were ongoing (graph omitted).

This means that this change addressed at least part of T140286: Elevated 99p RESTBase storage latencies.

Compaction throughput / backlogging has not been an issue so far, but we should also keep an eye on that.
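
One lightweight way to keep an eye on it is the standard nodetool output (a sketch; any alerting threshold would still need to be chosen):

nodetool compactionstats
nodetool tpstats | grep CompactionExecutor

The first command reports pending and in-progress compaction tasks per instance; the second reports active, pending, and blocked compaction threads. A pending count that grows steadily over time would indicate that the lowered throughput limit is too aggressive.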

GWicke added a comment (edited). · Jul 20 2016, 7:55 PM

One more potential improvement I noticed while investigating this is that write IO is still relatively bursty, with little or no write activity followed by a burst every five seconds or so. Our current trickle_fsync interval is 30MB, which is rather large. I think it's worth lowering this significantly, perhaps to 8MB. All of these values are still well above typical SSD erase block sizes of 256KB.

Edit: Patch at https://gerrit.wikimedia.org/r/#/c/300100/
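
For reference, a rough sketch of the relevant cassandra.yaml settings (8MB corresponds to 8192 KB; treat the excerpt as illustrative rather than as the exact contents of the patch):

# cassandra.yaml
trickle_fsync: true
trickle_fsync_interval_in_kb: 8192

With trickle_fsync enabled, Cassandra fsyncs sequentially written SSTable data every trickle_fsync_interval_in_kb, so dirty pages are flushed in small, steady increments instead of accumulating and being written out in one large burst.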

Change 300056 merged by Filippo Giunchedi:
RESTBase Cassandra: Lower compaction throughput to 20MB/s

https://gerrit.wikimedia.org/r/300056

Mentioned in SAL [2016-07-22T15:16:00Z] <urandom> T140825: Restarting Cassandra to apply 8MB trickle_fsync (restbase1015-a.eqiad.wmnet)

Mentioned in SAL [2016-07-25T15:34:35Z] <urandom> T140825, T134016: Restarting Cassandra to apply stream timeout, and 8MB trickle_fsync (restbase1008-a.eqiad.wmnet)

Mentioned in SAL [2016-07-25T15:39:29Z] <urandom> T140825, T134016: Restarting Cassandra to apply stream timeout, and 8MB trickle_fsync (restbase1008-b.eqiad.wmnet)

Mentioned in SAL [2016-07-25T15:43:06Z] <urandom> T140825, T134016: Restarting Cassandra to apply stream timeout, and 8MB trickle_fsync (restbase1008-c.eqiad.wmnet)

I did locally try the alternative of starting dirty page write-back early with

sysctl -w vm.dirty_background_ratio=3

The effect of this setting is to asynchronously (without blocking) write out dirty pages in a more continuous fashion. Blocking at the VM level would only occur once vm.dirty_ratio is reached. See https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ for more background.

I think it is worth trying this in addition or even as an alternative to trickle_fsync, considering how easy it is to apply temporarily.
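
A concrete sketch of what "easy to apply temporarily" means here (the 24MB figure matches what was later set on restbase1012; the sysctl.d path is illustrative):

# apply immediately, no restart required (setting the _bytes variant zeroes the _ratio counterpart):
sysctl -w vm.dirty_background_bytes=$((24*1024*1024))

# inspect the effective thresholds:
sysctl vm.dirty_background_bytes vm.dirty_background_ratio vm.dirty_ratio

# persist across reboots, for example:
echo 'vm.dirty_background_bytes = 25165824' > /etc/sysctl.d/90-cassandra-writeback.conf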

Mentioned in SAL [2016-07-25T15:53:44Z] <urandom> T140825: Setting vm.dirty_background_bytes=24M on restbase1012.eqiad.wmnet

Mentioned in SAL [2016-07-25T15:54:10Z] <urandom> T140825, T134016: Restarting Cassandra to apply stream timeout, and disable trickle_fsync (restbase1012-a.eqiad.wmnet)

Mentioned in SAL [2016-07-25T16:02:15Z] <urandom> T140825, T134016: Restarting Cassandra to apply stream timeout, and disable trickle_fsync (restbase1012-b.eqiad.wmnet)

Mentioned in SAL [2016-07-25T16:06:55Z] <urandom> T140825, T134016: Restarting Cassandra to apply stream timeout, and disable trickle_fsync (restbase1012-c.eqiad.wmnet)

The instances on restbase1008 were all restarted with a trickle_fsync interval of 8MB, starting at ~15:30 and completing at 15:43 UTC today.

The instances on restbase1012 were all restarted with trickle_fsync disabled and vm.dirty_background_bytes set to 24MB, starting at ~15:54 and completing at 16:06 UTC.

On the whole, vm.dirty_background_bytes looks better in this rather limited experiment. Both hosts have seen relatively little compaction activity during this time, so it might be interesting to note what happens when that is not the case. I need to bootstrap an instance into this rack (restbase1013-c, as part of T134016: RESTBase Cassandra cluster: Increase instance count to 3), which should provide the opportunity to do just that...

Based on the limited data so far, read latency seems pretty much unaffected on either host (graph omitted).

In theory, dirty_background_{bytes,ratio} seems like the better solution to the even-write problem, but there might still be a role for trickle_fsync in that it paces the producing process before it causes major stalls. Those stalls would only happen during extremely heavy write activity, once dirty_ratio is reached.
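
A rough way to compare the two approaches on a live host (a sketch using standard tools; the sampling interval is arbitrary) is to watch dirty-page accumulation alongside per-device write activity:

grep -E 'Dirty|Writeback' /proc/meminfo
iostat -xm 5

With trickle_fsync, the compaction writer itself flushes roughly every 8MB; with vm.dirty_background_bytes, the kernel starts asynchronous write-back once about 24MB of dirty pages have accumulated. In either case the iostat write columns should show smoother, less bursty activity than before the change.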

Mentioned in SAL [2016-07-26T19:33:10Z] <urandom> T140825: Setting vm.dirty_background_bytes=24576 (restbase1009.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:33:40Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1009-a.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:37:41Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1009-b.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:42:50Z] <urandom> T140825: Setting vm.dirty_background_bytes=24576 (restbase1014.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:43:05Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1014-a.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:49:36Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1014-b.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:53:45Z] <urandom> T140825: Setting vm.dirty_background_bytes=24576 (restbase1015.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:54:06Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1015-a.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:58:40Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1015-b.eqiad.wmnet)

Change 301390 had a related patch set uploaded (by Eevans):
configurable trickle_fsync

https://gerrit.wikimedia.org/r/301390

Change 301390 merged by Elukey:
configurable trickle_fsync

https://gerrit.wikimedia.org/r/301390

Change 301425 had a related patch set uploaded (by Eevans):
Configurable vm.dirty_background_bytes parameter

https://gerrit.wikimedia.org/r/301425

Eevans moved this task from In-Progress to Blocked on the Cassandra board. · Aug 3 2016, 3:26 PM

Change 301425 merged by Filippo Giunchedi:
Configurable vm.dirty_background_bytes parameter

https://gerrit.wikimedia.org/r/301425

Mentioned in SAL [2016-08-04T17:08:34Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1007-a.eqiad.wmnet

Mentioned in SAL [2016-08-04T17:14:51Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1007-b.eqiad.wmnet

Mentioned in SAL [2016-08-04T17:17:37Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1007-c.eqiad.wmnet

Mentioned in SAL [2016-08-04T19:37:28Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1010-a.eqiad.wmnet

Mentioned in SAL [2016-08-04T19:40:07Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1010-b.eqiad.wmnet

Mentioned in SAL [2016-08-04T19:42:22Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1010-c.eqiad.wmnet

Mentioned in SAL [2016-08-04T20:03:38Z] <urandom> T140825,T140869: Performing Cassandra instance rolling restart of restbase1011.eqiad.wmnet

Mentioned in SAL [2016-08-04T20:10:55Z] <urandom> T140825,T140869: Cassandra instance restarts complete: restbase1011.eqiad.wmnet

Mentioned in SAL [2016-08-04T20:13:20Z] <urandom> T140825,T140869: Performing rolling restart of codfw Cassandra instances

Mentioned in SAL [2016-08-04T21:14:18Z] <urandom> T140825,T140869: Rolling restart of codfw Cassandra instances complete

Mentioned in SAL [2016-08-04T21:26:10Z] <urandom> T140825,T140869: Rolling restart of Cassandra instances, eqiad Rack `b'

Mentioned in SAL [2016-08-04T21:48:28Z] <urandom> T140825,T140869: Rolling restart of Cassandra instances, eqiad Rack `b', complete

Mentioned in SAL [2016-08-04T21:50:14Z] <urandom> T140825,T140869: Rolling restart of Cassandra instances, eqiad Rack `d'

Eevans closed this task as Resolved. · Aug 5 2016, 2:42 PM

The last of the changes associated with this issue has been applied to the production cluster; I think we can close this now.