
Cassandra compaction is getting behind
Closed, Resolved (Public)

Description

Since RESTBase's release, the number of pending compactions in Cassandra has been trending upward. This will eventually cause problems.

Throughput is currently set to the default of 16 MB/s, which is quite conservative for our environment, particularly given our use of Leveled Compaction.

If there are no objections, I'm going to start gradually increasing it (ephemerally, using nodetool), and observe the results.
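
For reference, a change like this can be applied live with nodetool along these lines (a sketch; the target value here is illustrative, the actual increments used are recorded in the comments below):

$ # Check the current setting, then raise it ephemerally (reverts to cassandra.yaml on restart)
$ nodetool -h restbase1001 getcompactionthroughput
$ nodetool -h restbase1001 setcompactionthroughput 24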

cassandra_pending_compactions-00.png (656×853 px, 59 KB)

Event Timeline

Eevans claimed this task.
Eevans raised the priority of this task from to High.
Eevans updated the task description. (Show Details)
Eevans added subscribers: Eevans, fgiunchedi, akosiaris.
$ for host in 1 2 3 4 5 6; do echo -n "restbase100$host: "; nodetool -h restbase100$host getcompactionthroughput; done
restbase1001: Current compaction throughput: 24 MB/s
restbase1002: Current compaction throughput: 24 MB/s
restbase1003: Current compaction throughput: 24 MB/s
restbase1004: Current compaction throughput: 24 MB/s
restbase1005: Current compaction throughput: 24 MB/s
restbase1006: Current compaction throughput: 24 MB/s

fyi you can do for i in {1..6} :)

Compaction throughput is now set to 128 MB/s, but with only 2 compaction threads, actual throughput appears to be limited to roughly 50 MB/s. Pending compactions seem to be leveling off a bit though, so I'm going to leave everything as-is for the evening and look into raising concurrency tomorrow.
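
For the record, a cluster-wide bump like this can be applied with the same loop pattern as above (a sketch, assuming nodetool setcompactionthroughput was run against each node):

$ for host in 1 2 3 4 5 6; do nodetool -h restbase100$host setcompactionthroughput 128; done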

cassandra_pending_compactions-01.png (502×1 px, 54 KB)

After letting it run overnight, it seems that restbase1001-1004 are in OK shape, steadily trending down, but restbase1005 and restbase1006 (the two nodes involved in the recent bootstrap operation) are headed for trouble.

I'll put together a patch for configuring a larger compaction thread pool.
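
For context, the knobs involved live in cassandra.yaml, roughly as follows (a sketch with placeholder values; the actual values are in the Gerrit changes below):

# cassandra.yaml (sketch)
concurrent_compactors: 4                  # size of the compaction thread pool (placeholder)
compaction_throughput_mb_per_sec: 128     # per-node throttle, shared across compaction threads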

cassandra_pending_compactions-02.png (444×1 px, 59 KB)

Change 197911 had a related patch set uploaded (by Eevans):
overrid-able concurrent_compactors setting

https://gerrit.wikimedia.org/r/197911

Change 197915 had a related patch set uploaded (by Eevans):
increase compaction throughput and concurrency

https://gerrit.wikimedia.org/r/197915

Thanks for looking into this! What are reasonable thresholds we should alert on?
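
One possible shape for such a check, parsing the pending-compaction count out of nodetool compactionstats (the threshold value is a hypothetical placeholder, not a recommendation; picking the right number is exactly the open question):

#!/bin/bash
# Hypothetical NRPE-style check: alert when pending compactions exceed a threshold.
THRESHOLD=500   # placeholder value only
pending=$(nodetool compactionstats | awk '/pending tasks:/ {print $3}')
if [ "$pending" -gt "$THRESHOLD" ]; then
  echo "WARNING: $pending pending compactions (threshold $THRESHOLD)"
  exit 1
fi
echo "OK: $pending pending compactions"
exit 0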

Change 197911 merged by Ori.livneh:
overrid-able concurrent_compactors setting

https://gerrit.wikimedia.org/r/197911

Change 197915 merged by Faidon Liambotis:
increase compaction throughput and concurrency

https://gerrit.wikimedia.org/r/197915

Update: pending compactions are now all trending downward. Monitoring will continue.

cassandra_pending_compactions-03.png (500×1 px, 88 KB)

Change 198781 had a related patch set uploaded (by Eevans):
increased compaction concurrency and throughput

https://gerrit.wikimedia.org/r/198781

Change 198781 merged by Gage:
increased compaction concurrency and throughput

https://gerrit.wikimedia.org/r/198781

We also enabled trickle_fsync, which made a big difference to latency under heavy write load by writing changes out continuously rather than waiting for the VM subsystem to flush dirty pages in big bursts. We can now run full compactions with no noticeable request latency impact & iowait limited to around 1%.
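
For reference, the relevant cassandra.yaml settings look roughly like this (a sketch; the interval shown is the Cassandra default, not necessarily what was deployed here):

trickle_fsync: true
trickle_fsync_interval_in_kb: 10240   # fsync incrementally every ~10 MB written, instead of in large bursts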

Overall, it looks like the new settings have given us some headroom to keep up with compactions without affecting latency. However, we are also seeing the typical signs that this much storage per instance is not such a good idea (primarily GC pressure), so we are looking into setting up multiple instances per hardware node. See T93790: Expand RESTBase cluster capacity for a discussion of the options.

@Eevans, should we close this task now, or should we keep it open until we have an alert for the compaction backlog?

@GWicke, I think it should be closed; the original issue is, for all intents and purposes, solved. We can track the threshold alert in T78514, or create a new one.

Eevans updated the task description. (Show Details)
Eevans set Security to None.