Page MenuHomePhabricator

Option: Consider switching back to leveled compaction (LCS)
Closed, DeclinedPublic


As discussed in T94121 and T150811, leveled compaction could increase data locality, and limit the number of SSTables that need to be touched for a read. However, there are also issues with LCS. This task is to collect information on both pros & cons of using LCS, to be used as part of creating an overall plan.


  • Read performance. Splits token range into non-overlapping SSTables. Total number of touched SSTables limited by the number of levels. For reasonably-sized instances, this number is 5 or 6.
  • Handles skew updates efficiently by compacting busy token range more frequently than rately updated ones.


  • SSTables containing a frequently updated partition are still considered overlapping with those in lower levels for compaction purposes, as Cassandra only considers the presence of a partition & the write time of an entire SSTable. Tombstones for wide rows are only collected in the lowest level, despite not actually shadowing any data in the lower levels in the normal case.

Other considerations

  • Write amplification is normally higher with LCS than with other standard strategies. However, in practice STCS and TWCS both do not offer sufficient read performance & timely tombstone collection, and are thus combined with ad-hoc manual compactions. It is unclear how the overall write amplification including those manual major compactions compares with LCS.

Event Timeline

GWicke updated the task description. (Show Details)
GWicke updated the task description. (Show Details)
Ottomata triaged this task as Medium priority.Mar 6 2017, 7:19 PM

@Eevans, I know you have switched some keyspaces on the dev cluster from TWCS to LCS. What has been the effect on data size, compaction throughput, and possibly read latencies?

@Eevans, I know you have switched some keyspaces on the dev cluster from TWCS to LCS. What has been the effect on data size, compaction throughput, and possibly read latencies?

I have done what I would categorize as 'casual testing' so far, and I fear that has surfaced more questions than it has answered.

As an experiment, I used 5GB table sizes (quite large). At our instances sizes (for Parsoid html) this has thus far worked out to pretty close to a 10 table level 1, in addition to what is in level 0 (a couple of instances do also have a 2 or 3 table level 2). Read latency is pretty good, with a 99p SSTables/read of ~3. What is unclear to me is how tombstone GC is working.

If you look at this paste, you can see that the tables in level 1 nominally have 0% droppable tombstones, while the tables in level 0 are > 200%, which is clearly pretty wrong. As an experiment, I forced a re-level, and afterward it looked like this. Doing this dropped the data size by more than half, from 56G to 23G.

It's my expectation that tombstone GC will be less than ideal for us in this use case. My main concern is a) we're able to understand/quantify how it is performing, and that b) we find a configuration that is sustainable (that even if not an ideal value, the ratio of droppable data is stable).

I'll work on a plan for getting some more definitive answers and update/open issues accordingly.

I feel this is pretty much a non-starter for RESTBase use-cases. The write amplification can be quite problematic, as T180568: Aberrant load on instances involved in recent bootstrap demonstrated. Additionally, our storage strategies generate a significant number of tombstones, and LCS's tombstone GC is legendarily poor.

Conversely, the use of size-tiered compaction with the new range deleted-based strategy has proven quite effective; No need to fix what is not broken.