
Evaluate Brotli compression for Cassandra
Closed, Declined · Public

Description

An experimental implementation of Cassandra's ICompressor interface has been put together, which creates a shaded jar to bundle the needed dependencies (including the appropriate native library). Basic (locally run) tests have proven promising in terms of performance/resource utilization, stability, and compression ratio. The next step would be to develop and execute a test methodology under less contrived conditions (multiple nodes, production-like data, and representative load).
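For orientation, once such a jar is on Cassandra's classpath, the compressor is selected per table by its fully qualified class name in the table's compression options. A minimal sketch follows; the keyspace, table, and chunk size here are placeholders, not the actual test schema:

# assumes the shaded jar has been dropped into Cassandra's lib directory,
# e.g. /usr/share/cassandra/lib/, and the node restarted
$ cqlsh -e "ALTER TABLE some_keyspace.some_table WITH compression = {
    'sstable_compression': 'org.apache.cassandra.io.compress.BrotliCompressor',
    'chunk_length_kb': '64'};"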

Related Objects

Event Timeline

Eevans claimed this task.
Eevans raised the priority of this task from to Medium.
Eevans updated the task description.
Eevans added a project: RESTBase-Cassandra.
Eevans added subscribers: akosiaris, fgiunchedi, Aklapper, Eevans.

@Eevans, I have done some more tests locally; see T122028 for the results.

As a next step, I think it would be useful to test brotli with larger data sets in staging. A fairly straightforward method would be to convert the existing HTML data from deflate to brotli using nodetool scrub, and then evaluate real-world performance and stability by running a dump against the cluster.
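A rough sketch of that flow, assuming the html table is local_group_wikipedia_T_parsoid_html.data (per the keyspaces mentioned later in this task) and an illustrative chunk size; nodetool scrub rewrites the existing SSTables, so the rewritten data picks up the newly configured compressor:

# 1. switch the table's compression from deflate to brotli
$ cqlsh -e "ALTER TABLE local_group_wikipedia_T_parsoid_html.data WITH compression = {
    'sstable_compression': 'org.apache.cassandra.io.compress.BrotliCompressor',
    'chunk_length_kb': '256'};"

# 2. rewrite the existing (deflate-compressed) SSTables, on each node
$ nodetool scrub local_group_wikipedia_T_parsoid_html data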

A quick way to get this started would be to do a manual install of the brotli compressor jar. This would allow us to start testing soon, while a proper puppet-triggered install is created in parallel.

I went ahead and pinged @fgiunchedi to be sure he was OK with this (he is, so long as it is limited to staging).

The jar has been copied out to staging, and Cassandra restarted.

$ dsh -Mg c_test -- ls -l /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar
cerium.eqiad.wmnet: -rw-r--r-- 1 root root 2658206 Feb  8 18:22 /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar
praseodymium.eqiad.wmnet: -rw-r--r-- 1 root root 2658206 Feb  8 18:22 /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar
xenon.eqiad.wmnet: -rw-r--r-- 1 root root 2658206 Feb  8 18:22 /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar
restbase-test2001.codfw.wmnet: -rw-r--r-- 1 root root 2658206 Feb  8 18:22 /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar
restbase-test2002.codfw.wmnet: -rw-r--r-- 1 root root 2658206 Feb  8 18:22 /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar
restbase-test2003.codfw.wmnet: -rw-r--r-- 1 root root 2658206 Feb  8 18:22 /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar

I converted the following keyspaces to brotli compression:

local_group_wikipedia_T_parsoid_html
local_group_wikipedia_T_parsoid_section_offsets
local_group_wikipedia_T_parsoid_dataW* (data-parsoid)
local_group_wikipedia_T_page__titles

An enwiki dump is now running to stress Cassandra.

Update: Also converted mobile lead / remainder section tables.

Which settings are being used for testing?

@mobrovac, compression = {'sstable_compression': 'org.apache.cassandra.io.compress.BrotliCompressor', 'quality': 5, 'chunk_length_kb': 'XXX'}

where XXX is:

  • 16m for html and data-parsoid,
  • 4m for titles,
  • 1m for mobile lead / remainder.
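Spelled out (chunk_length_kb is expressed in KB, so 16m = 16384, 4m = 4096, 1m = 1024), the html/data-parsoid case would look roughly like the following; the table name is approximated from the keyspaces listed earlier, so treat it as illustrative:

$ cqlsh -e "ALTER TABLE local_group_wikipedia_T_parsoid_html.data WITH compression = {
    'sstable_compression': 'org.apache.cassandra.io.compress.BrotliCompressor',
    'quality': 5, 'chunk_length_kb': '16384'};"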

Why do mobile sections have a smaller chunk size than titles? They should contain more data (especially the remainder, which contains disproportionately more data than the lead).

Mobile sections only contain one version per article, so they don't gain much from compression blocks that are much larger than a single entry.

For titles, the optimal size might well be a bit smaller. I mostly wanted to see if there are ill effects from setting it a bit larger.

Edit: I lowered the revision__titles compression block size to 512k.

Update on the test in staging:

  • Using 8m chunks, the compression ratio approached 5% (compressed size as a fraction of the original) on at least one of the nodes when compacted down to two SSTables.
  • With an 8g heap and a 32m region size, GC has been keeping up with low and moderate concurrency dumps (concurrency 25, ~150 req/s); see the sketch just after this list.
  • The dump is now running with concurrency 50. CPU saturates occasionally, but so far GC is doing okay. If things still look okay tonight, we can keep this dump running over the weekend.
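For reference, a minimal sketch of how the 8g heap / 32m region size above might be expressed in cassandra-env.sh, assuming G1 is the collector in use (the actual staging configuration may differ):

# cassandra-env.sh (excerpt)
MAX_HEAP_SIZE="8G"                               # fixed 8 GB heap
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"                # G1 collector
JVM_OPTS="$JVM_OPTS -XX:G1HeapRegionSize=32m"    # 32 MB regions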

If stability with 8g heap & 32m region size is confirmed, key_rev_value buckets with large blocks will be a viable option for old-revision storage.

Given the limited throughput costs, we can consider using brotli with smaller block sizes in production. We can then gradually ramp up block sizes, and consider going all the way to 8m once we have a key_value bucket for current revisions in place & filled.


Nice; it continues to look promising.


Yeah, I'm -1 on going to production with settings that are more aggressive memory-wise (i.e. compression strategy, chunk_length_kb, or GC). A new compression strategy (and one that hasn't been battle-tested, to boot) carries enough risk and introduces a lot of variability as it is. And any ramp-up of chunk_length_kb (and the corresponding GC settings needed to compensate) needs to happen during a period when the cluster is under well-understood, stable, steady-state conditions. This precludes doing it while expansions are underway (T119935, T125842, and T95253), where streaming and compaction activity would make it difficult to reason about cause and effect.

That said, if we feel we have collected enough data to satisfy the question of what is possible, then I'd like to start developing a testing methodology for staging using production-like settings (as close as possible given what we have to work with). I'm sure SRE has questions here, so getting consensus from them on testing is important as well (/cc @fgiunchedi).

A full dump run at concurrency 50 has finished without issues. As a further stress test, I have now started another dump run, this time with concurrency 100.

I'm -1 on going to prod with Brotli before we finish the cluster expansion and complete the multi-instance setup. I think we should be looking at Brotli as a long-term strategic move, not a short-term effort to reduce storage growth. I don't feel adventurous enough to introduce yet another variable into the equation at this point.

The dump with concurrency 100 did not uncover any issues over the last two days.

@Eevans, @fgiunchedi: I'm done testing for now, so the cluster is yours to perform further testing.

Regarding the deployment timeline, let's consider our options once we have an idea of how long the conversion to a multi-instance setup is going to take. See T95253: Finish conversion to multiple Cassandra instances per hardware node.

Mentioned in SAL [2016-03-02T21:12:53Z] <urandom> forcing a major compaction on {local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ,local_group_wikipedia_T_parsoid_html}.data, xenon.eqiad.wmnet : T125906

Mentioned in SAL [2016-03-02T22:28:01Z] <urandom> enabling brotli compression on local_group_wikipedia_T_parsoid_html.data in staging, and forcing rewrite of corresponding tables on xenon : T125906
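For the record, those two SAL entries correspond roughly to steps like the following; the table names follow the keyspaces listed earlier, and upgradesstables -a is only one plausible way to force a rewrite of already-current SSTables (scrub being another):

# force a major compaction of the table on the node in question
$ nodetool compact local_group_wikipedia_T_parsoid_html data

# after ALTERing the table to brotli, force existing SSTables to be rewritten
$ nodetool upgradesstables -a local_group_wikipedia_T_parsoid_html data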

Change 275917 had a related patch set uploaded (by Eevans):
(temporarily) enable thrift rpc in staging

https://gerrit.wikimedia.org/r/275917

Change 275917 merged by Filippo Giunchedi:
(temporarily) enable thrift rpc in staging

https://gerrit.wikimedia.org/r/275917

Update: Changes to Cassandra's ICompressor interface in 2.2.5 will require some changes to the experimental implementation.

The good news, though, is that jbrotli does support direct byte buffers, and if the LZ4 results are any indication, we stand to see significant performance improvements.


A WFM (works-for-me) implementation can be found in the cass2.2 branch.

GWicke edited projects, added Services (attic); removed Services.