
Evaluate Brotli compression for Cassandra
Closed, Declined · Public

Description

An experimental implementation of Cassandra's ICompressor interface has been put together, which creates a shaded jar to bundle the needed dependencies (including the appropriate native library). Basic (locally run) tests have proven promising in terms of performance/resource utilization, stability, and compression ratio. The next step would be to develop and execute a test methodology under less contrived conditions (multiple nodes, production-like data, and representative load).
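For orientation, once such a jar is on Cassandra's classpath, the compressor is selected per table by its fully qualified class name in the table's compression options. A minimal sketch follows; the keyspace, table, and chunk size here are placeholders, not the actual test schema:

# assumes the shaded jar has been dropped into Cassandra's lib directory,
# e.g. /usr/share/cassandra/lib/, and the node restarted
$ cqlsh -e "ALTER TABLE some_keyspace.some_table WITH compression = {
    'sstable_compression': 'org.apache.cassandra.io.compress.BrotliCompressor',
    'chunk_length_kb': '64'};"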

Related Objects

Event Timeline

Eevans claimed this task.
Eevans raised the priority of this task from to Medium.
Eevans updated the task description.
Eevans added a project: RESTBase-Cassandra.
Eevans added subscribers: akosiaris, fgiunchedi, Aklapper, Eevans.

@Eevans, I have done some more tests locally; see T122028 for the results.

As a next step, I think it would be useful to test brotli with larger data sets in staging. A fairly straightforward method would be to convert the existing HTML data from deflate to brotli using nodetool scrub, and then evaluate real-world performance and stability by running a dump against the cluster.
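A rough sketch of that flow, assuming the html table is local_group_wikipedia_T_parsoid_html.data (per the keyspaces mentioned later in this task) and an illustrative chunk size; nodetool scrub rewrites the existing SSTables, so the rewritten data picks up the newly configured compressor:

# 1. switch the table's compression from deflate to brotli
$ cqlsh -e "ALTER TABLE local_group_wikipedia_T_parsoid_html.data WITH compression = {
    'sstable_compression': 'org.apache.cassandra.io.compress.BrotliCompressor',
    'chunk_length_kb': '256'};"

# 2. rewrite the existing (deflate-compressed) SSTables, on each node
$ nodetool scrub local_group_wikipedia_T_parsoid_html data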

A quick way to get this started would be to do a manual install of the brotli compressor jar. This would allow us to start testing soon, while a proper puppet-triggered install is created in parallel.

I went ahead and pinged @fgiunchedi to be sure he was OK with this (he is, so long as it is limited to staging).

The jar has been copied out to staging, and Cassandra restarted.

$ dsh -Mg c_test -- ls -l /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar
cerium.eqiad.wmnet: -rw-r--r-- 1 root root 2658206 Feb  8 18:22 /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar
praseodymium.eqiad.wmnet: -rw-r--r-- 1 root root 2658206 Feb  8 18:22 /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar
xenon.eqiad.wmnet: -rw-r--r-- 1 root root 2658206 Feb  8 18:22 /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar
restbase-test2001.codfw.wmnet: -rw-r--r-- 1 root root 2658206 Feb  8 18:22 /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar
restbase-test2002.codfw.wmnet: -rw-r--r-- 1 root root 2658206 Feb  8 18:22 /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar
restbase-test2003.codfw.wmnet: -rw-r--r-- 1 root root 2658206 Feb  8 18:22 /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar

I converted the following keyspaces to brotli compression:

local_group_wikipedia_T_parsoid_html
local_group_wikipedia_T_parsoid_section_offsets
local_group_wikipedia_T_parsoid_dataW* (data-parsoid)
local_group_wikipedia_T_page__titles

An enwiki dump is now running to stress Cassandra.

Update: Also converted mobile lead / remainder section tables.

Which settings are being used for testing?

@mobrovac, compression = {'sstable_compression': 'org.apache.cassandra.io.compress.BrotliCompressor', 'quality': 5, 'chunk_length_kb': 'XXX'}

where XXX is:

  • 16m for html and data-parsoid,
  • 4m for titles,
  • 1m for mobile lead / remainder.
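Spelled out (chunk_length_kb is expressed in KB, so 16m = 16384, 4m = 4096, 1m = 1024), the html/data-parsoid case would look roughly like the following; the table name is approximated from the keyspaces listed earlier, so treat it as illustrative:

$ cqlsh -e "ALTER TABLE local_group_wikipedia_T_parsoid_html.data WITH compression = {
    'sstable_compression': 'org.apache.cassandra.io.compress.BrotliCompressor',
    'quality': 5, 'chunk_length_kb': '16384'};"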

Why do mobile sections have a smaller chunk size than titles? They should contain more data (especially the remainder, which contains disproportionately more data than the lead).

Mobile sections only contain one version per article, so they don't gain much from compression blocks that are much larger than a single entry.

For titles, the optimal size might well be a bit smaller. I mostly wanted to see if there are ill effects from setting it a bit larger.

Edit: I lowered the revision__titles compression block size to 512k.

Update on the test in staging:

  • Using 8m chunks, the compression ratio approached 5% (compressed size as a fraction of the original) on at least one of the nodes when compacted down to two SSTables.
  • With an 8g heap and a 32m region size, GC has been keeping up with low and moderate concurrency dumps (concurrency 25, ~150 req/s); see the sketch just after this list.
  • The dump is now running with concurrency 50. CPU saturates occasionally, but so far GC is doing okay. If things still look okay tonight, we can keep this dump running over the weekend.
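For reference, a minimal sketch of how the 8g heap / 32m region size above might be expressed in cassandra-env.sh, assuming G1 is the collector in use (the actual staging configuration may differ):

# cassandra-env.sh (excerpt)
MAX_HEAP_SIZE="8G"                               # fixed 8 GB heap
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"                # G1 collector
JVM_OPTS="$JVM_OPTS -XX:G1HeapRegionSize=32m"    # 32 MB regions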

If stability with 8g heap & 32m region size is confirmed, key_rev_value buckets with large blocks will be a viable option for old-revision storage.

Given the limited throughput costs, we can consider using brotli with smaller block sizes in production. We can then gradually ramp up block sizes, and consider going all the way to 8m once we have a key_value bucket for current revisions in place & filled.


Nice; it continues to look promising.


Yeah, I'm -1 on going to production with settings that are more aggressive memory-wise (i.e. compression strategy, chunk_length_kb, or GC). A new compression strategy (and one that hasn't been battle-tested, to boot) carries enough risk and introduces a lot of variability as it is. And any ramp-up of chunk_length_kb (and the corresponding GC settings needed to compensate) needs to happen during a period when the cluster is under well-understood, stable, steady-state conditions. This precludes doing it while expansions are underway (T119935, T125842, and T95253), where streaming and compaction activity would make it difficult to reason about cause and effect.

That said, if we feel we have collected enough data to satisfy the question of what is possible, then I'd like to start developing a testing methodology for staging using production-like settings (as close as possible given what we have to work with). I'm sure SRE has questions here, so getting consensus from them on testing is important as well (/cc @fgiunchedi).

A full dump run at concurrency 50 has finished without issues. As a further stress test, I have now started another dump run, this time with concurrency 100.

I'm -1 on going to prod with Brotli before we finish the cluster expansion and complete the multi-instance setup. I think we should be looking at Brotli as a long-term strategic move, not a short-term effort to reduce storage growth. I don't feel adventurous enough to introduce yet another variable into the equation at this point.

The dump with concurrency 100 did not uncover any issues over the last two days.

@Eevans, @fgiunchedi: I'm done testing for now, so the cluster is yours to perform further testing.

Regarding the deployment timeline, let's consider our options once we have an idea of how long the conversion to a multi-instance setup is going to take. See T95253: Finish conversion to multiple Cassandra instances per hardware node.

Mentioned in SAL [2016-03-02T21:12:53Z] <urandom> forcing a major compaction on {local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ,local_group_wikipedia_T_parsoid_html}.data, xenon.eqiad.wmnet : T125906

Mentioned in SAL [2016-03-02T22:28:01Z] <urandom> enabling brotli compression on local_group_wikipedia_T_parsoid_html.data in staging, and forcing rewrite of corresponding tables on xenon : T125906
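For the record, those two SAL entries correspond roughly to steps like the following; the table names follow the keyspaces listed earlier, and upgradesstables -a is only one plausible way to force a rewrite of already-current SSTables (scrub being another):

# force a major compaction of the table on the node in question
$ nodetool compact local_group_wikipedia_T_parsoid_html data

# after ALTERing the table to brotli, force existing SSTables to be rewritten
$ nodetool upgradesstables -a local_group_wikipedia_T_parsoid_html data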

Change 275917 had a related patch set uploaded (by Eevans):
(temporarily) enable thrift rpc in staging

https://gerrit.wikimedia.org/r/275917

Change 275917 merged by Filippo Giunchedi:
(temporarily) enable thrift rpc in staging

https://gerrit.wikimedia.org/r/275917

Update: Changes to Cassandra's ICompressor interface in 2.2.5 will require some changes to the experimental implementation.

The good news, though, is that jbrotli does support direct byte buffers, and if the LZ4 results are any indication, we stand to see significant performance improvements.


A WFM (works-for-me) implementation can be found in the cass2.2 branch.

GWicke edited projects, added Services (attic); removed Services.