Page MenuHomePhabricator

Improve revision compression in Cassandra / Brotli or LZMA support
Closed, DeclinedPublic

Description

We use deflate compression on tables storing HTML and data-parsoid. With its 32kb sliding window deflate picks up a lot of the repetition that is inherent in revisions of the same page, where only a small part of the page changes on each edit. Combined with a 256k block size this results in compression ratios of currently ~15% for data-parsoid and ~17% for html.

While the majority of pages by title count have HTML and data-parsoid that is smaller than the deflate 32kb window, this might not be true by size. It is thus likely that we could get significantly better compression ratios (possibly 2x) if we used a compression algorithm with a window closer to the block size.

LZMA with settings equivalent to xz -1 uses a window size of 2mb, so would fit that bill. In simple commandline-based benchmarks using xz -1 and gzip on multi-revision HTML input it seems to outperform deflate in both compression ratio and compression / decompression times. It might thus be worth investigating adding LZMA support to cassandra.

See also:

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description. (Show Details)
GWicke added subscribers: GWicke, Eevans, mobrovac.
GWicke renamed this task from Implement LZMA compression support for Cassandra to Improve revision compression in Cassandra / LZMA support.Mar 21 2015, 10:59 PM
GWicke set Security to None.
GWicke triaged this task as Medium priority.

With regard to LZMA libraries for Java, there appears to be two choices, the LZMA SDK from 7zip.org, and XZ for Java. Of the two, XZ for Java seems the better choice, (more actively developed, does not employ JNI and native libraries, etc).

I did some basic testing to compare LZMA with Deflate using the {Input,Output}Stream interfaces of XZ for Java, and java.util.zip (which uses the same underlying code that Cassandra's DeflateCompressor does).

The compression tests perform N tests per run, where each test appends an additional copy of the input document to the working copy as a crude approximation of the repetition inherent in a sequence of document revisions. Where N is 4, a run will perform 4 test compressions, <input>, <input>+<input>, <input>+<input>+<input>, and <input>+<input>+<input>+<input>. The LZMA dictionary size for the tests was set to 2MB, roughly 2x the largest test input of BarackObama. The remainder of the settings match preset level 1, and a nice length of 8 bytes. Deflate settings match those used in Cassandra's ICompressor implementation (FASTEST).

The decompression tests simply iterate over the persisted data from a corresponding test compression, recording the elapsed decompression time of each.

Both the compression and decompression tests perform 26 sequential runs, 25 to warm up the JVM, and the 26th to use as the results.

Disclaimer: This is far from a perfect benchmark; The intention here is to determine if this warrants implementing a complete ICompressor for further testing.

LZMA Compression, Barack_Obama:

Compressed barack_obama.x1 (1068901 bytes) to 131060 bytes (0.1226) in 53.80482600 ms
Compressed barack_obama.x2 (2137802 bytes) to 131060 bytes (0.0613) in 73.34051700 ms
Compressed barack_obama.x3 (3206703 bytes) to 183359 bytes (0.0572) in 89.38178700 ms
Compressed barack_obama.x4 (4275604 bytes) to 183359 bytes (0.0429) in 105.90887900 ms
Compressed barack_obama.x5 (5344505 bytes) to 183664 bytes (0.0344) in 125.11384700 ms
Compressed barack_obama.x6 (6413406 bytes) to 183664 bytes (0.0286) in 142.09250300 ms
Compressed barack_obama.x7 (7482307 bytes) to 183969 bytes (0.0246) in 156.88575000 ms
Compressed barack_obama.x8 (8551208 bytes) to 183969 bytes (0.0215) in 174.76391100 ms

GZIP Compression, Barack_Obama

Compressed barack_obama.x1 (1068901 bytes) to 189198 bytes (0.1770) in 24.81865900 ms
Compressed barack_obama.x2 (2137802 bytes) to 378829 bytes (0.1772) in 49.84685300 ms
Compressed barack_obama.x3 (3206703 bytes) to 568569 bytes (0.1773) in 75.34463800 ms
Compressed barack_obama.x4 (4275604 bytes) to 758399 bytes (0.1774) in 99.81238300 ms
Compressed barack_obama.x5 (5344505 bytes) to 948211 bytes (0.1774) in 124.71964800 ms
Compressed barack_obama.x6 (6413406 bytes) to 1137631 bytes (0.1774) in 150.47960400 ms
Compressed barack_obama.x7 (7482307 bytes) to 1327066 bytes (0.1774) in 177.98831000 ms
Compressed barack_obama.x8 (8551208 bytes) to 1516912 bytes (0.1774) in 200.69609200 ms

LZMA Decompression, Barack_Obama

Decompressed barack_obama.x1.LZMA in 19.53259600 ms (1068901 bytes)
Decompressed barack_obama.x2.LZMA in 24.13829800 ms (2137802 bytes)
Decompressed barack_obama.x3.LZMA in 27.80673700 ms (3206703 bytes)
Decompressed barack_obama.x4.LZMA in 32.45574700 ms (4275604 bytes)
Decompressed barack_obama.x5.LZMA in 35.83678000 ms (5344505 bytes)
Decompressed barack_obama.x6.LZMA in 40.19839200 ms (6413406 bytes)
Decompressed barack_obama.x7.LZMA in 45.12875100 ms (7482307 bytes)
Decompressed barack_obama.x8.LZMA in 47.98506200 ms (8551208 bytes)

GZIP Decompression, Barack Obama

Decompressed barack_obama.x1.GZIP in 4.47076300 ms (1068901 bytes)
Decompressed barack_obama.x2.GZIP in 8.53181700 ms (2137802 bytes)
Decompressed barack_obama.x3.GZIP in 12.32709800 ms (3206703 bytes)
Decompressed barack_obama.x4.GZIP in 16.37447100 ms (4275604 bytes)
Decompressed barack_obama.x5.GZIP in 20.47309400 ms (5344505 bytes)
Decompressed barack_obama.x6.GZIP in 24.68755300 ms (6413406 bytes)
Decompressed barack_obama.x7.GZIP in 28.64809000 ms (7482307 bytes)
Decompressed barack_obama.x8.GZIP in 33.04704100 ms (8551208 bytes)

LZMA Compression, FooBar

Compressed foobar.x1 (58691 bytes) to 15960 bytes (0.2719) in 5.87923100 ms
Compressed foobar.x2 (117382 bytes) to 16040 bytes (0.1366) in 6.72729000 ms
Compressed foobar.x3 (176073 bytes) to 16051 bytes (0.0912) in 8.41632700 ms
Compressed foobar.x4 (234764 bytes) to 16060 bytes (0.0684) in 8.46131800 ms
Compressed foobar.x5 (293455 bytes) to 16068 bytes (0.0548) in 9.51738100 ms
Compressed foobar.x6 (352146 bytes) to 16076 bytes (0.0457) in 12.37571000 ms
Compressed foobar.x7 (410837 bytes) to 16085 bytes (0.0392) in 11.00927100 ms
Compressed foobar.x8 (469528 bytes) to 16093 bytes (0.0343) in 12.23966100 ms

GZIP Compression, FooBar

Compressed foobar.x1 (58691 bytes) to 10 bytes (0.0002) in 1.31012800 ms
Compressed foobar.x2 (117382 bytes) to 25288 bytes (0.2154) in 3.31015400 ms
Compressed foobar.x3 (176073 bytes) to 25288 bytes (0.1436) in 5.03542500 ms
Compressed foobar.x4 (234764 bytes) to 51699 bytes (0.2202) in 7.06060700 ms
Compressed foobar.x5 (293455 bytes) to 51699 bytes (0.1762) in 8.61110100 ms
Compressed foobar.x6 (352146 bytes) to 78111 bytes (0.2218) in 11.23726800 ms
Compressed foobar.x7 (410837 bytes) to 104719 bytes (0.2549) in 12.81285500 ms
Compressed foobar.x8 (469528 bytes) to 104719 bytes (0.2230) in 14.37606200 ms

LZMA Decompression, FooBar

Decompressed foobar.x1.LZMA in 2.01611200 ms (58691 bytes)
Decompressed foobar.x2.LZMA in 2.30793200 ms (117382 bytes)
Decompressed foobar.x3.LZMA in 2.65616900 ms (176073 bytes)
Decompressed foobar.x4.LZMA in 2.74773900 ms (234764 bytes)
Decompressed foobar.x5.LZMA in 2.95598200 ms (293455 bytes)
Decompressed foobar.x6.LZMA in 3.26531500 ms (352146 bytes)
Decompressed foobar.x7.LZMA in 3.66266100 ms (410837 bytes)
Decompressed foobar.x8.LZMA in 3.70670000 ms (469528 bytes)

GZIP Decompression, FooBar

Decompressed foobar.x1.GZIP in 0.35499700 ms (58691 bytes)
Decompressed foobar.x2.GZIP in 0.68477100 ms (117382 bytes)
Decompressed foobar.x3.GZIP in 1.01138600 ms (176073 bytes)
Decompressed foobar.x4.GZIP in 1.32426300 ms (234764 bytes)
Decompressed foobar.x5.GZIP in 1.51298200 ms (293455 bytes)
Decompressed foobar.x6.GZIP in 1.84860300 ms (352146 bytes)
Decompressed foobar.x7.GZIP in 2.13760400 ms (410837 bytes)
Decompressed foobar.x8.GZIP in 2.80724900 ms (469528 bytes)

@Eevans, those are interesting results. Decompression looks surprisingly slow. I wonder if reducing the window size to 512k or 1m would help there.

Could you upload your code, so that we can try a couple more parameter combinations?

LZMA Compression, Barack_Obama, 512K DictSize:

Compressed barack_obama.x1 (1068901 bytes) to 170516 bytes (0.1595) in 62.93616000 ms
Compressed barack_obama.x2 (2137802 bytes) to 337470 bytes (0.1579) in 137.77425000 ms
Compressed barack_obama.x3 (3206703 bytes) to 504352 bytes (0.1573) in 201.88907900 ms
Compressed barack_obama.x4 (4275604 bytes) to 671210 bytes (0.1570) in 249.72651700 ms
Compressed barack_obama.x5 (5344505 bytes) to 838037 bytes (0.1568) in 316.23188900 ms
Compressed barack_obama.x6 (6413406 bytes) to 1004859 bytes (0.1567) in 343.87962100 ms
Compressed barack_obama.x7 (7482307 bytes) to 1171661 bytes (0.1566) in 412.14118600 ms
Compressed barack_obama.x8 (8551208 bytes) to 1338465 bytes (0.1565) in 492.30683200 ms

LZMA Decompression, Barack_Obama, 512K DictSize:

Decompressed barack_obama.x1.LZMA in 21.98755300 ms (1068901 bytes)
Decompressed barack_obama.x2.LZMA in 39.82396400 ms (2137802 bytes)
Decompressed barack_obama.x3.LZMA in 62.85033000 ms (3206703 bytes)
Decompressed barack_obama.x4.LZMA in 81.93743200 ms (4275604 bytes)
Decompressed barack_obama.x5.LZMA in 100.43536000 ms (5344505 bytes)
Decompressed barack_obama.x6.LZMA in 125.30379300 ms (6413406 bytes)
Decompressed barack_obama.x7.LZMA in 142.36573300 ms (7482307 bytes)
Decompressed barack_obama.x8.LZMA in 162.73874300 ms (8551208 bytes)

LZMA Compression, Barack_Obama, 1M DictSize:

Compressed barack_obama.x1 (1068901 bytes) to 183161 bytes (0.1714) in 49.45814200 ms
Compressed barack_obama.x2 (2137802 bytes) to 359115 bytes (0.1680) in 103.66387600 ms
Compressed barack_obama.x3 (3206703 bytes) to 535010 bytes (0.1668) in 152.99017900 ms
Compressed barack_obama.x4 (4275604 bytes) to 710866 bytes (0.1663) in 225.39870800 ms
Compressed barack_obama.x5 (5344505 bytes) to 886712 bytes (0.1659) in 276.20910500 ms
Compressed barack_obama.x6 (6413406 bytes) to 1062544 bytes (0.1657) in 323.17625900 ms
Compressed barack_obama.x7 (7482307 bytes) to 1238356 bytes (0.1655) in 383.19363400 ms
Compressed barack_obama.x8 (8551208 bytes) to 1414169 bytes (0.1654) in 442.07934000 ms

LZMA Decompression, Barack_Obama, 1M DictSize:

Decompressed barack_obama.x1.LZMA in 19.61151100 ms (1068901 bytes)
Decompressed barack_obama.x2.LZMA in 39.46507300 ms (2137802 bytes)
Decompressed barack_obama.x3.LZMA in 57.78714800 ms (3206703 bytes)
Decompressed barack_obama.x4.LZMA in 78.10525300 ms (4275604 bytes)
Decompressed barack_obama.x5.LZMA in 93.25809600 ms (5344505 bytes)
Decompressed barack_obama.x6.LZMA in 116.73230800 ms (6413406 bytes)
Decompressed barack_obama.x7.LZMA in 133.74977300 ms (7482307 bytes)
Decompressed barack_obama.x8.LZMA in 151.66453000 ms (8551208 bytes)

LZMA Compression, FooBar, 512K DictSize:

Compressed foobar.x1 (58691 bytes) to 15961 bytes (0.2719) in 4.28349800 ms
Compressed foobar.x2 (117382 bytes) to 16042 bytes (0.1367) in 5.31875800 ms
Compressed foobar.x3 (176073 bytes) to 16052 bytes (0.0912) in 6.32154300 ms
Compressed foobar.x4 (234764 bytes) to 16061 bytes (0.0684) in 8.83761800 ms
Compressed foobar.x5 (293455 bytes) to 16069 bytes (0.0548) in 8.00163400 ms
Compressed foobar.x6 (352146 bytes) to 16078 bytes (0.0457) in 8.54309700 ms
Compressed foobar.x7 (410837 bytes) to 16086 bytes (0.0392) in 9.86812000 ms
Compressed foobar.x8 (469528 bytes) to 16094 bytes (0.0343) in 10.71924500 ms

LZMA Decompression, FooBar, 512K DictSize:

Decompressed foobar.x1.LZMA in 1.91265700 ms (58691 bytes)
Decompressed foobar.x2.LZMA in 2.11635800 ms (117382 bytes)
Decompressed foobar.x3.LZMA in 2.32348700 ms (176073 bytes)
Decompressed foobar.x4.LZMA in 3.75445400 ms (234764 bytes)
Decompressed foobar.x5.LZMA in 2.83085700 ms (293455 bytes)
Decompressed foobar.x6.LZMA in 3.35945800 ms (352146 bytes)
Decompressed foobar.x7.LZMA in 4.38254500 ms (410837 bytes)
Decompressed foobar.x8.LZMA in 3.53572400 ms (469528 bytes)

LZMA Compression, FooBar, 1M DictSize:

Compressed foobar.x1 (58691 bytes) to 15960 bytes (0.2719) in 5.09642600 ms
Compressed foobar.x2 (117382 bytes) to 16040 bytes (0.1366) in 5.98636600 ms
Compressed foobar.x3 (176073 bytes) to 16051 bytes (0.0912) in 7.27437100 ms
Compressed foobar.x4 (234764 bytes) to 16060 bytes (0.0684) in 9.53693800 ms
Compressed foobar.x5 (293455 bytes) to 16068 bytes (0.0548) in 9.67168600 ms
Compressed foobar.x6 (352146 bytes) to 16076 bytes (0.0457) in 10.53980300 ms
Compressed foobar.x7 (410837 bytes) to 16085 bytes (0.0392) in 11.51398800 ms
Compressed foobar.x8 (469528 bytes) to 16093 bytes (0.0343) in 12.01107200 ms

LZMA Decompression, FooBar, 1M DictSize:

Decompressed foobar.x1.LZMA in 1.93766300 ms (58691 bytes)
Decompressed foobar.x2.LZMA in 2.48479200 ms (117382 bytes)
Decompressed foobar.x3.LZMA in 2.70221100 ms (176073 bytes)
Decompressed foobar.x4.LZMA in 2.78743000 ms (234764 bytes)
Decompressed foobar.x5.LZMA in 2.84856200 ms (293455 bytes)
Decompressed foobar.x6.LZMA in 3.13510500 ms (352146 bytes)
Decompressed foobar.x7.LZMA in 3.53029000 ms (410837 bytes)
Decompressed foobar.x8.LZMA in 3.48778400 ms (469528 bytes)

Compression ratios are really good when compared to deflate. Compression time is also comparable. But decompression... It's an order of magnitude slower even with different window sizes, which is a real pity given that RESTBase is more oriented towards reads.

Reads are slower, but at the sizes that are relevant for us (256k or 512k input sizes most likely) it's the same order of magnitude (1.9 vs. 3.1 ms in my test run for 512k, dict size 512k).

My main concern is about the heap pressure that's caused by the Java implementation, especially during decompression. Gzip on the other hand is native C code, so doesn't weigh on the heap as much. If we can establish that this is not an issue then I think the decompression time delta itself should be okay. If that turns out too much of an issue then using a native library via JNI might make sense. It could also be worth profiling the decompressor code a bit to see if there are low-hanging optimization fruit in there.

For comparison, I implemented similar tests of the LZMA SDK. The results that follow are using the default settings (which frustratingly do not correspond to a preset, like xz does). The compression numbers here are obviously way outside what we would need though, and I'm relatively certain that it's because this library lacks a hash-chaining match finder, (XZ for Java's performance is comparable with a b-tree mf); I find it unlikely that any amount of tuning will make this library suitable for our purposes.

Compressed barack_obama 1068901 bytes to 151026 bytes (0.1413) in 371.32125700 ms
Compressed barack_obama_x2 2137802 bytes to 151257 bytes (0.0708) in 617.57614700 ms
Compressed barack_obama_x3 3206703 bytes to 151407 bytes (0.0472) in 877.74511500 ms
Compressed barack_obama_x4 4275604 bytes to 151558 bytes (0.0354) in 1136.28482800 ms
Compressed barack_obama_x5 5344505 bytes to 151709 bytes (0.0284) in 1380.23471600 ms
Compressed barack_obama_x6 6413406 bytes to 151859 bytes (0.0237) in 1629.73898100 ms
Compressed barack_obama_x7 7482307 bytes to 152011 bytes (0.0203) in 1880.59167300 ms
Compressed barack_obama_x8 8551208 bytes to 152161 bytes (0.0178) in 2133.95139500 ms

Decompressed barack_obama.7zip in 19.82368600 ms
Decompressed barack_obama_x2.7zip in 22.22633200 ms
Decompressed barack_obama_x3.7zip in 25.84445500 ms
Decompressed barack_obama_x4.7zip in 29.30868300 ms
Decompressed barack_obama_x5.7zip in 33.18767900 ms
Decompressed barack_obama_x6.7zip in 38.82403000 ms
Decompressed barack_obama_x7.7zip in 39.86679700 ms
Decompressed barack_obama_x8.7zip in 43.68196600 ms

Reads are slower, but at the sizes that are relevant for us (256k or 512k input sizes most likely) it's the same order of magnitude (1.9 vs. 3.1 ms in my test run for 512k, dict size 512k).

Updated numbers with liblzma-jna are looking very promising:

512k block decompression, input from en:Foobar:

  • liblzma-jna: 0.9ms
  • deflate: 1.9ms
  • xz-java: 3.1ms

Great set of compression algorithm benchmarks: https://quixdb.github.io/squash-benchmark/

Brotli is showing promise by being faster than lzma -1 (and a lot faster for decompression) at similar compression ratios. Make sure to select enwiki and server hardware (Xenon) for the comparison.

The Squash abstraction library might be generally an attractive alternative to wrapping algorithms directly.

GWicke renamed this task from Improve revision compression in Cassandra / LZMA support to Improve revision compression in Cassandra / Brotli or LZMA support.Feb 8 2016, 5:39 PM
GWicke edited projects, added Services (later); removed Services.