We use deflate compression on the tables storing HTML and data-parsoid. With its 32 KB sliding window, deflate picks up much of the repetition inherent in revisions of the same page, where only a small part of the page changes on each edit. Combined with a 256 KB block size, this currently yields compression ratios of ~15% for data-parsoid and ~17% for HTML.
While the majority of pages by title count have HTML and data-parsoid smaller than deflate's 32 KB window, this may not hold when pages are weighted by size. Once a single revision exceeds the window, deflate can no longer see the matching content in the previous revision. It is thus likely that we could get significantly better compression ratios (possibly 2x) if we used a compression algorithm with a window closer to the block size.
LZMA with settings equivalent to `xz -1` uses a dictionary (window) size of 1 MiB, which is larger than the 256 KB block size, so it would fit the bill. In simple command-line benchmarks using `xz -1` and `gzip` on multi-revision HTML input, it appears to outperform deflate in both compression ratio //and// compression / decompression times. It might thus be worth investigating adding LZMA support to Cassandra.
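
For anyone who wants to reproduce the general effect without the command-line tools, here is a minimal sketch using Python's standard `zlib` and `lzma` modules. It is not the benchmark referred to above: the input is synthetic (random words, with a few words changed per revision), and `make_revisions`, `compare`, and all parameter values are hypothetical names and numbers chosen for illustration. `preset=1` is meant to mirror `xz -1`, and zlib level 6 mirrors the `gzip` default.

```python
import lzma
import random
import zlib

BLOCK_SIZE = 256 * 1024  # block size mentioned in the description


def make_revisions(words_per_page=10_000, count=4, edits=20, seed=0):
    """Build synthetic revisions of one page (illustrative data only).

    Each revision differs from the previous one by a handful of words,
    and a single revision is larger than deflate's 32 KB window.
    """
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    vocab = [
        "".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
        for _ in range(5000)
    ]
    page = [rng.choice(vocab) for _ in range(words_per_page)]
    revisions = []
    for _ in range(count):
        # Simulate a small edit: replace a few words in place.
        for _ in range(edits):
            page[rng.randrange(len(page))] = rng.choice(vocab)
        revisions.append(" ".join(page).encode("ascii"))
    return revisions


def compare(revisions):
    """Compress one block of concatenated revisions with deflate and LZMA."""
    block = b"".join(revisions)[:BLOCK_SIZE]
    deflated = zlib.compress(block, 6)     # deflate, 32 KB window
    xz = lzma.compress(block, preset=1)    # LZMA2 in an .xz container
    return {"raw": len(block), "deflate": len(deflated), "lzma": len(xz)}


if __name__ == "__main__":
    sizes = compare(make_revisions())
    for name, size in sizes.items():
        print(f"{name:8s} {size:8d} bytes  ({size / sizes['raw']:.1%} of raw)")
```

Because each synthetic revision is larger than 32 KB, deflate cannot reach back to the previous revision, while LZMA can. Real HTML and data-parsoid will behave differently in absolute terms, so any numbers from this sketch should only be read as a sanity check of the window-size argument.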