We use deflate compression on the tables storing HTML and data-parsoid. With its 32 KB sliding window, deflate picks up much of the repetition inherent in revisions of the same page, where only a small part of the page changes on each edit. Combined with a 256 KB block size, this currently yields compression ratios of ~15% for data-parsoid and ~17% for HTML.
While the majority of pages by title count have HTML and data-parsoid smaller than deflate's 32 KB window, this might not hold when pages are weighted by size: for a page larger than 32 KB, the matching content in the previous revision falls outside the window, so deflate cannot reference it. It is thus likely that we could get significantly better compression ratios (possibly 2x) with a compression algorithm whose window is closer to the block size.
LZMA with settings equivalent to xz -1 uses a dictionary (window) size of 1 MiB, which is larger than the block size, so it would fit that bill. In simple command-line benchmarks using xz -1 and gzip on multi-revision HTML input, it seems to outperform deflate in compression ratio as well as in compression and decompression times. It might thus be worth investigating adding LZMA support to Cassandra.
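
The window-size effect can be sanity-checked outside Cassandra with Python's standard-library zlib and lzma modules. The sketch below is only an illustration, not the benchmark mentioned above: the synthetic pseudo-HTML pages, the 64 KB page size, the four revisions per block, and the helper names make_revisions and compare are all assumptions made up for the example. Real HTML and data-parsoid will give different numbers.

```python
import lzma
import random
import zlib


def make_revisions(n_revisions=4, page_size=64 * 1024, seed=0):
    """Build synthetic 'revisions' of one page (hypothetical test data).

    Each revision is the previous one with a single 16-byte in-place edit,
    mimicking edits that touch only a small part of a page.
    """
    rng = random.Random(seed)
    words = [("w%04x" % rng.randrange(16 ** 4)).encode() for _ in range(5000)]
    page = bytearray()
    while len(page) < page_size:
        page += b"<p>" + rng.choice(words) + b"</p>"
    revisions = []
    for i in range(n_revisions):
        pos = rng.randrange(len(page) - 32)
        page[pos:pos + 16] = b"edit-%010d-" % i  # exactly 16 bytes
        revisions.append(bytes(page))
    return revisions


def compare(block):
    """Return raw and compressed sizes in bytes for one block."""
    return {
        "raw": len(block),
        # zlib implements deflate, whose window is at most 32 KB.
        "deflate": len(zlib.compress(block, 6)),
        # preset=1 selects the same LZMA settings as `xz -1`.
        "lzma": len(lzma.compress(block, preset=1)),
    }


if __name__ == "__main__":
    # Four ~64 KB revisions roughly fill one 256 KB block.
    block = b"".join(make_revisions())
    for name, size in compare(block).items():
        print(f"{name:8s} {size:8d} bytes  ({100.0 * size / len(block):.1f}%)")
```

Because each synthetic page is larger than 32 KB, deflate has to compress every revision more or less from scratch, while LZMA's larger dictionary can refer back to the previous revision. Shrinking page_size below 32 KB should narrow the gap, which matches the reasoning above about pages that fit inside the deflate window.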