Our data model for HTML content does not distinguish between low-latency, high-volume access to current revisions and long-term archival storage. This leaves room for optimizing each of those two use cases separately.
Compressed HTML content currently comes to around 16% of the input size. Since the changes between revisions are typically small, ratios in the low single-digit percent range ought to be possible. The main obstacle for HTML is the deflate window size of 32k, which fails to pick up repetitions between revisions of articles larger than 32k; such articles are relatively common. We could add Brotli support to Cassandra to get larger windows, but to exploit this for whole-article storage we would then need to use extremely large input block sizes to pick up a decent number of repetitions. This in turn would likely make reads slower and more memory-intensive.
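The window limitation is easy to demonstrate with Python's zlib (an illustrative experiment, not part of the design): two identical "revisions" compress well only if the repeat falls within the 32 KiB deflate window.

```python
import random
import zlib

rng = random.Random(42)
small = bytes(rng.getrandbits(8) for _ in range(16 * 1024))  # 16 KiB of incompressible data
big = bytes(rng.getrandbits(8) for _ in range(64 * 1024))    # 64 KiB of incompressible data

# Two identical 16 KiB revisions: the repeat starts 16 KiB back, within
# the 32 KiB window, so the second copy is encoded as back-references.
print(len(zlib.compress(small + small, 9)) / len(small + small))  # ratio close to 0.5

# Two identical 64 KiB revisions: the repeat starts 64 KiB back, beyond
# the window, so deflate cannot reference it and stores both copies.
print(len(zlib.compress(big + big, 9)) / len(big + big))          # ratio close to 1.0
```

The same effect is what keeps whole-article compression of adjacent revisions from reaching low single-digit ratios for articles larger than the window.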
A better option to reduce storage needs could be to chunk content, ideally in alignment with semantic blocks like top-level sections in HTML. If these chunks are smaller than the compression algorithm's window size (32k for deflate), then it will pick up repetitions between chunks. Additionally, most edits only affect a single chunk in a large document. We can skip adding new versions of unchanged chunks altogether, which also reduces the write load. The first chunk should normally load more quickly than an entire document, reducing the time to first byte. There is also growing demand for section-based content loading at the API level, which can be efficiently supported by storing the content in sections in the first place.
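A minimal sketch of the chunking idea, assuming chunks are delimited by top-level `<section>` tags and deduplicated by content hash (the function names and the regex-based splitter are illustrative; a real implementation would use an HTML parser):

```python
import hashlib
import re

def chunk_by_sections(html):
    # Split before each top-level <section> tag; illustrative only,
    # and does not handle nested sections.
    return [c for c in re.split(r'(?=<section\b)', html) if c]

def changed_chunks(old_html, new_html):
    # Return only the chunks of the new revision whose content hash is
    # not present in the old revision; unchanged chunks need not be
    # written again, reducing write load.
    old = {hashlib.sha1(c.encode()).digest() for c in chunk_by_sections(old_html)}
    return [c for c in chunk_by_sections(new_html)
            if hashlib.sha1(c.encode()).digest() not in old]
```

With chunks under 32k, deflate's window then covers each chunk pair, and an edit touching one section writes only that section's new chunk.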
The deflate / gzip window size of 32k is likely smaller than what we'd pick for an optimal trade-off between number of IOs and compression block size needed to pick up a decent number of repetitions. Adding Brotli support in Cassandra can give us a wider range of options for this trade-off. However, we can get started with deflate & leverage Brotli in a later iteration.
Another consideration is separating hot from cold storage, so that we can replicate hot data to the edge, but keep cold archival data only in two DCs and possibly on more density-optimized hardware. We can do this relatively easily by storing current revisions in a key-value bucket in addition to archival storage. Within this bucket, we can store each revision as an individually gzip-compressed blob, ready to be streamed to the client without any de/re-compression overheads. The main gains in this scheme should come from the lack of extra reads and computation in Cassandra, as well as avoiding the need to compress data on the way out.
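The hot-bucket scheme can be sketched in a few lines, using a plain dict as a stand-in for the key-value bucket (names are hypothetical):

```python
import gzip

def put_current(bucket, title, html):
    # Compress once at write time; the hot bucket holds the blob
    # exactly as it will go over the wire.
    bucket[title] = gzip.compress(html.encode('utf-8'), compresslevel=9)

def get_current(bucket, title):
    # No de/re-compression on the read path: the stored blob is
    # streamed as-is with Content-Encoding: gzip.
    return {'content-encoding': 'gzip'}, bucket[title]
```

Clients that accept gzip receive the stored bytes unchanged; only clients without gzip support would require decompression at the edge.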
Schema sketch
title_revision_restriction table
Stores only restrictions on pages / revisions. This will be a very sparse table with no or very few entries for most pages. As a result, it will easily fit in page cache, making queries consistently fast.
- index: (title, rev)
- static column: page_deleted (tid)
- non-key attributes: other per-revision restrictions
Current revision key-value bucket
Stores the current revision of each title, with per-blob gzip -9 compression and no Cassandra-level compression. Reads don't involve decompressing / re-compressing content. The ETag header contains the revision & tid.
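A possible shape for that ETag, assuming a "{rev}/{tid}" layout quoted per RFC 7232 (the exact format is an assumption, not specified above):

```python
def make_etag(rev, tid):
    # Hypothetical layout: revision number and time-based UUID,
    # wrapped in double quotes as RFC 7232 requires.
    return '"%d/%s"' % (rev, tid)
```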
Archival: key_rev_value bucket, possibly a chunked_key_rev_value bucket (T122028)
Stores the current *and* old versions of pages. Storage is density-optimized, possibly at the cost of some read latency.
Read path for /{title}
In parallel,
- query title_revision_restriction for the title, limit 1
- query current revision key-value bucket for the content
If any restrictions were returned, withhold the content if
a) the page is marked as deleted (static column), or
b) the restriction revision matches the blob revision.
Otherwise, return the content.
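The parallel read path above can be sketched with asyncio; `restrictions.get` and `current_bucket.get` are hypothetical async accessors standing in for the two Cassandra queries:

```python
import asyncio

async def read_title(title, restrictions, current_bucket):
    # Issue both queries concurrently, as described above.
    restriction, (blob, blob_rev) = await asyncio.gather(
        restrictions.get(title),
        current_bucket.get(title),
    )
    if restriction is not None:
        # a) page marked as deleted via the static column
        if restriction.get('page_deleted'):
            raise PermissionError('page deleted')
        # b) restriction applies to the revision we are about to serve
        if restriction.get('rev') == blob_rev:
            raise PermissionError('revision restricted')
    return blob
```

Since the restriction table is sparse and cached, the common case costs one fast restriction lookup overlapped with the blob read.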
Read path for /{title}/{revision}
In parallel,
- query title_revision_restriction for the title and revision, limit 1
- query the key_rev_value bucket for the title / revision.
If restrictions are returned, withhold the content if
a) the page is marked as deleted (static column), or
b) the restriction revision matches the blob revision.
Otherwise, return the content.