The Compression algorithm block size is perhaps the most important parameter for our chunking strategy. Currently, our only option is deflate with a relatively small 32k window, but we could add brotli compression support in order to support window sizes up to 16m (see T93496). To benefit from delta compression, we need to group chunks of a size smaller than the compression algorithm's window size.
The second critical factor in our design is overheads per IO / Cassandra request. At a relatively small chunk size of 32k, reading the 1.2mb of HTML for large pages like [Barack Obama](https://en.wikipedia.org/api/rest_v1/page/html/Barack_Obama) will result in about 37 IOs. While this isn't necessarily bad for latency of individual requests, it will very likely reduce the throughput that can be sustained by a given cluster. While this might not matter so much for archival use cases with relatively low request rates, it is likely that a larger chunk size would provide a better trade-off between latency and throughput.
## Chunking strategies
The simplest chunking strategy cuts up an input into a sequence of fixed-size chunks. The disadvantage of this scheme is that small changes early in a document will modify all following chunks, which in turn makes it necessary to store new versions of all those chunks.
Content like HTML offers semantic boundaries like <section> tags (see T114072), which we can exploit to maintain alignment of chunks despite small changes. To provide some slack for small changes, chunks should have a minimum and maximum size. For example, with a maximum chunk size of 32k it might make sense to enforce a minimum chunk size of perhaps 24k. If one or more natural elements like <section>s can fill a chunk to between 24 and 32k size, then that alignment is used. If 24k is not reached & the following natural <section> would be larger than 32, it is cut up at the maximum byte size, reverting to a byte-based chunking.
## Storing only changed chunks
In either chunking strategy it is likely that many / most chunks won't change at all between versions. We can save significant storage size and compaction load by sharing such chunks between successive versions. On write, we'd read all chunks for the previous version & compare them to the new chunks. Unchanged chunks are then skipped, and only changed chunks actually written.
This brings up the question of how we should best index chunks. We would like to avoid some of the very wide rows we have seen with very frequently edited articles. A partition / hash key of (title, chunk_number) would nicely distribute chunks for large pages across the cluster. The most obvious choice for a range key to provide time-based clustering is a tid, resulting in an index definition of:
- property: title
- property: chunk_number
- property: tid
Writing multiple chunks is not an atomic operation. Partial failures are possible anywhere along the way. We need to make sure that no matter what happens, readers still see a consistent state, rather than a mix of old & new chunks.
A standard way to ensure this is to atomically update a manifest entry point listing the precise tids of each chunk *after* all chunks have been successfully written. Reads then first retrieve the manifest, followed by the precise chunk versions listed in the manifest.
This introduces an additional round-trip for the manifest read in the read path. We could avoid this by using a technique similar to [RAMP transactions](http://www.bailis.org/papers/ramp-sigmod2014.pdf):
1) Add a field pointing to a transaction record in each chunk, and write it with each in-progress chunk. Add an `is_last` boolean / checksum to the last chunk.
2) Mark the transaction record as 'committed' after writing all chunks. We could take a small risk & go with a simple write for this. If we want to be sure this is atomic we'd use CAS.
3) Mark all newly-written chunks as committed.
1) Read latest chunks at desired time, using the `tid` field. Without explicit chunk count information, we'll want to do some speculative read-ahead for subsequent chunks to achieve a reasonable amount of parallelism.
2) If any of them is not marked as committed, check the pointed-to transaction record.
a) If the transaction record indicates that the transaction committed, mark the chunk as committed as well & use it.
b) If the transaction is not committed / the transaction record is missing after being GC'ed, delete the chunk and read the next older chunk. Go to 1).
c) If the transaction is not committed & recent, just read the next older chunk & repeat from 1)
This scheme gets us the first byte in a single Cassandra round-trip. Downsides are added complexity in both read & write paths, and limited read parallelism without embarking on too much speculation.
### Retention policies & deletions
Retention policies would only delete chunks that aren't needed any more. This is slightly complicated by the lack of explicit revision information. Instead, the base `tid` of the current revision would need to be used as a lower bound to identify renders, with the chunks used in the latest render as the exclusive upper bound.
We currently provide some listings for renders ([example](http://en.wikipedia.org/api/rest_v1/page/html/San_Francisco/)). In the current storage layout, these don't scale very well. Listings for pages with a lot of renders / revisions routinely time out after two seconds.
Using compact render metadata doubling as transaction records for these listings would be a lot more efficient.