
RFC: Chunked storage algorithms for archival data vs. large-window brotli compression
Closed, Declined · Public

Description

Parameters

The compression algorithm's block size is perhaps the most important parameter for our chunking strategy. Currently, our only option is deflate with a relatively small 32k window, but we could add brotli support to enable window sizes up to 16m (see T93496). To benefit from delta-style compression between revisions, chunks need to be smaller than the compression algorithm's window size.

The second critical factor in our design is the overhead per IO / Cassandra request. At a relatively small chunk size of 32k, reading the 1.2mb of HTML for a large page like Barack Obama results in about 37 IOs. While this isn't necessarily bad for the latency of individual requests, it will very likely reduce the throughput a given cluster can sustain. This might not matter much for archival use cases with relatively low request rates, but a larger chunk size would likely provide a better trade-off between latency and throughput.
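
As a back-of-the-envelope illustration (the helper below is purely illustrative), the IO count per read is roughly the page size divided by the chunk size:

    // Rough illustration only: Cassandra reads needed to assemble a page
    // at a given chunk size.
    function readsPerPage(pageBytes, chunkBytes) {
        return Math.ceil(pageBytes / chunkBytes);
    }

    readsPerPage(1.2e6, 32 * 1024);   // ~1.2mb Obama HTML at 32k chunks -> 37 reads
    readsPerPage(1.2e6, 256 * 1024);  // the same page at 256k chunks    ->  5 reads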

Chunking strategies

The simplest chunking strategy cuts up an input into a sequence of fixed-size chunks. The disadvantage of this scheme is that small changes early in a document will modify all following chunks, which in turn makes it necessary to store new versions of all those chunks.
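
A minimal sketch of fixed-size chunking (illustrative only, not the actual implementation):

    // Cut a buffer into fixed-size chunks; only the last chunk may be shorter.
    function chunkFixed(buf, chunkSize) {
        const chunks = [];
        for (let offset = 0; offset < buf.length; offset += chunkSize) {
            chunks.push(buf.slice(offset, offset + chunkSize));
        }
        return chunks;
    }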

Content like HTML offers semantic boundaries such as <section> tags (see T114072), which we can exploit to maintain chunk alignment despite small changes. To provide some slack for small changes, chunks should have a minimum and a maximum size. For example, with a maximum chunk size of 32k it might make sense to enforce a minimum chunk size of perhaps 24k. If one or more natural elements like <section>s can fill a chunk to between 24k and 32k, that alignment is used. If 24k is not reached & the next natural <section> boundary lies beyond 32k, the chunk is cut at the maximum byte size, reverting to byte-based chunking.
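
A sketch of this boundary-aware variant; findSectionBoundaries is a hypothetical helper returning the byte offsets of </section> tags:

    // Prefer cutting at the last section boundary that falls between minSize
    // and maxSize after the current chunk start; otherwise fall back to a
    // hard cut at maxSize (plain byte-based chunking).
    function chunkAtBoundaries(html, minSize, maxSize) {
        const boundaries = findSectionBoundaries(html); // hypothetical helper
        const chunks = [];
        let start = 0;
        while (html.length - start > maxSize) {
            const fitting = boundaries
                .map(b => b - start)
                .filter(b => b >= minSize && b <= maxSize);
            const cut = fitting.length ? Math.max(...fitting) : maxSize;
            chunks.push(html.slice(start, start + cut));
            start += cut;
        }
        chunks.push(html.slice(start));
        return chunks;
    }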

Storing only changed chunks

With either chunking strategy it is likely that many or most chunks won't change at all between versions. We can save significant storage space and compaction load by sharing such chunks between successive versions. On write, we'd read all chunks of the previous version & compare them to the new chunks. Unchanged chunks are then skipped, and only changed chunks are actually written.
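
A sketch of this write path, assuming hypothetical chunkTable.get / chunkTable.put helpers; unchanged chunks keep the previous version's tid:

    // Compare each new chunk against the previous render's chunk and only
    // write the ones that actually changed. Returns the (chunk_number, tid)
    // pairs that make up the new render.
    async function writeChangedChunks(title, newTid, newChunks, prevTid) {
        const used = [];
        for (let i = 0; i < newChunks.length; i++) {
            const prev = prevTid &&
                await chunkTable.get({ title, chunk_number: i, tid: prevTid });
            if (prev && prev.value.equals(newChunks[i])) {
                used.push({ chunk_number: i, tid: prevTid });
            } else {
                await chunkTable.put({ title, chunk_number: i, tid: newTid,
                    value: newChunks[i] });
                used.push({ chunk_number: i, tid: newTid });
            }
        }
        return used;
    }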

This brings up the question of how we should best index chunks. We would like to avoid some of the very wide rows we have seen with very frequently edited articles. A partition / hash key of (title, chunk_number) would nicely distribute chunks for large pages across the cluster. The most obvious choice for a range key to provide time-based clustering is a tid, resulting in an index definition of:

- property: title
  type: hash
- property: chunk_number
  type: hash
- property: tid
  type: range

Consistency

Writing multiple chunks is not an atomic operation. Partial failures are possible anywhere along the way. We need to make sure that no matter what happens, readers still see a consistent state, rather than a mix of old & new chunks.

A standard way to ensure this is to atomically update a manifest entry point listing the precise tids of each chunk *after* all chunks have been successfully written. Reads then first retrieve the manifest, followed by the precise chunk versions listed in the manifest.
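
A sketch of this manifest scheme with hypothetical manifestTable / chunkTable helpers; the manifest row is only written once all chunks are durable:

    // Write path: chunks first, manifest last. Readers that start from the
    // manifest can never observe a mix of old & new chunks.
    async function writeRender(title, tid, chunkRefs) {
        // chunkRefs: [{ chunk_number, tid }] for all chunks of this render,
        // already written at this point.
        await manifestTable.put({ title, tid, chunks: chunkRefs });
    }

    // Read path: one round-trip for the manifest, then parallel chunk reads.
    async function readRender(title, tid) {
        const manifest = await manifestTable.get({ title, tid });
        return Promise.all(manifest.chunks.map(ref =>
            chunkTable.get({ title, chunk_number: ref.chunk_number, tid: ref.tid })));
    }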

This introduces an additional round-trip for the manifest read in the read path. We could avoid this by using a technique similar to RAMP transactions:

Writes

  1. Add a field pointing to a transaction record in each chunk, and write it with each in-progress chunk. Add an is_last boolean to the last chunk.
  2. Mark the transaction record as 'committed' after writing all chunks. We could take a small risk & go with a simple write for this. If we want to be sure this is atomic we'd use CAS.
  3. Mark all newly-written chunks as committed.

Reads

  1. Parallel read of
    1. Render metadata, and
    2. The first chunk at desired time, using the tid field. Start some speculative read-ahead for subsequent chunks to achieve a reasonable amount of parallelism.
  2. If any of the chunks are not marked as committed, check the pointed-to transaction record (see the sketch after this list):
    1. If the transaction record indicates that the transaction committed, mark the chunk as committed as well & use it.
    2. If the transaction is not committed, or the transaction record is missing after being GC'ed, delete the chunk, read the next older chunk & repeat from 1).
    3. If the transaction is not committed & recent, just read the next older chunk & repeat from 1).
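
A rough sketch of step 2 of the read path; chunkTable, txnTable and isRecent are hypothetical helpers:

    // Return the newest committed version of a chunk at or before `tid`,
    // repairing or cleaning up uncommitted chunks along the way.
    async function readCommittedChunk(title, chunkNumber, tid) {
        let chunk = await chunkTable.getLatest({ title, chunk_number: chunkNumber,
            maxTid: tid });
        while (chunk && !chunk.committed) {
            const txn = await txnTable.get({ id: chunk.txn_id });
            if (txn && txn.committed) {
                // a) Transaction committed: repair the chunk's marker & use it.
                await chunkTable.markCommitted(chunk);
                return chunk;
            }
            if (!isRecent(chunk.tid)) {
                // b) Old chunk whose transaction never committed (or whose
                //    record was GC'ed): delete it before falling back.
                await chunkTable.remove(chunk);
            }
            // b) / c) Read the next older chunk version & repeat.
            chunk = await chunkTable.getLatest({ title, chunk_number: chunkNumber,
                maxTid: chunk.tid, exclusive: true });
        }
        return chunk;
    }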

The main advantage of this scheme is getting the first byte in a single Cassandra round-trip.

The main downside is added complexity in both read & write paths.

In either scheme, we should store & validate the exact chunk tids used in the render, the content length & content-type. We could also consider storing & validating a checksum, although the cost for this would be higher.
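
A sketch of what such validation could look like (field names are hypothetical):

    // Check that the reassembled body matches the metadata stored with the
    // render: exact chunk tids and total content length. A checksum check
    // could be added here at some extra CPU cost.
    function validateRender(meta, chunks, body) {
        const tidsMatch = meta.chunk_tids.length === chunks.length &&
            meta.chunk_tids.every((tid, i) => tid === chunks[i].tid);
        return tidsMatch && body.length === meta.content_length;
    }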

Retention policies & deletions

Retention policies would only delete chunks that aren't needed any more. This is slightly complicated by the lack of explicit revision information. Instead, the base tid of the current revision would need to be used as a lower bound to identify renders, with the chunks used in the latest render as the exclusive upper bound.

Listings

We currently provide some listings for renders (example). In the current storage layout, these don't scale very well. Listings for pages with a lot of renders / revisions routinely time out after two seconds.

Using compact render metadata doubling as transaction records for these listings would be a lot more efficient.

Related Objects

Event Timeline

GWicke raised the priority of this task to Medium.
GWicke updated the task description.
GWicke added a project: RESTBase.

Over the last couple of days I've implemented a simple prototype for chunked storage and gathered some real data.

I've tried a couple of approaches, but let's start with the good news: we can get the compression ratio down to 3.5% for HTML. The algorithm is fairly simple: we just slice the article into 31k chunks and store them in the chunks table with the following structure:

    attributes: {
        key: opts.keyType || 'string',
        chunk_id: 'int',
        tid: 'timeuuid',
        value: opts.valueType || 'blob'
    },
    index: [
        { attribute: 'key', type: 'hash' },
        { attribute: 'chunk_id', type: 'hash' },
        { attribute: 'tid', type: 'range', order: 'desc' }
    ]

using deflate compression with a 1024kb block size. The data in the chunks is fairly repetitive, so compression picks up the repetitions quite well. For the Barack_Obama article this gives us a 3.8% compression ratio, which is almost 5 times better than the 16.9% baseline we get with naive key_rev_value storage. For wikitext the difference is not as large, but we still gain roughly 3x. Latency and throughput are obviously affected: on my machine I get 70 req/s when fetching an article from key_rev_value and only about 40 req/s with chunked_bucket.

At first we had the idea to chunk by sections, so that an untouched section could be reused. However, Parsoid HTML doesn't produce stable enough output to make that work. Sectioning also runs into a different problem: when a small piece of content is inserted into a section, the following sections shift, so the compression algorithm stops picking up those repetitions. The best result I could achieve for Obama with sectioning is a ~9% compression ratio for HTML. For wikitext this approach works better (11% vs 32%), as sections are often reused, but the naive chunking approach gives the same result, so it doesn't make sense to follow a harder path for the same outcome.


This sounds great. One question: is this with inlined data-mw or with data-mw stripped out?

@ssastry This is for exactly the same html that we store in RESTBase now.

We have ideas for improving this further once data-mw is stripped out. Right now some of the data-mw attributes are not stable enough to support reusing chunks between revisions, so we rely on the compression algorithm to pick up the repetitions. If we strip data-mw out, we expect the compression ratio to improve even more.


Okay, great .. yes, that is what I was wondering .. if you were storing inlined-data-mw html and if so, how these ratios would change when data-mw is moved out.

Today I ran a bigger experiment: following the recent changes on en.wikipedia, I loaded the latest 500 revisions of each changed title into both chunked_bucket and key_rev_value. In total I loaded 97081 revisions of various articles, so the sample should be close to what we would see in production.

Here're the compression results:

                  | key_rev_value | chunked_bucket
Stored data size  | 2.4 Gb        | 494 Mb
Compression ratio | 0.1659        | 0.0367

So we got almost a 5x improvement. Also an interesting piece of data: 500593 chunks were stored, so the average number of chunks per revision is 5.15.

Also, I've run some performance tests with a filled-up storage:

                                   | chunked_bucket, 11-chunk article | chunked_bucket, 3-chunk article | key_rev_value, 11-chunk article | key_rev_value, 3-chunk article
Requests per second [#/sec] (mean) | 130.70                           | 315.09                          | 318.71                          | 540.77
Time per request [ms] (mean)       | 153.019                          | 63.475                          | 62.753                          | 36.984

So we have a roughly 2x degradation in throughput/latency, and a 5x improvement in compression.

@Pchelolo, those are really encouraging results, especially at such an early stage. Great work!

I'll see about adding streaming response support in RESTBase, which should help reduce time to first byte & might help throughput as well.

After @Eevans implemented brotli compression for Cassandra, I ran a new set of tests and collected some more data:

  1. Random sample of recent changes.

100 articles, max 500 revisions per article, average article size: 140k, uncompressed data size: ~1Gb
Results for benchmarking Thermonuclear_weapon, Size: 240k

Algorithm                                                                | Compression ratio | Throughput   | Latency
key_rev_value, deflate 1024                                              | 0.1744            | 294.06 req/s | 68.013 ms
key_rev_value, brotli-1, lgblock 4k, chunk_length_kb 4k                  | 0.0173            | 282.88 req/s | 70.700 ms
key_rev_large_value, chunk 90k, brotli-1, lgblock 4k, chunk_length_kb 4k | 0.0121            | 234.13 req/s | 85.422 ms
key_rev_large_value, chunk 31k, deflate 1024                             | 0.0289            | 147.16 req/s | 135.90 ms
  2. Results for benchmarking Obama:

Algorithm                                                                | Compression ratio | Throughput   | Latency
key_rev_value, deflate 1024                                              | 0.1703            | 121.91 req/s | 164.060 ms
key_rev_value, brotli-1, lgblock 4k, chunk_length_kb 4k                  | 0.0791            | 127.72 req/s | 156.598 ms
key_rev_value, brotli-1, lgblock 8k, chunk_length_kb 8k                  | 0.0536            | 132.36 req/s | 151.106 ms
key_rev_large_value, chunk 90k, brotli-1, lgblock 4k, chunk_length_kb 4k | 0.0129            | 23.45 req/s  | 852.971 ms
key_rev_large_value, chunk 31k, deflate 1024                             | 0.0322            | 20.89 req/s  | 957.486 ms

Based on the results above I would expect larger chunk sizes (256k? 512k?) to do better on throughput, while not losing much on compression ratio.

An issue we have seen with very large partitions in production (T94121) is high latency, and in some cases the inability to access rows for updates. This was with at most six months' worth of data, so the problem will be a lot more pressing with a full history. It is likely that the overall byte size of the partition played a major role in this issue: partitions with a largish number of small rows are not known to be a major performance problem in Cassandra, and the number of rows in our case wasn't even that large; the byte size, however, was.

The key_rev_large_value strategy of using a meta table indirection reduces this issue to some degree by keeping the meta partition & per-chunk partitions reasonably small. There are also further optimizations possible through the meta table, like bucketing chunks to bound the chunk partition sizes.

For those reasons I think it's worth investigating chunked storage a bit further. I can look into streaming responses tomorrow.

@Pchelolo, are your test tools available somewhere?

The test script can be found here: https://gist.github.com/Pchelolo/5fefa00e5f1047bb8a41

To use it, the RESTBase code needs to be modified a little to not block direct sys requests, and the buckets need to be created somewhere.

Some more results with Obama:

Bucket type                      | Compression | C. block size | C. ratio | Read throughput
key_rev_value                    | deflate     | 256k          | 17.3%    | 146 req/s
key_rev_value                    | deflate     | 1024k         | 17.0%    | 149 req/s
key_rev_value                    | brotli 0    | 32m           | 3.9%     | OOM at default concurrency 20 / 2G heap; 192 req/s at concurrency 10
key_rev_value                    | brotli 1    | 16m           | 3.99%    | 204 req/s
key_rev_value                    | brotli 4    | 16m           | 2.66%    | 190 req/s
key_rev_value                    | brotli 6    | 16m           | 1.63%    | 204 req/s
key_rev_value                    | brotli 9    | 16m           | 1.57%    | 219 req/s
key_rev_value                    | brotli 9    | 8m            | 2.72%    | 202 req/s
key_rev_large_value, 31k chunks  | deflate     | 512k          | ~4%      | 23 req/s
key_rev_large_value, 1m chunks   | brotli 1    | 8m            | 3.27%    | 94 req/s
key_rev_large_value, 2m chunks   | brotli 1    | 16m           | 3.99%    | 196 req/s
key_rev_large_value, 1m chunks   | brotli 1    | 16m           | 2.5%     | 80 req/s
key_rev_large_value, 1.8m chunks | brotli 6    | 16m           | 1.69%    | 194 req/s
key_rev_large_value, 1.8m chunks | brotli 9    | 16m           | 1.63%    | 203 req/s

Some comments:

Brotli levels

Medium to high brotli levels bring significant improvements when using large compression blocks. They are a win for read performance in any case, at 2x - 3x the compression CPU usage. Compression levels 10 and 11 should not be used, as compression speed falls off a cliff & even decompression suffers badly. Level 9 is the sweet spot for decompression.

Compression block sizes

While 16m compression block sizes provide somewhat better compression than 8m by potentially picking up more repetitions of the same pattern, it is not clear that the cost in terms of IO is worth it. 8m might be close to a sweet spot, with read sizes of around 70kb for mixed content (compression ratio 1.2%). 16m blocks at a compression ratio of 0.87% result in a 142k read per compression block.

Edit: In the read size / throughput graph for the SSDs we are using, 128k is where throughput plateaus. It is possible that 16m compression blocks squeeze in just below that threshold with a mixed data set.

Chunked storage vs. key_rev_value

Chunked storage is relatively competitive if the number of chunks works out to one. A chunk size of 1.8m or 2m would make this true for all but the most extreme pages. We should perhaps run some tests against some of the super-wide rows listed in T94121, to see whether regular key_rev_value with a vastly improved compression ratio would solve the scalability problem.

Another option worth investigating is the previous idea of storing metadata and a limited amount of content in a first chunk, and storing the remainder of overly large pages in overflow chunks. This might provide a good middle ground between scalability and smaller-scale performance.

Today I played a bit with nodetool scrub timings of different brotli settings vs. deflate. The (somewhat surprising) results for parsoid_html with a few GB of revisions:

Compression | Block size | time nodetool scrub local_group_default_T_parsoid_html data | Compression ratio
deflate     | 16m        | 420s | 23.6%
deflate     | 256k       | 507s |
brotli 0    | 8m         | 53s  |
brotli 0    | 16m        | 50s  | 1.37%
brotli 3    | 8m         | 53s  |
brotli 3    | 16m        | 49s  | 1.31%
brotli 5    | 16m        | 104s | 1.09%
brotli 6    | 8m         | 119s |
brotli 6    | 16m        | 102s | 1.07%
brotli 9    | 16m        | 158s | 1.03%

The surprising bit is that brotli re-compacts the SSTable faster than deflate, even at compression level 9.

Results from T125906 so far show decent stability and performance with brotli 8m compression blocks. A heap of 8g & a G1HeapRegionSize of 32m were needed to avoid OOMs during heavy read activity.

While more testing is needed (especially around the general performance impact of 32m regions), it seems clear that key_rev_value buckets combined with large-block brotli compression are a viable option for archival storage. Compared to chunked storage, strong points are

  1. easy migration of existing data by switching key_rev_value buckets to brotli compression, and
  2. basically no code changes needed beyond the compression library in Cassandra.

Chunked storage can potentially achieve even higher compression ratios, and can do so even for extremely large pages. However, the data collected so far suggests that brotli gets us far enough for a fraction of the cost.

GWicke renamed this task from "RFC: Chunked storage algorithms for archival data" to "RFC: Chunked storage algorithms for archival data vs. large-window brotli compression". Feb 16 2016, 8:36 PM

An illustration of the listing scaling issue is https://en.wikipedia.org/api/rest_v1/page/html/San_Francisco/, which currently times out reliably.

@Eevans @mobrovac Do we have any plans to use brotli any more? Should we decline this?


Brotli compression is interesting, but I do not believe this is a priority for us to work on. If anyone disagrees, feel free to re-open.