Maniphest T122028

RFC: Chunked storage algorithms for archival data vs. large-window brotli compression
Closed, DeclinedPublic
Actions

Assigned To

None

Authored By

	• GWicke
	Dec 21 2015, 1:18 AM

Description

Parameters

The Compression algorithm block size is perhaps the most important parameter for our chunking strategy. Currently, our only option is deflate with a relatively small 32k window, but we could add brotli compression support in order to support window sizes up to 16m (see T93496). To benefit from delta compression, we need to group chunks of a size smaller than the compression algorithm's window size.

The second critical factor in our design is overheads per IO / Cassandra request. At a relatively small chunk size of 32k, reading the 1.2mb of HTML for large pages like Barack Obama will result in about 37 IOs. While this isn't necessarily bad for latency of individual requests, it will very likely reduce the throughput that can be sustained by a given cluster. While this might not matter so much for archival use cases with relatively low request rates, it is likely that a larger chunk size would provide a better trade-off between latency and throughput.

Chunking strategies

The simplest chunking strategy cuts up an input into a sequence of fixed-size chunks. The disadvantage of this scheme is that small changes early in a document will modify all following chunks, which in turn makes it necessary to store new versions of all those chunks.

Content like HTML offers semantic boundaries like <section> tags (see T114072), which we can exploit to maintain alignment of chunks despite small changes. To provide some slack for small changes, chunks should have a minimum and maximum size. For example, with a maximum chunk size of 32k it might make sense to enforce a minimum chunk size of perhaps 24k. If one or more natural elements like <section>s can fill a chunk to between 24 and 32k size, then that alignment is used. If 24k is not reached & the following natural <section> would be larger than 32, it is cut up at the maximum byte size, reverting to a byte-based chunking.

Storing only changed chunks

In either chunking strategy it is likely that many / most chunks won't change at all between versions. We can save significant storage size and compaction load by sharing such chunks between successive versions. On write, we'd read all chunks for the previous version & compare them to the new chunks. Unchanged chunks are then skipped, and only changed chunks actually written.

This brings up the question of how we should best index chunks. We would like to avoid some of the very wide rows we have seen with very frequently edited articles. A partition / hash key of (title, chunk_number) would nicely distribute chunks for large pages across the cluster. The most obvious choice for a range key to provide time-based clustering is a tid, resulting in an index definition of:

- property: title
  type: hash
- property: chunk_number
  type: hash
- property: tid
  type: range

Consistency

Writing multiple chunks is not an atomic operation. Partial failures are possible anywhere along the way. We need to make sure that no matter what happens, readers still see a consistent state, rather than a mix of old & new chunks.

A standard way to ensure this is to atomically update a manifest entry point listing the precise tids of each chunk *after* all chunks have been successfully written. Reads then first retrieve the manifest, followed by the precise chunk versions listed in the manifest.

This introduces an additional round-trip for the manifest read in the read path. We could avoid this by using a technique similar to RAMP transactions:

Writes

Add a field pointing to a transaction record in each chunk, and write it with each in-progress chunk. Add an is_last boolean to the last chunk.
Mark the transaction record as 'committed' after writing all chunks. We could take a small risk & go with a simple write for this. If we want to be sure this is atomic we'd use CAS.
Mark all newly-written chunks as committed.

Reads

Parallel read of
1. Render metadata, and
2. The first chunk at desired time, using the tid field. Start some speculative read-ahead for subsequent chunks to achieve a reasonable amount of parallelism.
If any of the chunks are not marked as committed, check the pointed-to transaction record. a) If the transaction record indicates that the transaction committed, mark the chunk as committed as well & use it. b) If the transaction is not committed / the transaction record is missing after being GC'ed, delete the chunk and read the next older chunk. Go to 1). c) If the transaction is not committed & recent, just read the next older chunk & repeat from 1)

The main advantage of this scheme is getting the first byte in a single Cassandra round-trip.

The main downside is added complexity in both read & write paths.

In either scheme, we should store & validate the exact chunk tids used in the render, the content length & content-type. We could also consider storing & validating a checksum, although the cost for this would be higher.

Retention policies & deletions

Retention policies would only delete chunks that aren't needed any more. This is slightly complicated by the lack of explicit revision information. Instead, the base tid of the current revision would need to be used as a lower bound to identify renders, with the chunks used in the latest render as the exclusive upper bound.

Listings

We currently provide some listings for renders (example). In the current storage layout, these don't scale very well. Listings for pages with a lot of renders / revisions routinely time out after two seconds.

Using compact render metadata doubling as transaction records for these listings would be a lot more efficient.

Related Objects
Search...

Status	Assigned	Task
Declined	None	T120171 RFC: Differentiate storage strategies for archival storage vs. hot current data
Declined	None	T122028 RFC: Chunked storage algorithms for archival data vs. large-window brotli compression
Declined	Eevans	T125906 Evaluate Brotli compression for Cassandra
Resolved	Eevans	T126629 Cassandra 2.2.6
Resolved	None	T127365 Cassandra upgrades in staging attempted to start root instance
Open	None	T137419 Investigate aberrant disk read throughput in Cassandra (affects 2.2.x and 3.x)
Resolved	Eevans	T137474 Investigate lack of recency bias in Cassandra histogram metrics
Resolved	Eevans	T139639 RESTBase: Cassandra 2.2.6 post-upgrade checklist

Event Timeline

• GWicke created this task.Dec 21 2015, 1:18 AM

• GWicke raised the priority of this task from to Medium.

• GWicke updated the task description. (Show Details)

• GWicke added a project: RESTBase.

• GWicke added subscribers: • Pchelolo, Eevans, • mobrovac, • GWicke.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 21 2015, 1:18 AM

• GWicke updated the task description. (Show Details)Dec 21 2015, 1:23 AM

• GWicke set Security to None.

• GWicke updated the task description. (Show Details)Dec 21 2015, 1:25 AM

• GWicke updated the task description. (Show Details)Dec 21 2015, 1:39 AM

• GWicke mentioned this in T120171: RFC: Differentiate storage strategies for archival storage vs. hot current data.Jan 16 2016, 2:24 AM

Over the last couple of days I've implemented a simple prototype for a chunked storage and got some real data.

I've tried a couple of approaches, but let's start from good news: we can get compression ration down to 3.5% for html. The algorithm is fairly simple: we just slice up the article to 31k chunks and store them in the chunks table with the following structure:

attributes: {
            key: opts.keyType || 'string',
            chunk_id: 'int',
            tid: 'timeuuid',
            value: opts.valueType || 'blob',
        },
        index: [
            { attribute: 'key', type: 'hash' },
            { attribute: 'chunk_id', type: 'hash' },
            { attribute: 'tid', type: 'range', order: 'desc' }
        ]

using deflate compression with 1024kb block size. The data in chunks is fairly repetitive, so compression also picks up repetitions fairly well. For Barack_Obama article this gives us 3.8% compression ratio, which is almost 5 times better, then baseline 16.9% we have when using a naive key_rev_value storage. For wikitext the difference is not that huge, but we're still winning around 3 times. Latency and throughput are obviously affected: on my machine I have 70req/s when fetching an article from key_rev_value and only about 40req/s with chunked_bucked.

At first, we've had an idea to chunk by sections, so that if the section is untouched, it could be reused. However, parsoid html doesn't give stable enough results to make that work. Also, sectioning runs into a different problem: when some small piece of content is inserted to a section, this might make other sections shift by one, so the compression algorithms stop picking all those repetitions. The best result I could achieve for Obama with sectioning is ~9% compression ratio for html. For wikitext this approach works better (11% vs 32%) as sections are often reused, but naive chunking approach gives the same result, so it doesn't make sense to follow a harder path that gives the same result.

In T122028#1954407, @Pchelolo wrote:

Over the last couple of days I've implemented a simple prototype for a chunked storage and got some real data.

I've tried a couple of approaches, but let's start from good news: we can get compression ration down to 3.5% for html. The algorithm is fairly simple: we just slice up the article to 31k chunks and store them in the chunks table with the following structure:
...
using deflate compression with 1024kb block size. The data in chunks is fairly repetitive, so compression also picks up repetitions fairly well. For Barack_Obama article this gives us 3.8% compression ratio, which is almost 5 times better, then baseline 16.9% we have when using a naive key_rev_value storage.

This sounds great. One question: is this with inlined data-mw or with data-mw stripped out?

@ssastry This is for exactly the same html that we store in RESTBase now.

We have ideas how to improve this further when data-mw will be stripped out. Right now some of the data-mw attributes are not stable enough to support reusing chunks between revisions, so we rely on the compression algorithm to pick up the repetitions. If we strip that out, we expect to improve the compression ratio even more.

In T122028#1954630, @Pchelolo wrote:

@ssastry This is for exactly the same html that we store in RESTBase now.

We have ideas how to improve this further when data-mw will be stripped out. Right now some of the data-mw attributes are not stable enough to support reusing chunks between revisions, so we rely on the compression algorithm to pick up the repetitions. If we strip that out, we expect to improve the compression ratio even more.

Okay, great .. yes, that is what I was wondering .. if you were storing inlined-data-mw html and if so, how these ratios would change when data-mw is moved out.

• GWicke mentioned this in T122390: Is RDFa metadata in Parsoid HTML head actually useful to you / no user name & edit comment suppression in Parsoid <head> metadata.Jan 22 2016, 7:51 PM

Today I've run a bigger experiment: following the recent changes on en.wikipedia, I've loaded 500 latest revisions for changed titles both to the chunked_bucket and key_rev_value. Totally I've loaded 97081 revisions of various articles, so the sample should show results close to what we would have in production.

Here're the compression results:

	key_rev_value	chunked_bucket
Stored data size	2.4 Gb	494 Mb
Compression ratio	0.1659	0.0367

So, we've got almost 5 times improvement. Also, an interesting piece of data: I've got 500593 chunks stored, so average number of chunks per revision is 5.15.

Also, I've run some performance tests with a filled-up storage:

	chunked_bucket, 11 chunks article	chunked_bucket, 3 chunks article	key_rev_value, 11 chunks article	key_rev_value, 3 chunks article
Requests per second [#/sec] (mean)	130.70	315.09	318.71	540.77
Time per request [ms] (mean)	153.019	63.475	62.753	36.984

So, we have 2 times degradation in throughput/latency, and 5 times improvement in compression.

@Pchelolo, those are really encouraging results, especially at such an early stage. Great work!

I'll see about adding streaming response support in RESTBase, which should help reduce time to first byte & might help throughput as well.

After @Eevans implemented brotli compression for cassandra, I've run a new set of tests and got some more data:

Random sample of recent changes.

100 articles, max 500 revisions per article, average article size: 140k, uncompressed data size: ~1Gb
Results for benchmarking Thermonuclear_weapon, Size: 240k

Algorithm	Compression Ratio	Throughput	Latency
key_rev_value, deflate 1024	0.1744	294.06 req/s	68.013ms
key_rev_value, brotli-1, lgblock 4k, chunk_length_kb 4k	0.0173	282.88 req/s	70.700ms
key_rev_large_value, chunk 90k, brotli-1, lgblock 4k, chunk_length_kb 4k	0.0121	234.13 req/s	85.422ms
key_rev_large_value, chunk 31k, deflate 1024	0.0289	147.16 req/s	135.90ms

Results for benchmarking Obama:

Algorithm	Compression Ratio	Throughput	Latency
key_rev_value, deflate 1024	0.1703	121.91 req/s	164.060ms
key_rev_value, brotli-1, lgblock 4k, chunk_length_kb 4k	0.0791	127.72 req/s	156.598ms
key_rev_value, brotli-1, lgblock 8k, chunk_length_kb 8k	0.0536	132.36 req/s	151.106ms
key_rev_large_value, chunk 90k, brotli-1, lgblock 4k, chunk_length_kb 4k	0.0129	23.45 req/s	852.971ms
key_rev_large_value, chunk 31k, deflate 1024	0.0322	20.89 req/s	957.486ms

Based on the results above I would expect larger chunk sizes (256k? 512k?) to do better on throughput, while not losing much on compression ratio.

An issue we have seen with very huge partitions in production (T94121) is high latency and in some cases inability of accessing rows for updates. This was only with at most six month's worth of data, so the problem will be a lot more pressing with a full history. It is likely that the overall size of the partition played a major role in this issue. Generally, having partitions with a largeish number of small rows is not known to be a major performance issue in Cassandra, and the number of rows wasn't even so large in our case; the byte size, however, was.

The key_rev_large_value strategy of using a meta table indirection reduces this issue to some degree by keeping the meta partition & per-chunk partitions reasonably small. There are also further optimizations possible through the meta table, like bucketing chunks to bound the chunk partition sizes.

For those reasons I think it's worth investigating chunked storage a bit further. I can look into streaming responses tomorrow.

@Pchelolo, are your test tools available somewhere?

The test script can be found here: https://gist.github.com/Pchelolo/5fefa00e5f1047bb8a41

To use it RESTBase code should be modified a little bit, to not block direct sys requests. And we need to create the buckets somewhere.

@Pchelolo, thanks!

Some more results with Obama:

Bucket type	Compression	C. block size	C. ratio	read throughput
key_rev_value	deflate	256k	17.3%	146 req/s
key_rev_value	deflate	1024k	17.0%	149 req/s
key_rev_value	brotli 0	32m	3.9%	OOM at default z#concurrency 20 / 2G heap; 192 req/s at concurrency 10
key_rev_value	brotli 1	16m	3.99%	204 req/s
key_rev_value	brotli 4	16m	2.66%	190 req/s
key_rev_value	brotli 6	16m	1.63%	204 req/s
key_rev_value	brotli 9	16m	1.57%	219 req/s
key_rev_value	brotli 9	8m	2.72%	202 req/s
key_rev_large_value, 31k chunks	deflate	512k	~4%	23 req/s
key_rev_large_value, 1m chunks	brotli 1	8m	3.27%	94 req/s
key_rev_large_value, 2m chunks	brotli 1	16m	3.99%	196 req/s
key_rev_large_value, 1m chunks	brotli 1	16m	2.5%	80 req/s
key_rev_large_value, 1.8m chunks	brotli 6	16m	1.69%	194 req/s
key_rev_large_value, 1.8m chunks	brotli 9	16m	1.63%	203 req/s

Some comments:

Brotli levels

Medium to high brotli levels bring significant improvements when using large compression blocks. They are a win for read performance in any case, at a 2x - 3x increased compression CPU usage. Compression levels 10 and 11 should not be used, as compression speeds fall off a cliff & even decompression suffers badly. 9 is the sweet spot for decompression.

Compression block sizes

While 16m compression block sizes provide some improved compression over 8m sizes by potentially picking up more repetitions of the same pattern, is is not clear that the cost in terms of IO is worth it. 8m might be close to a sweet spot, with read sizes of around 70kb when using mixed content (compression ratio 1.2%). 16m blocks at a compression ratio of 0.87% result in a 142k read per compression block.

Edit: In the read size / throughput graph for the SSDs we are using, 128k is where throughput plateaus. It is possible that 16m compression blocks squeeze in just below that threshold with a mixed data set.

Chunked storage vs. key_rev_value

Chunked storage is relatively competitive if the number of chunks work out to one. A chunk size of 1.8m or 2m would make this true for all but the most extreme pages. We should perhaps perform some tests with some of the super wide rows listed in T94121, to see if a regular key_rev_value with vastly improved compression ratio would solve the scalability problem.

Another option worth investigating is the previous idea of storing metadata and a limited amount of content in a first chunk, and storing the remainder of overly large pages in overflow chunks. This might provide a good middle ground between scalability and smaller-scale performance.

Today I played a bit with nodetool scrub timings of different brotli settings vs. deflate. The (somewhat surprising) results for parsoid_html with a few GB of revisions:

Compression	Block size	time nodetool scrub local_group_default_T_parsoid_html data	Compression ratio
deflate	16m	420s	23.6%
deflate	256k	507s
brotli 0	8m	53s
brotli 0	16m	50s	1.37%
brotli 3	8m	53s
brotli 3	16m	49s	1.31%
brotli 5	16m	104s	1.09%
brotli 6	8m	119s
brotli 6	16m	102s	1.07%
brotli 9	16m	158s	1.03%

The surprising bit is that brotli re-compacts the SSTable faster than deflate, even at compression level 9.

• GWicke mentioned this in T125906: Evaluate Brotli compression for Cassandra.Feb 8 2016, 5:45 PM

fgiunchedi subscribed.Feb 9 2016, 9:47 AM

• GWicke added a subtask: T125906: Evaluate Brotli compression for Cassandra.Feb 15 2016, 5:10 PM

Results from T125906 so far show decent stability and performance with brotli 8m compression blocks. A heap of 8g & G1HeapRegionSize of 32 was needed to avoid OOMs during heavy read activity.

While more testing is needed (especially around the general performance impact of 32m regions), it seems clear that key_rev_value buckets combined with large-block brotli compression are a viable option for archival storage. Compared to chunked storage, strong points are

easy migration of existing data by switching key_rev_value buckets to brotli compression, and
basically no code changes needed beyond the compression library in Cassandra.

Chunked storage can potentially achieve even higher compression ratios, and can do so even for extremely large pages. However, the data collected so far suggests that brotli gets us far enough for a fraction of the cost.

• GWicke renamed this task from RFC: Chunked storage algorithms for archival data to RFC: Chunked storage algorithms for archival data vs. large-window brotli compression.Feb 16 2016, 8:36 PM

An illustration of the listing scaling issue is https://en.wikipedia.org/api/rest_v1/page/html/San_Francisco/, which currently times out reliably.

Danny_B added a project: Proposal.May 2 2016, 10:38 PM

• GWicke mentioned this in T107595: [RFC] Multi-Content Revisions.May 4 2016, 5:02 PM

• GWicke added a project: Services (next).Oct 12 2016, 4:58 PM

• GWicke mentioned this in T132815: Develop a RESTBase Cassandra cluster capacity planning document.

• GWicke mentioned this in T147961: Deploy Brotli compression support for Cassandra.Oct 12 2016, 7:22 PM

• GWicke mentioned this in T156209: Design notes for scalable and cost-effective revision archival storage.Jan 25 2017, 1:10 AM

Eevans closed subtask T125906: Evaluate Brotli compression for Cassandra as Declined.Jul 13 2017, 8:55 PM

@Eevans @mobrovac Do we have any plans to use brotly any more? Should we decline this?

In T122028#4394277, @Pchelolo wrote:

@Eevans @mobrovac Do we have any plans to use brotly any more? Should we decline this?

Brotli compression is interesting, but I do not believe this is a priority for us to work on. If anyone disagrees, feel free to re-open.

RFC: Chunked storage algorithms for archival data vs. large-window brotli compressionClosed, DeclinedPublicActions