Our data model for HTML content does not distinguish between low-latency, high-volume access to current revisions and long-term archival storage. This leaves room for optimizing each of those two use cases separately.
Compressed HTML content currently comes to around 16% of the input size. Since the changes between revisions are typically small, ratios in the low single-digit percent range ought to be possible. The main obstacle for HTML is the deflate window size of 32k, which fails to pick up repetitions between revisions of articles larger than 32k; such articles are relatively common. We could add Brotli support to Cassandra to get larger windows, but to exploit this for whole-article storage we would then need to use extremely large input block sizes to pick up a decent number of repetitions. This in turn would likely make reads slower and more memory-intensive.
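The window limitation is easy to demonstrate with Python's zlib (an illustrative experiment, not part of the design): two identical "revisions" compress well only if the repeat falls within the 32 KiB deflate window.

```python
import random
import zlib

rng = random.Random(42)
small = bytes(rng.getrandbits(8) for _ in range(16 * 1024))  # 16 KiB of incompressible data
big = bytes(rng.getrandbits(8) for _ in range(64 * 1024))    # 64 KiB of incompressible data

# Two identical 16 KiB revisions: the repeat starts 16 KiB back, within
# the 32 KiB window, so the second copy is encoded as back-references.
print(len(zlib.compress(small + small, 9)) / len(small + small))  # ratio close to 0.5

# Two identical 64 KiB revisions: the repeat starts 64 KiB back, beyond
# the window, so deflate cannot reference it and stores both copies.
print(len(zlib.compress(big + big, 9)) / len(big + big))          # ratio close to 1.0
```

The same effect is what keeps whole-article compression of adjacent revisions from reaching low single-digit ratios for articles larger than the window.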
A better option to reduce storage needs could be to chunk content, ideally in alignment with semantic blocks like top-level sections in HTML. If these chunks are smaller than the compression algorithm's window size (32k for deflate), then it will pick up repetitions between chunks. Additionally, most edits only affect a single chunk in a large document. We can skip adding new versions of unchanged chunks altogether, which also reduces the write load. The first chunk should normally load more quickly than an entire document, reducing the time to first byte. There is also growing demand for section-based content loading at the API level, which can be efficiently supported by storing the content in sections in the first place.
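A minimal sketch of the chunking idea, assuming chunks are delimited by top-level `<section>` tags and deduplicated by content hash (the function names and the regex-based splitter are illustrative; a real implementation would use an HTML parser):

```python
import hashlib
import re

def chunk_by_sections(html):
    # Split before each top-level <section> tag; illustrative only,
    # and does not handle nested sections.
    return [c for c in re.split(r'(?=<section\b)', html) if c]

def changed_chunks(old_html, new_html):
    # Return only the chunks of the new revision whose content hash is
    # not present in the old revision; unchanged chunks need not be
    # written again, reducing write load.
    old = {hashlib.sha1(c.encode()).digest() for c in chunk_by_sections(old_html)}
    return [c for c in chunk_by_sections(new_html)
            if hashlib.sha1(c.encode()).digest() not in old]
```

With chunks under 32k, deflate's window then covers each chunk pair, and an edit touching one section writes only that section's new chunk.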
The deflate / gzip window size of 32k is likely smaller than what we'd pick for an optimal trade-off between number of IOs and compression block size needed to pick up a decent number of repetitions. Adding Brotli support in Cassandra can give us a wider range of options for this trade-off. However, we can get started with deflate & leverage Brotli in a later iteration.
Another consideration is separating hot from cold storage, so that we can replicate hot data to the edge, but keep cold archival data only in two DCs and possibly on more density-optimized hardware. We can do this relatively easily by storing current revisions in a key-value bucket in addition to archival storage. Within this bucket, we can store each revision as an individually gzip-compressed blob, ready to be streamed to the client without any de/re-compression overheads. The main gains in this scheme should come from the lack of extra reads and computation in Cassandra, as well as avoiding the need to compress data on the way out.
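The hot-bucket scheme can be sketched in a few lines, using a plain dict as a stand-in for the key-value bucket (names are hypothetical):

```python
import gzip

def put_current(bucket, title, html):
    # Compress once at write time; the hot bucket holds the blob
    # exactly as it will go over the wire.
    bucket[title] = gzip.compress(html.encode('utf-8'), compresslevel=9)

def get_current(bucket, title):
    # No de/re-compression on the read path: the stored blob is
    # streamed as-is with Content-Encoding: gzip.
    return {'content-encoding': 'gzip'}, bucket[title]
```

Clients that accept gzip receive the stored bytes unchanged; only clients without gzip support would require decompression at the edge.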
Schema sketch
title_revision_restriction table
Stores only restrictions on pages / revisions. This will be a very sparse table with no or very few entries for most pages. As a result, it will easily fit in page cache, making queries consistently fast.
- index: (title, rev)
- static column: page_deleted (tid)
- non-key attributes: other per-revision restrictions
Current revision key-value bucket
Stores the current revision of each title, with per-blob gzip -9 compression and no Cassandra-level compression. Reads don't involve decompressing / re-compressing content. The ETag header contains the revision & tid.
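A possible shape for that ETag, assuming a "{rev}/{tid}" layout quoted per RFC 7232 (the exact format is an assumption, not specified above):

```python
def make_etag(rev, tid):
    # Hypothetical layout: revision number and time-based UUID,
    # wrapped in double quotes as RFC 7232 requires.
    return '"%d/%s"' % (rev, tid)
```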
Archival: key_rev_value bucket, possibly a chunked_key_rev_value bucket (T122028)
Stores the current *and* old versions of pages. Storage is density-optimized, possibly at the cost of some read latency.
Read path for /{title}
In parallel,
- query title_revision_restriction for the title, limit 1
- query current revision key-value bucket for the content
If any restrictions were returned, withhold the content if
a) the page is marked as deleted (static column), or
b) the restriction revision matches the blob revision.
Otherwise, return the content.
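The parallel read path above can be sketched with asyncio; `restrictions.get` and `current_bucket.get` are hypothetical async accessors standing in for the two Cassandra queries:

```python
import asyncio

async def read_title(title, restrictions, current_bucket):
    # Issue both queries concurrently, as described above.
    restriction, (blob, blob_rev) = await asyncio.gather(
        restrictions.get(title),
        current_bucket.get(title),
    )
    if restriction is not None:
        # a) page marked as deleted via the static column
        if restriction.get('page_deleted'):
            raise PermissionError('page deleted')
        # b) restriction applies to the revision we are about to serve
        if restriction.get('rev') == blob_rev:
            raise PermissionError('revision restricted')
    return blob
```

Since the restriction table is sparse and cached, the common case costs one fast restriction lookup overlapped with the blob read.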
Read path for /{title}/{revision}
In parallel,
- query title_revision_restriction for the title and revision, limit 1
- query the key_rev_value bucket for the title / revision.
If restrictions are returned, withhold the content if
a) the page is marked as deleted (static column), or
b) the restriction revision matches the blob revision.
Otherwise, return the content.