
RFC: Differentiate storage strategies for archival storage vs. hot current data
Closed, Declined · Public

Description

Our data model for HTML content does not distinguish between low-latency high-volume access to current revisions and long-term archival. This leaves some room for optimization for each of those two use cases.

Compression ratios for HTML content are currently around 16% of the input size. Since the changes between revisions are actually small, ratios in the low single-digit percent range ought to be possible. The main obstacle for HTML is the 32k deflate window, which does not pick up repetitions between revisions of articles larger than 32k; such articles are relatively common. We could add Brotli support to Cassandra to get larger windows, but to exploit this for whole-article storage we would then need to use extremely large input block sizes to pick up a decent number of repetitions, which would likely make reads slower and more memory-intensive.
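
As a rough, self-contained illustration of the window problem (file names and sizes are placeholders, not production data), compressing two large, near-identical revisions separately vs. concatenated yields roughly the same total:

```typescript
// Rough illustration only: does deflate exploit redundancy across revisions?
import * as fs from 'fs';
import * as zlib from 'zlib';

const revA = fs.readFileSync('revision_A.html'); // e.g. ~120k of HTML
const revB = fs.readFileSync('revision_B.html'); // near-identical next revision

const separate =
  zlib.deflateSync(revA, { level: 9 }).length +
  zlib.deflateSync(revB, { level: 9 }).length;
const together = zlib.deflateSync(Buffer.concat([revA, revB]), { level: 9 }).length;

// For articles well above 32k the two numbers come out close: by the time the
// encoder reaches revB, most of revA has slid out of the 32k window, so the
// repetition between revisions is not exploited.
console.log({ separate, together, saved: separate - together });
```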

A better option to reduce storage needs could be to chunk content, ideally in alignment with semantic blocks like top-level sections in HTML. If these chunks are smaller than the compression algorithm's window size (32k for deflate), then it will pick up repetitions between chunks. Additionally, most edits only affect a single chunk in a large document. We can skip adding new versions of unchanged chunks altogether, which also reduces the write load. The first chunk should normally load more quickly than an entire document, reducing the time to first byte. There is also growing demand for section-based content loading at the API level, which can be efficiently supported by storing the content in sections in the first place.
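
A minimal sketch of the chunking idea, assuming a naive split on top-level <section> boundaries and hypothetical helper names; real Parsoid output would be segmented with a proper HTML parser:

```typescript
import { createHash } from 'crypto';

// Naive top-level split for illustration only.
function splitIntoSections(html: string): string[] {
  return html.split(/(?=<section\b)/);
}

function sha1(chunk: string): string {
  return createHash('sha1').update(chunk).digest('hex');
}

// Returns the chunk hashes (manifest) for the new revision plus the subset of
// chunks that actually need to be written, i.e. those not already present in
// the previous revision's manifest. Unchanged chunks are referenced, not
// re-stored, which also reduces the write load.
function diffChunks(prevManifest: Set<string>, html: string) {
  const chunks = splitIntoSections(html);
  const manifest = chunks.map(sha1);
  const newChunks = chunks.filter((_, i) => !prevManifest.has(manifest[i]));
  return { manifest, newChunks };
}
```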

The deflate / gzip window size of 32k is likely smaller than what we'd pick for an optimal trade-off between number of IOs and compression block size needed to pick up a decent number of repetitions. Adding Brotli support in Cassandra can give us a wider range of options for this trade-off. However, we can get started with deflate & leverage Brotli in a later iteration.
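
For context (an assumption about a later iteration, not something the task prescribes), Node's built-in zlib already exposes Brotli with a window configurable up to 2^24 bytes via BROTLI_PARAM_LGWIN, which is the kind of trade-off knob referred to above; the Cassandra-side integration is a separate matter:

```typescript
import * as zlib from 'zlib';

// Deflate: fixed 32k window, regardless of input size.
function deflateSize(buf: Buffer): number {
  return zlib.deflateSync(buf, { level: 9 }).length;
}

// Brotli: window configurable via BROTLI_PARAM_LGWIN, so a whole-article (or
// multi-revision) compression block can stay inside the window.
function brotliSize(buf: Buffer): number {
  return zlib.brotliCompressSync(buf, {
    params: {
      [zlib.constants.BROTLI_PARAM_QUALITY]: 9,
      [zlib.constants.BROTLI_PARAM_LGWIN]: 24, // ~16 MiB window
    },
  }).length;
}
```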

Another consideration is a separation of hot from cold storage, so that we can replicate hot data to the edge, but keep cold archival data only in two DCs and possibly on more density-optimized hardware. We can do this relatively easily by storing current revisions in a key-value bucket in addition to archival storage. Within this bucket, we can store each revision as an individually gzip compressed blob, ready to be streamed to the client without any de/re-compression overheads. The main gains in this scheme should come from the lack of extra reads and computation in Cassandra, as well as avoiding the need to compress data on the way out.
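
A sketch of the dual write under hypothetical bucket interfaces (currentBucket / archivalBucket are stand-ins, not existing RESTBase modules):

```typescript
import * as zlib from 'zlib';

interface KVBucket {
  put(key: string, value: Buffer, meta: Record<string, string>): Promise<void>;
}

async function storeRevision(
  currentBucket: KVBucket,
  archivalBucket: KVBucket,
  title: string,
  rev: number,
  tid: string,
  html: string
): Promise<void> {
  const gzBlob = zlib.gzipSync(html, { level: 9 });
  await Promise.all([
    // Archival: every revision, keyed by (title, rev); table-level
    // compression (deflate now, possibly Brotli later) applies here.
    archivalBucket.put(`${title}/${rev}`, Buffer.from(html), { tid }),
    // Hot path: only the latest revision per title, pre-compressed so reads
    // can stream it without de/re-compression.
    currentBucket.put(title, gzBlob, { rev: String(rev), tid }),
  ]);
}
```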

Schema sketch

title_revision_restriction table

Stores only restrictions on pages / revisions. This will be a very sparse table, with few or no entries for most pages. As a result, it will easily fit in page cache, making queries consistently fast.

  • index: (title, rev)
  • static column: page_deleted (tid)
  • non-key attributes: other per-revision restrictions
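
One possible CQL rendering of this sketch, shown through the DataStax Node.js driver; the table layout and column types are assumptions, not the actual RESTBase schema:

```typescript
import { Client } from 'cassandra-driver';

const client = new Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'restbase',
});

// Restrictions are sparse: one partition per title, one row per restricted
// revision. page_deleted is a static column, i.e. stored once per partition.
const createRestrictionTable = `
  CREATE TABLE IF NOT EXISTS title_revision_restriction (
    title text,
    rev int,
    page_deleted timeuuid static,
    restrictions map<text, text>,
    PRIMARY KEY ((title), rev)
  ) WITH CLUSTERING ORDER BY (rev DESC)
`;

async function createSchema(): Promise<void> {
  await client.connect();
  await client.execute(createRestrictionTable);
}
```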

Current revision key-value bucket

Stores the current revision for each title, with gzip -9 compression per blob and no Cassandra compression. Reads don't involve decompressing or re-compressing content. The etag header contains the revision and tid.
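
A possible table definition for this bucket (names and types are illustrative); Cassandra's own compression is disabled because the stored value is already a gzip blob, and the etag can be derived from the stored revision and tid, e.g.:

```typescript
const createCurrentBucket = `
  CREATE TABLE IF NOT EXISTS current_html (
    title text PRIMARY KEY,
    rev int,
    tid timeuuid,
    value blob
  ) WITH compression = { 'enabled': 'false' }
`;

// Hypothetical etag construction from the stored metadata.
function makeEtag(rev: number, tid: string): string {
  return `"${rev}/${tid}"`;
}
```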

Archival: key_rev_value bucket, possibly a chunked_key_rev_value bucket (T122028)

Stores the current *and* old versions of pages. Storage is density-optimized, possibly at the cost of some read latency.

Read path for /{title}

In parallel,

  • query title_revision_restriction for the title, limit 1
  • query current revision key-value bucket for the content

If a restriction was returned, check whether

a) the page is marked as deleted (static column), or
b) the restriction's revision matches the blob's revision.

If neither applies, or no restriction was returned, return the content.
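
A sketch of this read path under hypothetical query helpers (none of these function names exist in RESTBase); the restriction lookup only vetoes the response, so the common case costs nothing beyond the two parallel queries:

```typescript
interface CurrentBlob {
  rev: number;
  tid: string;
  value: Buffer; // gzip -9 compressed HTML
}

interface Restriction {
  rev: number;
  page_deleted?: string; // static column: tid of the deletion, if set
}

async function getCurrentHtml(
  title: string,
  queryRestrictions: (title: string) => Promise<Restriction | null>, // LIMIT 1
  getCurrentBlob: (title: string) => Promise<CurrentBlob>
): Promise<{ status: number; headers?: Record<string, string>; body?: Buffer }> {
  const [restriction, blob] = await Promise.all([
    queryRestrictions(title),
    getCurrentBlob(title),
  ]);

  if (restriction && (restriction.page_deleted || restriction.rev === blob.rev)) {
    // Deny; the exact error response for restricted / deleted content is not
    // specified in the task, 404 is just a stand-in.
    return { status: 404 };
  }

  return {
    status: 200,
    headers: {
      etag: `"${blob.rev}/${blob.tid}"`,
      'content-type': 'text/html',
      'content-encoding': 'gzip', // blob is stored pre-compressed
    },
    body: blob.value,
  };
}
```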

Read path for /{title}/{revision}

In parallel,

  • query title_revision_restriction for the title and revision, limit 1
  • query the key_rev_value bucket for the title / revision.

If a restriction was returned, check whether

a) the page is marked as deleted (static column), or
b) the restriction's revision matches the blob's revision.

If neither applies, or no restriction was returned, return the content.
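
The revisioned path is nearly identical; a compact sketch with the archival lookup keyed by (title, revision), again using hypothetical helpers:

```typescript
async function getRevisionHtml(
  title: string,
  rev: number,
  queryRestrictions: (
    title: string,
    rev: number
  ) => Promise<{ rev: number; page_deleted?: string } | null>, // LIMIT 1
  getArchivedBlob: (title: string, rev: number) => Promise<{ tid: string; value: Buffer }>
): Promise<{ status: number; body?: Buffer }> {
  const [restriction, blob] = await Promise.all([
    queryRestrictions(title, rev), // title_revision_restriction
    getArchivedBlob(title, rev),   // key_rev_value bucket
  ]);
  if (restriction && (restriction.page_deleted || restriction.rev === rev)) {
    return { status: 404 }; // deny; exact error handling is not specified in the task
  }
  return { status: 200, body: blob.value };
}
```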


Event Timeline

GWicke raised the priority of this task to Medium.
GWicke updated the task description.
GWicke added a project: RESTBase.
GWicke subscribed.

Based on the data we have so far from T122028, the archival use case looks well covered by key_rev_value storage along with brotli compression. Since we are already using key_rev_value for revision storage, this offers us a very gradual migration path, with the option of deferring brotli compression until the cluster is finally fully expanded & converted to a multi-instance setup.

The split between current versions & archival of older revisions will be used for several content types, including HTML and data-mw. I think it makes sense to encapsulate the general functionality in a hybrid bucket abstraction on top of key_value and key_rev_value.
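
One possible shape for such a hybrid bucket; the interfaces and method names are assumptions rather than existing RESTBase code:

```typescript
interface KeyValueBucket {
  get(key: string): Promise<Buffer>;
  put(key: string, value: Buffer): Promise<void>;
}

interface KeyRevValueBucket {
  get(key: string, rev: number): Promise<Buffer>;
  put(key: string, rev: number, value: Buffer): Promise<void>;
}

// Writes go to both underlying buckets; reads of the latest version hit the
// key_value bucket, revisioned reads fall through to key_rev_value.
class HybridBucket {
  constructor(
    private current: KeyValueBucket,
    private archive: KeyRevValueBucket
  ) {}

  async put(key: string, rev: number, value: Buffer): Promise<void> {
    await Promise.all([
      this.current.put(key, value),
      this.archive.put(key, rev, value),
    ]);
  }

  get(key: string, rev?: number): Promise<Buffer> {
    return rev === undefined ? this.current.get(key) : this.archive.get(key, rev);
  }
}
```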

Access checks and title normalization might actually be better handled by a middleware / request wrapper module. T127132 has some early ideas for how this might look.

This is also partly relevant for T94121: Understand and solve wide row issues for frequently edited and re-rendered pages. Splitting current revisions from older ones would avoid the need to read from very wide rows for current data. It is not a panacea & won't do anything for old data, but could be a significant contribution to an overall solution.

We have moved away from the idea of storing an archive of HTML for revisions.