RESTBase currently stores HTML and data-parsoid for one or more renders of each revision. Compression ratios are:
- 14.4% for data-parsoid
- 16.9% for html
At the current template update rate (~50/s), compressed storage is growing on the order of 40G/day. This means that the currently provisioned 2.5T per node, in a six-node cluster with three-way replication, will last only ~80 more days.
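The runway estimate can be reproduced with a rough back-of-the-envelope sketch. It assumes that the 40G/day figure counts unique data before replication and that 1T = 1000G; neither is stated above, and the function name and the used-space value are purely illustrative.

```python
def days_until_full(nodes, per_node_gb, replication, used_unique_gb, growth_gb_per_day):
    """Days until the cluster fills up, counting unique (pre-replication) data."""
    # Each unique byte is stored `replication` times across the cluster.
    unique_capacity_gb = nodes * per_node_gb / replication
    return (unique_capacity_gb - used_unique_gb) / growth_gb_per_day


# Six nodes, 2.5T each, three-way replication, 40G/day growth (figures from the text).
from_empty = days_until_full(6, 2500, 3, 0, 40)

# Hypothetical: ~1.8T of unique data already stored. This value is not from the
# text; it is simply the amount that would make the result match ~80 days.
with_existing_data = days_until_full(6, 2500, 3, 1800, 40)

print(from_empty, with_existing_data)
```

Under these assumptions an empty cluster would last about 125 days, so the ~80 day estimate is consistent with a bit over a third of the usable capacity already being occupied.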
There are several options to reduce our storage requirements:
- De-duplicate template updates in RESTBase using the [If-Unmodified-Since header](https://phabricator.wikimedia.org/T93775) sent by the RestbaseUpdateJob extension: T93777
- Only store a render if the content actually changed, which is not the case for many template updates. Unfortunately, Parsoid HTML contains fairly random `about` attributes whose values differ between otherwise identical renders, which makes equality testing difficult: T93715
- Only store one render per revision. This would likely cut the storage requirements to roughly 1/3 to 1/4 of the current level.
- Implement a large-window compression scheme like LZMA in Cassandra & bump up the block size: T93496
- Implement a storage hierarchy in which only recent revisions are stored in the first-level cluster while older ones are stored on an archive cluster. This could be interesting for geo-located RB installations as well, where we don't need lowest-latency access to historical revision data.
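As an illustration of the "only store a render if the content actually changed" option, here is a minimal sketch of an equality test that ignores the `about` ids. It assumes the ids have the form `about="#mwt<N>"` and that renumbering them in order of first appearance is enough to make identical renders compare equal; the function names are hypothetical and this is not how RESTBase or Parsoid currently do it. A real implementation would probably normalize on the DOM rather than with a regex.

```python
import hashlib
import re

# Assumed shape of the Parsoid about attribute, e.g. about="#mwt12".
ABOUT_RE = re.compile(r'about="#mwt\d+"')


def normalize_about(html):
    """Renumber about ids in order of first appearance."""
    seen = {}

    def renumber(match):
        original = match.group(0)
        if original not in seen:
            seen[original] = 'about="#mwt%d"' % (len(seen) + 1)
        return seen[original]

    return ABOUT_RE.sub(renumber, html)


def render_digest(html):
    """Digest of a render that is stable across about-id renumbering."""
    return hashlib.sha256(normalize_about(html).encode("utf-8")).hexdigest()


def should_store(new_html, stored_digest):
    """Store a new render only if its normalized content differs."""
    return render_digest(new_html) != stored_digest
```

For example, two renders that differ only in `about="#mwt3"` versus `about="#mwt7"` would produce the same digest, so the second render would be skipped, while any change to the surrounding markup or text would still trigger a store.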