RESTBase currently stores HTML and data-parsoid for one or more renders of each revision. The compression ratios achieved so far are:
- 14.4% for data-parsoid
- 16.9% for html
At the current template update rate (~50/s), compressed storage is growing on the order of 60G/day. At that rate, the provisioned 2.5T per node in a six-node cluster with three-way replication will last only ~60 more days unless we change something.
There are, however, several options to reduce our storage requirements:
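The capacity estimate above can be reproduced with a rough back-of-the-envelope sketch. It assumes decimal units (1T = 1000G) and that the 60G/day figure is measured before replication; neither is stated in this task. The function names are hypothetical, and the "implied usage" is only what the ~60-day estimate would correspond to under those assumptions, not a measured value.

```python
def usable_capacity_gb(nodes, per_node_tb, replication_factor):
    """Unique (pre-replication) capacity of the cluster in GB."""
    return nodes * per_node_tb * 1000 / replication_factor


def days_until_full(capacity_gb, used_gb, growth_gb_per_day):
    """Days left at a constant growth rate."""
    return (capacity_gb - used_gb) / growth_gb_per_day


# Six nodes, 2.5T each, three-way replication (figures from this task).
capacity_gb = usable_capacity_gb(nodes=6, per_node_tb=2.5, replication_factor=3)

# Usage that a ~60 day runway at 60G/day would imply (assumption, not measured).
implied_used_gb = capacity_gb - 60 * 60

print(capacity_gb)
print(days_until_full(capacity_gb, implied_used_gb, growth_gb_per_day=60))
```

Under the same assumptions, halving the growth rate (for example by storing fewer renders) would roughly double the remaining runway.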
- De-duplicate template updates in RESTBase using the If-Unmodified-Since header sent by the RestbaseUpdateJob extension: T93777
- T93779: Only store a new render of Parsoid HTML / data-parsoid revision if the content actually changed after a template update. Unfortunately, Parsoid HTML contains fairly random "about" attributes that make equality testing difficult: T93715
- Only store one render per revision. This would likely cut the storage requirements to roughly one half to one third of the current level.
- Implement a large-window compression scheme like LZMA in Cassandra and increase the block size: T93496
- T94196: Thin out old revision renders; and T94524: Configurable garbage collection / revision retention policy in table schemas: add 'interval' policy support
- T93790: Expand RESTBase cluster capacity
- Implement a storage hierarchy in which only recent revisions are stored in the first-level cluster, while older ones are stored on an archive cluster. This could also be interesting for geo-located RESTBase installations, where we don't need the lowest-latency access to historical revision data. See also Facebook pushing this one step further.
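
For the option of storing a new render only if the content actually changed (T93779), one possible way around the "about" attribute problem is to normalize those attributes before comparing. The following is a minimal sketch, not RESTBase or Parsoid code: it assumes the ids look like `#mwt<N>`, renames them in order of first appearance, and compares hashes of the normalized HTML. All function names are hypothetical.

```python
import hashlib
import re

# Assumption: Parsoid "about" ids have the form about="#mwt<something>".
ABOUT_RE = re.compile(r'about="(#mwt[^"]*)"')


def normalize_about(html):
    """Rename about ids to #n0, #n1, ... in order of first appearance."""
    mapping = {}

    def rename(match):
        original = match.group(1)
        if original not in mapping:
            mapping[original] = "#n%d" % len(mapping)
        return 'about="%s"' % mapping[original]

    return ABOUT_RE.sub(rename, html)


def content_hash(html):
    """SHA-256 of the HTML after about-id normalization."""
    return hashlib.sha256(normalize_about(html).encode("utf-8")).hexdigest()


def should_store(stored_html, new_html):
    """True if the new render differs in more than its about ids."""
    return content_hash(stored_html) != content_hash(new_html)
```

Because ids are renamed consistently, elements that shared an id before normalization still share one afterwards, so the grouping of template output is preserved. A regex is only adequate for a sketch; a real implementation would more likely normalize on the parsed DOM.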