RESTBase is currently storing HTML and data-parsoid for one or more renders of each revision. Compression ratios for this so far are:
- 14.4% for data-parsoid
- 16.9% for html
At the current template update rate (~50/s), compressed storage is growing by roughly 60G/day. At that rate, the provisioned 2.5T per node in a six-node cluster with three-way replication will last only ~60 more days unless we change something.
There are, however, several options to reduce our storage requirements:
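
The runway estimate can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: the function name `days_until_full` is hypothetical, and it assumes that 2.5T means 2500G, that the 60G/day figure is measured before replication, and that all provisioned space is usable (no headroom reserved for compaction).

```python
def days_until_full(node_capacity_gb, nodes, replication,
                    growth_gb_per_day, used_gb=0.0):
    """Days until the cluster is full at a constant growth rate.

    Assumptions (not from the task): growth_gb_per_day and used_gb are
    measured before replication, and every provisioned byte is usable.
    """
    # Each unique byte is stored `replication` times, so the cluster
    # holds raw_capacity / replication of unique data.
    usable_gb = node_capacity_gb * nodes / replication
    return (usable_gb - used_gb) / growth_gb_per_day


if __name__ == "__main__":
    # Six nodes, 2.5T each, three-way replication, 60G/day growth.
    print(round(days_until_full(2500, 6, 3, 60), 1))
```

Under these assumptions an empty cluster would last about 83 days, so the ~60 days quoted above would imply that some of the space is already in use.
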
- De-duplicate template updates in RESTBase using the [If-Unmodified-Since header](https://phabricator.wikimedia.org/T93775) sent by the RestbaseUpdateJob extension: T93777
- {T93779}. Unfortunately, Parsoid HTML contains fairly random `about` attribute values, which makes equality testing difficult: T93715
- Only store one render per revision. This would likely cut the storage requirements to roughly 1/3 to 1/2 of the current size.
- Implement a large-window compression scheme like LZMA in Cassandra & bump up the block size: T93496
- Implement a storage hierarchy in which only recent revisions are stored in the first-level cluster while older ones are stored on an archive cluster. This could be interesting for geo-located RB installations as well, where we don't need lowest-latency access to historical revision data.
- {T94196}
- {T93790}
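
To illustrate the equality-testing problem with `about` attributes mentioned in the list above, here is a minimal sketch of one possible approach: renumber the `about` ids in order of first appearance before hashing, so that two renders differing only in those ids compare equal. The `about="#mwtN"` pattern is an assumption about what the ids look like, the function names are hypothetical, and this is not the approach chosen in T93715.

```python
import hashlib
import re

# Assumption: ids look like about="#mwt12"; real output may differ.
ABOUT_RE = re.compile(r'about="#mwt\d+"')


def normalize_about(html):
    """Renumber about ids in order of first appearance."""
    mapping = {}

    def repl(match):
        old = match.group(0)
        if old not in mapping:
            # Same original id always maps to the same new id, so the
            # grouping of elements that share an id is preserved.
            mapping[old] = 'about="#mwt{}"'.format(len(mapping) + 1)
        return mapping[old]

    return ABOUT_RE.sub(repl, html)


def render_digest(html):
    """SHA-256 digest of the HTML after about-id normalization."""
    return hashlib.sha256(normalize_about(html).encode("utf-8")).hexdigest()


def is_duplicate_render(old_html, new_html):
    """True if two renders differ at most in their about ids."""
    return render_digest(old_html) == render_digest(new_html)
```

A new render whose digest matches the stored one could then be skipped instead of written. A regex is only a stand-in here; a real implementation would more likely normalize on the parsed DOM.
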