tl;dr: We project a need for **~34T of additional storage** in the next fiscal year. Given the number of nodes needed to provide that capacity, we don't expect request throughput to be an issue.
## HTML revisions, data-parsoid
Current storage growth is on the order of 60G/day across the cluster.
We currently retain one render per revision, but would like to move to retaining one render per 24 hours in order to keep a history of frequently changing templated pages like [[Main Page]] (use case: stable citations). Old revisions are rendered on demand, but we are not systematically traversing them to backfill the storage. We don't expect to push for storing the full HTML history (yet) in the coming fiscal year.
Assuming no major changes in compression ratios, retaining one render per 24 hours will increase the growth rate slightly. The current storage will last us slightly beyond the end of this fiscal year, but it would be good to leave some reserve. Assuming a growth rate of 80G/day, we'll need about **29T of storage for the next fiscal year for HTML revisions**.
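The 29T figure follows directly from the assumed growth rate. A minimal sketch of the arithmetic, assuming decimal units (1T = 1000G) and a 365-day fiscal year, neither of which is stated in the text; the function name is illustrative only:

```python
DAYS_PER_FISCAL_YEAR = 365  # assumption: a full 365-day year
GB_PER_TB = 1000            # assumption: decimal units (1T = 1000G)


def yearly_growth_tb(gb_per_day: float, days: int = DAYS_PER_FISCAL_YEAR) -> float:
    """Storage added over `days` at a constant daily growth rate, in T."""
    return gb_per_day * days / GB_PER_TB


current_rate_tb = yearly_growth_tb(60)    # current growth of ~60G/day
projected_rate_tb = yearly_growth_tb(80)  # assumed growth of 80G/day

print(f"at 60G/day: {current_rate_tb:.1f}T per year")
print(f"at 80G/day: {projected_rate_tb:.1f}T per year")
```

With binary units (1T = 1024G) the 80G/day case comes out slightly lower, at roughly 28.5T, which still rounds to the same ballpark.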
## Wikitext history
ExternalStore, the MySQL-based system used to store wikitext revisions, is showing its age. We'll eventually need an operationally simpler, more reliable and efficient system. Cassandra / RESTBase can provide wikitext revision storage the same way it does for HTML, with the same advantages around compression, replication, load distribution and fail-over. Furthermore, we can use this to speed up wikitext dumps without affecting the production latency.
For enwiki, all bzip2-compressed wikitext revisions take up about 90G of space. Assuming a ~50% worse compression ratio in Cassandra (likely lzma with smaller blocks) and three-way replication, enwiki will take up around 600G of storage. Extrapolating roughly to all wikis, we should be able to store **all wikitext revisions across all wikis with ~3T of storage**.
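The enwiki estimate can be sketched as a simple product. The text does not say exactly how "~50% worse compression ratio" translates into size, so the snippet below shows two possible readings as explicit assumptions: the ratio halves (compressed data doubles in size), or the compressed data grows by 50%. The function and parameter names are hypothetical.

```python
def replicated_size_gb(compressed_gb: float,
                       size_inflation: float,
                       replication_factor: int) -> float:
    """On-disk size after a worse compression ratio and N-way replication."""
    return compressed_gb * size_inflation * replication_factor


ENWIKI_BZIP2_GB = 90     # from the text: bzip2-compressed enwiki wikitext
REPLICATION_FACTOR = 3   # from the text: three-way replication

# Assumption A: the compression ratio halves, so the data takes twice the space.
ratio_halved_gb = replicated_size_gb(ENWIKI_BZIP2_GB, 2.0, REPLICATION_FACTOR)

# Assumption B: the compressed data is 50% larger.
size_plus_half_gb = replicated_size_gb(ENWIKI_BZIP2_GB, 1.5, REPLICATION_FACTOR)

print(f"ratio halved:    {ratio_halved_gb:.0f}G")   # 540G
print(f"size 50% larger: {size_plus_half_gb:.0f}G")  # 405G
```

Under these assumptions, the first reading lands closest to the "around 600G" quoted above once rounded up for headroom; the second would leave more margin.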
## Alternative HTML formats, miscellaneous
The app team is currently developing a service that massages HTML in a mobile-friendly way, and wraps that up with some metadata in a JSON response. For performance, we plan to pre-generate this on edit. For this, we only need to keep around current revisions, which means that we should be able to handle this and other, smaller applications with **~2T of storage**.
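As a cross-check, the three per-use-case estimates can be summed to reproduce the overall figure in the tl;dr. This is only a sketch of the bookkeeping; the dictionary keys mirror the section headings and the values are the rounded estimates from the text.

```python
# Rounded storage estimates for the next fiscal year, in T (from the text).
estimates_tb = {
    "HTML revisions, data-parsoid": 29,
    "Wikitext history": 3,
    "Alternative HTML formats, miscellaneous": 2,
}

total_tb = sum(estimates_tb.values())

for use_case, size_tb in estimates_tb.items():
    print(f"{use_case:<42} {size_tb:>3}T")
print(f"{'Total':<42} {total_tb:>3}T")
```

The total comes to 34T, matching the headline estimate.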
We do expect request volume to grow. However, given the fairly large margins we have right now, combined with the possibility of caching hot entry points thanks to the REST layout, we don't expect request throughput to be a limiting factor in cluster sizing.
# Multi-datacenter operation
Cassandra has mature support for DC-aware replication, which we plan to leverage by setting up a second cluster in codfw. We will replicate the full dataset, so we will need the same storage capacity in codfw.