tl;dr: We project a need for ~35T of additional storage in the next fiscal year, for a total of 53T per cluster. Given the number of nodes needed for this, we don't expect throughput to be an issue.
HTML revisions, data-parsoid
Current storage growth is on the order of 60G/day across the cluster.
We currently retain one render per revision, but would like to move to retaining one render per 24 hours so that we keep a history of frequently changing templated pages like [[Main Page]] (use case: stable citations). Old revisions are rendered on demand, but we are not systematically traversing them to fill the storage. We don't expect to push for storing the full HTML history (yet) in the coming fiscal year. See T97710 for an estimate of full-history HTML storage.
Assuming no major changes in compression ratios, retaining one render per 24 hours means that the growth rate will increase slightly. The current storage will last us slightly beyond the end of this fiscal year, but it would be good to leave some reserve. Assuming a growth rate of 80G/day, we'll need about 29T of additional storage for the next fiscal year for HTML revisions.
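The 29T figure follows directly from the assumed growth rate. A quick sanity check, as a sketch that assumes a 365-day fiscal year and decimal units (1T = 1000G), neither of which is stated above:

```python
# Back-of-the-envelope check of the HTML revision storage estimate.
# Assumptions (not stated in the text): 365-day fiscal year, 1T = 1000G.
DAYS_PER_YEAR = 365
GB_PER_TB = 1000


def yearly_growth_tb(gb_per_day: float) -> float:
    """Storage added over one year, in T, at a constant daily growth rate."""
    return gb_per_day * DAYS_PER_YEAR / GB_PER_TB


current = yearly_growth_tb(60)  # today's observed rate
assumed = yearly_growth_tb(80)  # rate assumed for the next fiscal year

print(f"60G/day -> {current:.1f}T/year")
print(f"80G/day -> {assumed:.1f}T/year")
```

At the current 60G/day the same calculation gives about 22T per year, so the 80G/day assumption adds roughly 7T of headroom.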
ExternalStore, the MySQL-based system used to store wikitext revisions, is showing its age. We'll eventually need an operationally simpler, more reliable and more efficient system. Cassandra / RESTBase can provide wikitext revision storage the same way it does for HTML, with the same advantages around compression, replication, load distribution and fail-over. Furthermore, we can use this to speed up wikitext dumps without affecting production latency.
For enwiki, all bzip2-compressed wikitext article revisions take up about 112G of space. Assuming a ~50% worse compression ratio in Cassandra (e.g. lzma with smaller blocks) and three-way replication, enwiki might take up around 750G of storage. Extrapolating roughly to all wikis, we should be able to store all wikitext revisions across all wikis with ~4T of storage.
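The enwiki estimate can be roughly reproduced from the inputs above. The sketch below is illustrative only: the text does not say how "~50% worse compression" should be read, so both readings (data 50% larger, or compression ratio halved) are assumptions.

```python
# Rough enwiki wikitext sizing. Both readings of "~50% worse compression"
# are assumptions; the text does not say which one is meant.
BZIP2_ENWIKI_GB = 112    # from the text
REPLICATION_FACTOR = 3   # three-way replication, from the text


def replicated_size_gb(compressed_gb: float, size_multiplier: float) -> float:
    """On-disk size after a compression penalty and replication."""
    return compressed_gb * size_multiplier * REPLICATION_FACTOR


size_plus_half = replicated_size_gb(BZIP2_ENWIKI_GB, 1.5)  # data 50% larger
ratio_halved = replicated_size_gb(BZIP2_ENWIKI_GB, 2.0)    # ratio halved

print(f"50% larger data: {size_plus_half:.0f}G")
print(f"halved ratio:    {ratio_halved:.0f}G")
```

Both readings land in the 500G to 700G range, the same order of magnitude as the ~750G quoted above; the quoted figure may include rounding or overhead that this sketch does not model.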
Alternative HTML formats, misc services
The app team is currently developing a service that massages HTML in a mobile-friendly way, and wraps that up with some metadata in a JSON response. For performance, we plan to pre-generate this on edit. For this, we only need to keep around current revisions, which means that we should be able to handle this and other, smaller applications with ~2T of storage.
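The three per-use-case estimates add up to the ~35T quoted in the tl;dr. A minimal tally, using only figures from this page (the "implied existing" value is derived here and not stated anywhere in the text):

```python
# Additional storage per cluster for the next fiscal year, in T (from the text).
additional_tb = {
    "HTML revisions, data-parsoid": 29,
    "wikitext revisions": 4,
    "alternative HTML formats, misc services": 2,
}
PROJECTED_TOTAL_TB = 53  # per-cluster total from the tl;dr

total_additional = sum(additional_tb.values())
# Derived, not stated in the text: capacity implied to exist already.
implied_existing = PROJECTED_TOTAL_TB - total_additional

print(f"additional: {total_additional}T")
print(f"implied existing capacity: {implied_existing}T")
```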
We do expect request volume to grow. However, given the fairly large margins we have right now, combined with the possibility of caching hot entry points thanks to the REST layout, we don't expect request throughput to be a limiting factor in cluster sizing.
Cassandra has mature support for DC-aware replication, which we plan to leverage by setting up a second cluster in codfw. We will replicate the full dataset, so we will need the same storage capacity in codfw.
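In Cassandra, DC-aware replication is configured per keyspace through `NetworkTopologyStrategy`. The sketch below only builds the CQL statement as a string; the keyspace name `example_keyspace` and the datacenter name `primary_dc` are hypothetical placeholders, and three replicas per datacenter is an assumption based on the three-way replication mentioned earlier.

```python
def keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement with per-datacenter replica counts."""
    dc_opts = ", ".join(f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items())
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {dc_opts}}};"
    )


# Hypothetical names: "example_keyspace" and "primary_dc" are placeholders.
# Assumption: three replicas in each datacenter.
cql = keyspace_cql("example_keyspace", {"primary_dc": 3, "codfw": 3})
print(cql)
```

With the full dataset replicated to both datacenters, each one holds its own complete set of replicas, which is why the storage requirement is the same in codfw.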