Eventually we'd like to store all article revisions as HTML. This task tries to establish a rough estimate of the storage that would require.
- After ~1 1/2 months of operation, all HTML revisions rendered across all Wikipedias take up about 6T of space in the cluster. This does not include non-article pages that were not edited or otherwise re-rendered in this period. Right after cluster setup, articles alone took up about 1.5T. All of these figures include three-way replication.
- On enwiki, there are ~36 million pages in total, of which ~5 million are articles (ns0).
- In the large wikis, the mean number of revisions is around 20 revisions / article (enwiki: 21, dewiki: 28, frwiki: 17). Smaller wikis tend to have fewer edits per article.
Assuming a mean of 20 revisions per page and three-way replication, this means that:
- we'd need at least 20 * 1.5T = ~30T of storage for articles (ns0) only, and
- we'd need at least 20 * 6T = ~120T of storage for all pages.
Assuming 20 revisions / page and a uniform page size is probably conservative, so there's a non-zero chance that these numbers hold up in practice.
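The arithmetic behind this estimate can be written out as a small script, which makes it easy to plug in a different revision count later. This is only a sketch: the constant and function names are made up for illustration, and it uses the same simplification as the estimate itself, namely that the current sizes represent roughly one stored revision per page and that every revision is about the same size.

```python
# Back-of-envelope storage estimate using the rough figures from this task.
# All names here are hypothetical; sizes are in terabytes and already
# include three-way replication.
CURRENT_ARTICLES_TB = 1.5    # articles (ns0) only, right after cluster setup
CURRENT_ALL_PAGES_TB = 6.0   # all rendered HTML after ~1 1/2 months
MEAN_REVISIONS = 20          # assumed mean number of revisions per page


def estimate_tb(current_tb, mean_revisions=MEAN_REVISIONS):
    """Scale the current footprint (assumed ~one revision per page)
    up to the full revision history."""
    return current_tb * mean_revisions


if __name__ == "__main__":
    print(f"articles (ns0) only: ~{estimate_tb(CURRENT_ARTICLES_TB):.0f}T")
    print(f"all pages:           ~{estimate_tb(CURRENT_ALL_PAGES_TB):.0f}T")
```

Passing a different `mean_revisions` (for example 28, the dewiki figure) shows how sensitive the total is to that one assumption.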