
Estimate storage capacity needed for storing all HTML revisions
Closed, Resolved; Public

Description

Eventually we'd like to store all article revisions as HTML. This task tries to establish a rough guess for the storage we'd need for this.

  • After ~1.5 months of operation, all HTML revisions rendered across all Wikipedias take up about 6T of space in the cluster. This does not include non-article pages that were neither edited nor otherwise re-rendered in this period. Right after cluster setup, articles alone took up about 1.5T. Both figures include three-way replication.
  • On enwiki, there are ~36 million pages in total, of which ~5 million are articles (ns0).
  • In the large wikis, the mean number of revisions is around 20 per article (enwiki: 21, dewiki: 28, frwiki: 17). Smaller wikis tend to have fewer edits per article.

Assuming a mean of 20 revisions per page and three-way replication, this means that:

  • we'd need at least 20 * 1.5T = ~30T of storage for articles (ns0) only, and
  • we'd need at least 20 * 6T = ~120T of storage for all pages.
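
The arithmetic above can be reproduced with a short Python sketch. It uses only the figures quoted in this task; the constant and function names are made up for illustration. Like the estimate itself, it assumes that each measured size corresponds to roughly one stored revision per page and that page size is uniform across revisions.

```python
# Back-of-the-envelope storage estimate using the figures from this task.
# All sizes are in terabytes and already include three-way replication.

REVISIONS_PER_PAGE = 20      # assumed mean (enwiki: 21, dewiki: 28, frwiki: 17)
ARTICLES_CURRENT_TB = 1.5    # ns0 articles only, right after cluster setup
ALL_PAGES_CURRENT_TB = 6.0   # all rendered HTML after ~1.5 months of operation


def estimate_total_tb(current_tb: float, revisions: int = REVISIONS_PER_PAGE) -> float:
    """Scale the size of about one revision per page to the full history.

    Assumes uniform page size across revisions, as the task does.
    """
    return current_tb * revisions


articles_tb = estimate_total_tb(ARTICLES_CURRENT_TB)
all_pages_tb = estimate_total_tb(ALL_PAGES_CURRENT_TB)

print(f"articles (ns0) only: ~{articles_tb:.0f}T")
print(f"all pages:           ~{all_pages_tb:.0f}T")
```

Changing `REVISIONS_PER_PAGE` shows how sensitive the result is to the mean revision count: the estimate scales linearly with it.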

The assumption of 20 revisions / page and uniform page size is probably conservative, so there's a non-zero chance that these numbers could actually work out in real life.

Event Timeline

GWicke raised the priority of this task from to Medium.
GWicke updated the task description.
GWicke added a project: RESTBase.
GWicke added subscribers: GWicke, mobrovac, Eevans.
GWicke claimed this task.
GWicke moved this task from In progress to Done on the RESTBase board.