
RESTBase capacity planning for 2015/16
Closed, Resolved · Public

Description

tl;dr: We project a need for ~35T of additional storage in the next fiscal year, for a total of 53T per cluster. Given the number of nodes needed for this, we don't expect throughput to be an issue.

Storage

HTML revisions, data-parsoid

Current storage growth is on the order of 60G/day across the cluster.

We currently retain one render per revision, but would like to move to retaining one render per 24 hours in order to keep a history of often-changing templated pages like [[Main Page]] (use case: stable citations). Old revisions are rendered on demand, but we are not systematically traversing them in order to fill the storage. We don't expect to push for storing the full HTML history (yet) in the coming fiscal year. See T97710 for an estimate of full-history HTML storage.

Assuming no major changes in compression ratios, this means that the growth rate will increase slightly. The current storage will last us slightly beyond the end of this fiscal year, but it would be good to leave some reserve. Assuming a growth rate of 80G/day, we'll need about 29T of additional storage for the next fiscal year for HTML revisions.
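The projection is simple arithmetic; a quick sketch using only the figures from this section (decimal units, as used throughout this thread):

```python
# Back-of-the-envelope check of the HTML storage projection above.
daily_growth_gb = 80     # assumed growth after the one-render-per-24h change
days_per_year = 365

additional_tb = daily_growth_gb * days_per_year / 1000
print(f"~{additional_tb:.0f} TB of additional HTML storage per year")  # ~29 TB
```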

Wikitext history

ExternalStore, the MySQL-based system used to store wikitext revisions, is showing its age. We'll eventually need an operationally simpler, more reliable and efficient system. Cassandra / RESTBase can provide wikitext revision storage the same way it does for HTML, with the same advantages around compression, replication, load distribution and fail-over. Furthermore, we can use this to speed up wikitext dumps without affecting the production latency.

For enwiki, all bzip2-compressed wikitext article revisions take up about 112G of space. Assuming a ~50% worse compression ratio in Cassandra (ex: lzma with smaller blocks) and three-way replication, enwiki might take up around 750G of storage. Extrapolating roughly to all wikis, we should be able to store all wikitext revisions across all wikis with ~4T of storage.
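As a sketch of that extrapolation: the first three figures below come from the paragraph above, while the all-wikis share is a hypothetical factor chosen only to illustrate how a rough "~4T" total could follow.

```python
# Wikitext storage extrapolation (figures from the text; the enwiki
# share of all wikitext is an assumption for illustration only).
enwiki_bzip2_gb = 112       # bzip2-compressed enwiki article revisions
compression_penalty = 1.5   # ~50% worse ratio in Cassandra
replication_factor = 3      # three-way replication

enwiki_gb = enwiki_bzip2_gb * compression_penalty * replication_factor
print(f"enwiki: ~{enwiki_gb:.0f} GB")    # ~504 GB; padded to ~750 GB in the text

enwiki_tb_padded = 0.75                  # the text's padded per-enwiki figure
enwiki_share = 0.2                       # assumption: enwiki ~ a fifth of all wikitext
print(f"all wikis: ~{enwiki_tb_padded / enwiki_share:.2f} TB")
```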

Alternative HTML formats, misc services

The app team is currently developing a service that massages HTML in a mobile-friendly way, and wraps that up with some metadata in a JSON response. For performance, we plan to pre-generate this on edit. For this, we only need to keep around current revisions, which means that we should be able to handle this and other, smaller applications with ~2T of storage.

Throughput

We do expect a growth in request volume, but given the fairly large margins we have right now combined with the possibility of caching hot entry points thanks to the REST layout we don't expect request throughput to be a limiting factor in cluster sizing.

Multi-datacenter operation

Cassandra has mature support for DC-aware replication, which we plan to leverage by setting up a second cluster in codfw. We will replicate the full dataset, so will need the same storage capacity in codfw.

Event Timeline

GWicke raised the priority of this task from to Medium.
GWicke updated the task description. (Show Details)
GWicke added a project: RESTBase.
GWicke added subscribers: Eevans, mobrovac, GWicke.

I'm assuming we're talking raw, post-replication space, thus 29T total, correct?

53 total, which means 35 more in eqiad. And yes, this is raw storage space on hw nodes, taking into account three-way replication, compression etc.

@GWicke: Given that we're currently planning a large-scale expansion of the RESTBase/Cassandra clusters at significant cost, based entirely on these numbers, I'd like to challenge them a bit.

The bulk of the necessary storage seems to be for HTML revision storage, listed as 29 TB (times 2) by the end of this fiscal year. This is massively more than previously required in the parsoidcache solution, consisting of just two Varnish cache servers. My understanding is that with RESTbase/Cassandra, the attempt is to obtain a 100% hit rate, by storing *all* revisions, instead of just a hot set. Given the large impact on used storage space and necessary funds for this, my first question is: where can I find the rationale for this decision? Why is this necessary and how does it improve on just caching the hot set?

Working on the assumption that this is indeed needed, I'd still like to understand the given numbers a bit better:

The current daily growth rate seems to be around 50 GB/day[1]. Could you elaborate a bit more on why the current growth rate will "increase slightly", and where the figure of 80 GB/day comes from? (Stable citations? Where can I find out more about this, and is it on the roadmap this year?)

The move towards RESTBase/Cassandra for wikitext storage (instead of External Storage) seems all but certain at this point. If that wouldn't happen this year, that would remove an additional 4 TB per cluster (8 TB total) from these figures.

Therefore, assuming just the current growth rate of 50 GB/day and no ExternalStore replacement, the additional storage needed for HTML revision storage + misc services could be as low as ~20 TB, or even less with reduced caching.

[1] http://tinyurl.com/q4429aq

Mark, this ticket is perhaps not the best place and time to discuss our high-level roadmap, or revisit the budget discussions. But, let me try to give you some background.

(HTML revision storage): my first question is: where can I find the rationale for this decision? Why is this necessary and how does it improve on just caching the hot set?

Here are some of the benefits of (eventually) storing HTML of old revisions:

  • Performance: Stored HTML can be reliably retrieved with latencies of below 100ms, while parsing a page from scratch can take 30+ seconds, and consume significant CPU and IO resources. From a user perspective, this speeds up VisualEditor edits of old revisions to a level similar to the wikitext editor. Views of old revisions (from citations, for example) will be reliably fast. Diff views will also be sped up, as they include a render of the full page at the given revision, which is not currently cached. By avoiding the need to re-render old pages from scratch, storage will slightly reduce the load on caches, MediaWiki app servers and the Parsoid cluster.
  • Reduction of DOS risks: As re-rendering old revisions is so expensive, spidering the history of large articles makes for a relatively easy DOS target.
  • Enable analytics: Vertical analytics about edit behavior are currently very expensive to generate, as each revision needs to be re-parsed from scratch. The complexities of wikitext parsing in such studies limit access to our information for the wider research community, and thus our ability to leverage their talents.
  • HTML dumps: Dumps are important to enable re-using our content widely, both for research and user-facing applications such as offline readers. By providing our information in a standard format like HTML, we live up to our promise of making the world's information available to everyone. While some use cases only require current revisions, others (like analytics) do require the full history.
  • Enable new features: A full history in HTML with stable identifiers lets us build up metadata about revisions. For example, we can build up blame / praise maps identifying authorship of parts of a revision. We can attach additional information that would not be appropriate to include in wikitext, such as user comments or ratings.
  • Stable citations and links: We encourage people to corroborate information with citations, yet don't provide a way to reliably cite a specific version of our content. As an example, following a citation to [[Main Page]] will normally return completely different content from what the person citing it originally saw. Previous attempts to solve this problem were prohibitively expensive and complex, while storing expanded HTML does not suffer from those issues.

This is not intended to be an exhaustive list, but I do hope that it gives you an idea.

Note that we do not plan to store the full HTML history this year, and that we are also considering separating current data from history storage in the future (see T93751) to more economically support geo-distribution. This would enable us to only keep the history in two datacenters, while pushing current revisions closer to the edge.

Could you elaborate a bit more on why the current growth rate will "increase slightly"

Storing one render per day will use more storage than storing one render per revision. We are also storing more per-revision metadata to support section editing, and plan to add more metadata about citations and other bits to support editing.

We intend to start storing wikitext as soon as we have spare capacity to do this. That will let us test things thoroughly while already using it to speed up VisualEditor edits, Aaron's research and possibly dumps. We should be in a pretty good position to make a decision on phasing out ExternalStore towards the end of this year. Doing so will save significant capital and operational expenses, and remove some single points of failure from our storage infrastructure.

Mark, this ticket is perhaps not the best place and time to discuss our high-level roadmap, or revisit the budget discussions. But, let me try to give you some background.

(HTML revision storage): my first question is: where can I find the rationale for this decision? Why is this necessary and how does it improve on just caching the hot set?

Here are some of the benefits of (eventually) storing HTML of old revisions:

  • Performance: Stored HTML can be reliably retrieved with latencies of below 100ms, while parsing a page from scratch can take 30+ seconds, and consume significant CPU and IO resources. From a user perspective, this speeds up VisualEditor edits of old revisions to a level similar to the wikitext editor.

When I asked you for metrics to support this view, back in May, you indicated that your tests "seem to indicate that in most cases the HTML retrieval was already fast enough before", and conceded that the RESTbase rollout "didn't really make a difference to overall load time".

  • Reduction of DOS risks: As re-rendering old revisions is so expensive, spidering the history of large articles makes for a relatively easy DOS target.

Re-rendering old revisions is expensive as a consequence of design decisions you had made — an outcome @tstarling had anticipated and warned about. Now you are citing this outcome to authorize a new round of design decisions. It is a little hard to swallow.

  • Enable analytics: Vertical analytics about edit behavior are currently very expensive to generate, as each revision needs to be re-parsed from scratch. The complexities of wikitext parsing in such studies limit access to our information for the wider research community, and thus our ability to leverage their talents.

This is fantastically out-of-scope.

  • HTML dumps: Dumps are important to enable re-using our content widely, both for research and user-facing applications such as offline readers. By providing our information in a standard format like HTML, we live up to our promise of making the world's information available to everyone. While some use cases only require current revisions, others (like analytics) do require the full history.

So is this.

  • Enable new features: A full history in HTML with stable identifiers lets us build up metadata about revisions. For example, we can build up blame / praise maps identifying authorship of parts of a revision. We can attach additional information that would not be appropriate to include in wikitext, such as user comments or ratings.

And this.

  • Stable citations and links: We encourage people to corroborate information with citations, yet don't provide a way to reliably cite a specific version of our content. As an example, following a citation to [[Main Page]] will normally return completely different content from what the person citing it originally saw. Previous attempts to solve this problem were prohibitively expensive and complex, while storing expanded HTML does not suffer from those issues.

Storing expanded HTML is not prohibitively expensive?

(VE load times)

When I asked you for metrics to support this view, back in May, you indicated that your tests "seem to indicate that in most cases the HTML retrieval was already fast enough before", and conceded that the RESTbase rollout "didn't really make a difference to overall load time".

As you know (and can check in grafana), RESTBase made a significant difference to HTML load times by eliminating cache misses and slimming down the HTML. The second step of removing proxying through the PHP API further reduced latencies as measured at the server by several hundred ms in the mean, but at this point the API response was already fast enough to not stall VE initialization in most cases. If you look at the VE initialization timeline, you see that the API responses typically arrive during the first heavy block of initialization, and well before the HTML -> linear model -> CE transformation. When we last chatted about this in May, you yourself said that you had hoped to make serious dents in the client-side CPU usage during the VE performance drive, but didn't have sufficient time for it. I'm optimistic that VE will be sped up further in the future, at which point the faster API response times will make a further difference to user-observed load times.

We need both fast and scalable APIs *and* client-side improvements, and it makes sense to work on both in parallel.

Re-rendering old revisions is expensive as a consequence of design decisions you had made

Rendering things from scratch is expensive in both the PHP parser and Parsoid, and so far nobody has discovered magic solutions that make this go away. If you want to make the case for rewriting Parsoid in C++ now, then please do so. From my perspective, it was definitely the right decision not to rewrite at the time given two engineers and a three-month timeline. We might not have VisualEditor otherwise.

Try ab 'https://en.wikipedia.org/w/index.php?title=Barack_Obama&oldid=672700695'. Even with repeated hits to the same revision I'm getting a load time of about 5.3s from tin, and there are significantly more expensive pages. These pages are also not cached in Varnish, so any reference to a specific revision is getting poor performance. Then switching to VisualEditor for this old revision is going to take even longer, as Parsoid needs to run fairly complex algorithms to enable round-tripping.

Now, compare this to ab https://en.wikipedia.org/api/rest_v1/page/html/Barack_Obama/672700695. From tin, this takes about 90ms, with very few server-side resources spent on generating this page.
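For anyone wanting to reproduce this, a minimal harness along the lines of the ab comparison could look like the following. This is a sketch, not what was actually run; the quoted numbers (~5.3s vs ~90ms) were measured from tin, and results will depend on where you run it.

```python
# Minimal latency harness mirroring the `ab` comparison above.
# Endpoint URLs from the comment:
#   https://en.wikipedia.org/w/index.php?title=Barack_Obama&oldid=672700695
#   https://en.wikipedia.org/api/rest_v1/page/html/Barack_Obama/672700695
import time
import urllib.request

def mean_latency_ms(url, n=3):
    """Time n sequential GETs and return the mean latency in milliseconds."""
    total = 0.0
    for _ in range(n):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            resp.read()
        total += time.perf_counter() - start
    return 1000 * total / n

if __name__ == "__main__":
    # Smoke test on a data: URL (no network); substitute the endpoint
    # URLs above when running from a host with access to them.
    print(f"~{mean_latency_ms('data:text/plain,hello', n=2):.2f} ms")
```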

About scoping: Teams have different scopes, so something you consider out of scope for Performance might well be in scope for Services. A part of the mission of the services team is to enable innovation both inside & outside the foundation by providing good APIs and access to content in standard formats. While full-history HTML dumps aren't our top priority by any stretch (and not something we have spent time on), I do consider enabling them potentially in scope of our team mission.

Note that we do not plan to store the full HTML history this year, and that we are also considering separating current data from history storage in the future (see T93751) to more economically support geo-distribution. This would enable us to only keep the history in two datacenters, while pushing current revisions closer to the edge.

When did that become a goal? Where have you discussed this RESTBase-at-the-edge strategy, and who has agreed to it or even planned for it in our current and future PoPs procurements? I'm quite surprised to see this mentioned here -- I've only heard about this once, when you mentioned it completely in passing in your presentation at the Lyon hackathon, and did not think it was something that was being worked on.

We should be in a pretty good position to make a decision on phasing out ExternalStore towards the end of this year. Doing so will save significant capital and operational expenses, and remove some single points of failure from our storage infrastructure.

As much as I would love deprecating a piece of infrastructure for a change and reducing our workload instead of adding to it, it's hard to agree to this specific justification. ExternalStore costs very little in both capex and opex (to the point we have forgotten about it and haven't even been running the compaction maintenance jobs), is storing data very efficiently, running on very mature and stable tech and has been running incredibly reliably for a really long time. All of the above is in direct contrast with RESTBase -- RESTBase is actually increasing our capex, opex and risks considerably (whether that's worth it or not for its benefits, that's a different discussion).

When did that become a goal? Where have you discussed this RESTBase-at-the-edge strategy, and who has agreed to it or even planned for it in our current and future PoPs procurements? I'm quite surprised to see this mentioned here -- I've only heard about this once, when you mentioned it completely in passing in your presentation at the Lyon hackathon, and did not think it was something that was being worked on.

It is not much more than an idea and technical possibility at this point, and not something we are working on or have as a priority this quarter.

ExternalStore

The ExternalStore use case is tracked in T100705. If you think that it's not worth bothering, then please make the case for it there.

I would also like to remind all of you to keep some perspective on the costs. We are talking about maybe $15k for HTML storage per DC & year, and we have been pushing rather hard for cost effective solutions.

Mark, this ticket is perhaps not the best place and time to discuss our high-level roadmap, or revisit the budget discussions. But, let me try to give you some background.

I agree that this ticket is not the place to discuss high level roadmaps, please feel free to link to more relevant ones. However, it is appropriate for the budget discussion, as we're now making purchasing decisions heavily based on this data.

(HTML revision storage): my first question is: where can I find the rationale for this decision? Why is this necessary and how does it improve on just caching the hot set?

Here are some of the benefits of (eventually) storing HTML of old revisions:

...but this is about storing the full history, which we're not planning to do this year and which is not included in the figures above, correct?

Could you please be explicit about what we are doing with expanded HTML storage now/this year? Is it storage of ALL new revisions, plus the (planned) one render per 24 hours? Is the plan to store everything 100%, or will some be cached, which I believe RESTbase now supports?

Note that we do not plan to store the full HTML history this year, and that we are also considering separating current data from history storage in the future (see T93751) to more economically support geo-distribution. This would enable us to only keep the history in two datacenters, while pushing current revisions closer to the edge.

We've also been asked to prepare a capital plan for the next 3-5 years, so if the plan is to start storing the revision history as well beyond this fiscal year, it sounds like that may have a big impact in the upcoming fiscal years. Could you please prepare some estimates for the total storage needed for that as well?

Could you elaborate a bit more on why the current growth rate will "increase slightly"

Storing one render per day will use more storage than storing one render per revision. We are also storing more per-revision metadata to support section editing, and plan to add more metadata about citations and other bits to support editing.

Right. So given that we're currently seeing less than 50 GB/day according to the graphs, what is the figure of 80 GB/day based on?

I would also like to remind all of you to keep some perspective on the costs. We are talking about maybe $15k for HTML storage per DC & year, and we have been pushing rather hard for cost effective solutions.

I don't understand how you arrived at this number. In this task, you're claiming we'll need 29 TB of additional raw storage per year, per cluster, for HTML storage (without history) alone. This amounts to 58 TB of additional storage for the two main data centers (plus the initial capacity match buildout of codfw). With the maximum of 600 GB raw data per Cassandra instance that you proposed, this implies almost 100 additional Cassandra instances alone.

Even with the most cost-effective configuration of 5 instances per system currently under discussion (which we're not yet comfortable with and consider quite risky), we're already looking at ~20 additional systems in the ballpark of $125k total - just for the additional HTML storage without complete history, this year. And that's really pushing it; more realistic/stable solutions end up costing more. Where does the discrepancy with $30k per year come from?
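For reference, the instance and system arithmetic works out as follows (all inputs from this comment; decimal units, and rounding up as one would for provisioning):

```python
import math

# Instance and system counts implied by the figures in this comment.
additional_tb = 29 * 2          # 29 TB per cluster, two main datacenters
gb_per_instance = 600           # proposed raw-data cap per Cassandra instance
instances_per_system = 5        # densest configuration under discussion

instances = math.ceil(additional_tb * 1000 / gb_per_instance)
systems = math.ceil(instances / instances_per_system)
print(f"{instances} instances across {systems} systems")  # 97 instances, 20 systems
```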

You are indeed pushing rather hard for cost effective solutions based on the capacity numbers calculated here, which is one of the reasons I'm asking for more information. :)

The original analysis in the task description was done in April for budget planning purposes, using then available data. Since then, we have optimized a lot of things (see T93751 for some of the optimizations), and collected a lot more information. So, let me re-do the analysis based on the latest information:

Through June (the last longer period without deletions or cluster expansions) the overall storage growth has been around 1T/month or about 32G / day, despite now tracking changes to all projects:

(attached image: graph of cluster storage growth)

This includes the history of current renders, and some amount of requests for historical revisions that users made, but no render of the full history. One render per revision was retained. With more metadata and one render retained at most per day, this might increase to perhaps 15T (conservatively), so almost half of what we got based on the March-April data.

Misc services like mobile apps, graphoid, mathoid etc might ultimately use up a bit more storage, despite only storing current revisions for those. I'd up my estimate for those to 8T total. This space need should remain fairly constant.

For wikitext, I fear that my estimate might actually end up being a bit optimistic (compression ratios + overheads might be worse than 50% of bzip2), so ~6T might be a safer number. The annual growth for wikitext should be moderate (< 1T/year).

This adds up to about 29T, so we should have some wiggle room in the budget. There are still many variables that can affect these numbers in both directions, but we can take a wait & see approach and procure hardware as needed.
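Summed up, with each input taken from the revised estimate above:

```python
# The revised per-cluster estimate, summed (all figures from this comment).
html_tb = 15       # one render retained per day plus extra metadata, conservative
misc_tb = 8        # mobile apps, graphoid, mathoid etc., current revisions only
wikitext_tb = 6    # all-wiki wikitext revisions, padded for compression overhead

print(f"total: ~{html_tb + misc_tb + wikitext_tb} TB per cluster")  # ~29 TB
```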

There are several options we can explore to bring down the cost of old revision storage, and thus annual growth, even further, including a lower replication factor (like one copy per DC) for old revisions, as discussed under "storage hierarchies" in T93751. Let's see how that goes, and make decisions based on the data and priorities we have then.

All that said, I am not convinced that we should go for significantly less cost effective hardware just because the budget allows it. If we spend more, then I think there should be a clear technical justification for doing so.

The original analysis in the task description was done in April for budget planning purposes, using then available data. Since then, we have optimized a lot of things (see T93751 for some of the optimizations), and collected a lot more information. So, let me re-do the analysis based on the latest information:

Through June (the last longer period without deletions or cluster expansions) the overall storage growth has been around 1T/month or about 32G / day, despite now tracking changes to all projects:

(attached image: graph of cluster storage growth)

This includes the history of current renders, and some amount of requests for historical revisions that users made, but no render of the full history. One render per revision was retained. With more metadata and one render retained at most per day, this might increase to perhaps 15T (conservatively), so almost half of what we got based on the March-April data.

Misc services like mobile apps, graphoid, mathoid etc might ultimately use up a bit more storage, despite only storing current revisions for those. I'd up my estimate for those to 8T total. This space need should remain fairly constant.

For wikitext, I fear that my estimate might actually end up being a bit optimistic (compression ratios + overheads might be worse than 50% of bzip2), so ~6T might be a safer number. The annual growth for wikitext should be moderate (< 1T/year).

I think we should also compare the original estimate with e.g. ExternalStore usage, to double-check this figure.

I think we should also compare the original estimate with e.g. ExternalStore usage, to double-check this figure.

Yes, that would be interesting. All these numbers are with three-way replication, so the prediction is for all wikitext revisions to fit in 2T without replication, using deflate compression. I would expect the same content to use a bit more space in ES, because of its predominant use of per-revision compression and higher MySQL overheads.

https://wikitech.wikimedia.org/wiki/External_storage#Servers suggests that each of the ES servers has 12T space on a RAID10, and

https://graphite.wikimedia.org/render/?width=588&height=310&_salt=1439569155.787&target=servers.es10*.diskspace._srv.byte_used

shows a usage of about 8.5T on most ES nodes.

So, that's a lot more space than the 2T unreplicated projected for Cassandra (based on dump sizes), even assuming that 8.5T is the full ExternalStore wikitext dataset. I guess the most reliable way to find out is to import one dump, gauge the compression ratios vs. dump sizes, and then update the extrapolation to all wikis.
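Before importing a full dump, the codec side of that comparison can at least be sketched with the standard library. The sample below is a stand-in; real dump chunks are needed for meaningful ratios.

```python
# Compare candidate codecs on a text sample (stand-in data only;
# ratios on real wikitext dump chunks will differ).
import bz2
import lzma
import zlib

sample = b"'''Example''' wikitext with [[links]] and {{templates}}. " * 2048

for name, compress in (("bzip2", bz2.compress),
                       ("deflate", zlib.compress),
                       ("lzma", lzma.compress)):
    ratio = len(compress(sample)) / len(sample)
    print(f"{name:8s} ratio: {ratio:.4f}")
```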

Some more info on ExternalStore:

  • current revisions seem to be stored completely uncompressed (!)
  • total usage is 8.5T on es1 plus 2.9T each on es2 and es3, for a total of 15.3T
  • for dewiki, revision.rev_len add up to 2.8T, plus 81G archived revisions
  • for enwiki, ar_len (deleted pages) adds up to 283G; rev_len times out in labs

@GWicke I had a chat with Tim to fully understand the ExS storage model. So, some of the things I told you were inaccurate (because at the time I didn't understand the model):

  • Current revisions, at least the main content (page history), are stored compressed. However, there are two types of compression: on-the-fly per-row compression (saving perhaps 10-20%?), and offline compression (Tim's scripts). The offline variant uses a very clever compacting algorithm that achieves 10x or better compression by diffing across the page history, with no measurable impact on read performance (memcached still holds whole rows).
  • So, the total now is 15.3T (without duplicates and RAID mirrors), some of it with gzip-level compression and some of it (around half the total?) with a huge compression ratio after lengthy offline processing.
  • There is a growth of around 50GB/week (row-compressed)

@jcrespo, thanks for the clarification.

https://wikitech.wikimedia.org/wiki/Text_storage_data has a distribution of storage types as of 2011. It shows that the majority of revisions was individually gzipped at that point. As we have been using that setting since then, this type should dominate now.

There is a growth of around 50GB/week (row-compressed)

This translates to 2.6T per year, which is more than the < 1T I predicted for Cassandra storage assuming a better compression ratio. The exact value will depend a lot on the compression algorithms and block sizes. 1T might be tight with deflate, but should be doable with lzma.
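In annual terms (both figures from the comments above, decimal units):

```python
# ExternalStore's observed growth vs. the Cassandra prediction.
es_gb_per_week = 50
es_tb_per_year = es_gb_per_week * 52 / 1000
print(f"ES: ~{es_tb_per_year:.1f} TB/year vs. < 1 TB/year predicted")  # ~2.6 TB/year
```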

The original analysis in the task description was done in April for budget planning purposes, using then available data. Since then, we have optimized a lot of things (see T93751 for some of the optimizations), and collected a lot more information. So, let me re-do the analysis based on the latest information:

[snip]

There are several options we can explore to bring down the cost of old revision storage, and thus annual growth, even further, including a lower replication factor (like one copy per DC) for old revisions, as discussed under "storage hierarchies" in T93751. Let's see how that goes, and make decisions based on the data and priorities we have then.

All that said, I am not convinced that we should go for significantly less cost effective hardware just because the budget allows it. If we spend more, then I think there should be a clear technical justification for doing so.

I agree. But that also means we shouldn't be incurring extra risks (5 instances on a single node) in order to fit a supposedly very constrained budget, if it turns out now that we have some wiggle room after all. We've said from the start that we should start testing with 2. What eventually constitutes "cost effective hardware" depends entirely on the outcome of that performance/stability test, later. I'm fine with buying hardware allowing for more instances, ideally in an upgradeable way, but that can't then be used as justification on the basis of these budget figures.

But that also means we shouldn't be incurring extra risks

If there are clear technical reasons that make three instances significantly less risky than five, then I agree that it might be worth paying more for smaller hardware. However, so far I am not aware of any data or clear technical facts to support this assertion.

Upgrading later seems to be expensive in terms of our time and might not even be possible with leasing. The worst case with going with larger hardware in the first round would be to have more RAM than we need for three instances, and costs that aren't that far from the best case with smaller hardware.

GWicke claimed this task.