
Capacity planning for Elasticsearch
Closed, Resolved · Public

Description

We need to do some capacity planning for Elasticsearch by replaying production traffic at a higher velocity in our Dallas DC (codfw). The findings should inform our hardware asks and be documented on Wikitech (let's please document how to do the testing and the results separately).

https://wikitech.wikimedia.org/wiki/Search/ElasticSearch/LoadTesting

Event Timeline

While not fully documented, the results of previous load-testing rounds and the methodology used are described on the Wikitech page linked above.

A few variations might be useful to test (using gor middleware to modify the queries); these would mostly inform our options for reducing server load if that becomes necessary during incident response (a sketch of such a middleware follows the list):

  • Reduce LTR rescore window
  • Remove the LTR rescore
  • Reduce popularity rescore window
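
For illustration, here is a minimal sketch of what such a middleware could look like, in Python. It assumes GoReplay's middleware convention as I understand it (hex-encoded messages on stdin, echoed on stdout, with a one-line gor header in front of the raw HTTP request) and a generic Elasticsearch rescore clause; the request shape actually sent by CirrusSearch, the cap value, and the Content-Length handling are placeholders, not what the production scripts do.

```python
#!/usr/bin/env python3
"""Hypothetical gor middleware: cap the rescore window of replayed queries.

Assumptions (not taken from this task): gor passes hex-encoded messages on
stdin, one per line, and replays whatever the middleware echoes on stdout;
each decoded message is a one-line gor header followed by the raw HTTP
request; the body is a standard Elasticsearch search request with an
optional "rescore" list. Anything unexpected is passed through untouched.
"""
import binascii
import json
import sys

MAX_WINDOW = 512  # illustrative cap, not a recommended value


def cap_rescore_window(http_request: bytes) -> bytes:
    try:
        head, sep, body = http_request.partition(b"\r\n\r\n")
        query = json.loads(body)
        for rescore in query.get("rescore", []):
            if rescore.get("window_size", 0) > MAX_WINDOW:
                rescore["window_size"] = MAX_WINDOW
        new_body = json.dumps(query).encode()
        # A real middleware would also have to rewrite Content-Length here.
        return head + sep + new_body
    except Exception:
        return http_request


for line in sys.stdin:
    message = binascii.unhexlify(line.strip())
    header, sep, http = message.partition(b"\n")
    if header.startswith(b"1"):  # type 1 = request in gor's framing
        http = cap_rescore_window(http)
    sys.stdout.write(binascii.hexlify(header + sep + http).decode() + "\n")
    sys.stdout.flush()
```

(gor would invoke something like this via its --middleware option; I have not tested this against the actual captured traffic.)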

Mentioned in SAL (#wikimedia-operations) [2019-05-01T12:53:56Z] <gehel> start recording 30 minutes of traffic from elasticsearch eqiad - T221121

Mentioned in SAL (#wikimedia-operations) [2019-05-02T16:42:30Z] <gehel> replaying 30 minutes of eqiad search traffic on codfw - T221121

Mentioned in SAL (#wikimedia-operations) [2019-05-03T12:26:07Z] <gehel> replaying 30 minutes of eqiad search traffic on codfw - T221121

Executive summary: we should have enough capacity for next year.

Hypothesis to validate: the Elasticsearch clusters have enough capacity to support the load for the next year. As an estimate, we cap the expected increase at 1.5x our current load.

Thirty minutes of traffic were captured. Using the completion suggester as an indicator, the captured traffic is ~3/4 of the daily traffic peak. So, roughly, replaying at 150% is ~our daily traffic peak and replaying at 200% is ~1.5x our daily peak.
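
To spell out the arithmetic behind those rough figures (nothing here beyond the ~3/4 estimate above):

```python
captured_vs_peak = 3 / 4               # captured traffic ~ 3/4 of the daily peak

speed_for_peak = 1 / captured_vs_peak  # ~1.33x replay ~= the daily peak
load_at_150 = 1.5 * captured_vs_peak   # ~1.13x the daily peak, i.e. roughly peak
load_at_200 = 2.0 * captured_vs_peak   # exactly 1.5x the daily peak

print(speed_for_peak, load_at_150, load_at_200)  # 1.333... 1.125 1.5
```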

Two sets of tests were performed:

Test 1: loop the input file (--input-file-loop) and run 30 minutes of load, replayed at 100%, 150% and 200% speed.
Test 2: let the input file run to completion once, varying the replay speed (100%, 150% and 200%).
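
For the record, the shape of the replay invocations was roughly as below. This is a sketch only: it relies on GoReplay flags as documented upstream (--input-file with a "|N%" speed modifier, --input-file-loop, --output-http), and the capture file name, the target URL and the use of timeout to bound test 1 are placeholders; the actual commands live in the scripts linked at the end of this task.

```python
"""Sketch of the two test setups; file name, target and timeout are placeholders."""
import subprocess

CAPTURE = "es-eqiad-30min.gor"                   # placeholder capture file
TARGET = "http://search.svc.codfw.example:9200"  # placeholder replay target


def replay(speed_pct, loop=False, wall_time=None):
    cmd = ["gor", "--input-file", f"{CAPTURE}|{speed_pct}%", "--output-http", TARGET]
    if loop:
        cmd.append("--input-file-loop")          # test 1: keep looping the capture
    if wall_time:
        cmd = ["timeout", wall_time] + cmd       # bound the looped run, e.g. "30m"
    subprocess.run(cmd)                          # a timed-out run exits non-zero, so no check


# Test 1: 30 minutes of sustained load at each replay speed.
for pct in (100, 150, 200):
    replay(pct, loop=True, wall_time="30m")

# Test 2: a single pass over the capture at each replay speed.
for pct in (100, 150, 200):
    replay(pct)
```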

Observations:

  • The runs of test 2 took ~31, ~22 and ~17 minutes respectively, which indicates that there were no throughput bottlenecks.
  • Per-node percentiles show that response times were mostly constant across the various tests.
  • CPU usage (as seen on a single node) climbed from ~50% for the first test to ~80% on the last one.
  • Disk utilization (again, on a single host) went from ~12% to ~17% to ~23%. The effect of cache warm-up can clearly be seen; a longer test might show better performance.
  • Cluster-wide aggregates show similar metrics.

Conclusions:

This is a very simplified test, with only a few application-level metrics (our usual metrics are collected on the MediaWiki side, which was not involved in this test). That being said, it looks like an overall increase of search traffic to 1.5x current levels over the next year should not cause issues. We will already get some capacity increase as we replace a number of older servers with new ones with higher specs. It looks like we're all good for next year.

Excellent, can we document these findings on Wikitech so they are easy to find?

Did we also take into account codfw being smaller? If we recorded on 35 nodes in eqiad but only replayed from 30 nodes in codfw, then we are replaying 86% of the actual traffic. That requires a playback speed of ~1.17x to get codfw replaying at the original request rates. 4/3 of that, to get to peak daily traffic, requires replay at ~1.56x. 150% of that, to estimate organic growth, brings us to ~2.33x replay speed.

35/30 = 1.167 # Replay speed to get codfw to the recorded rate
(35/30) * (4/3) = 1.556 # Replay speed to get to current peak traffic
(35/30) * (4/3) * 1.5 = 2.333 # Replay speed for the estimated peak after 1 year of organic growth

> Did we also take into account codfw being smaller? If we recorded on 35 nodes in eqiad but only replayed from 30 nodes in codfw, then we are replaying 86% of the actual traffic.

No, this was not taken into account, since we are already expecting to get codfw back to 36 nodes. Traffic from 30 nodes was captured and replayed on 30 nodes in codfw, which should mean that the per-node load is equivalent to what we will have once the cluster is back to its normal size.

> Excellent, can we document these findings on Wikitech so they are easy to find?

Sure, I can copy the summary somewhere. Is it really easier to find on wiki than in phab? (I get lost in both).

The hacked scripts used for this test are at https://github.com/gehel/es-load-test (it might make sense to move them to Gerrit).