
Elasticsearch health and capacity planning FY2016-17
Closed, ResolvedPublic

Description

Maintaining the elasticsearch cluster to support various search APIs (opensearch, prefix, full text, more like, ???) is core work of the discovery team. Determine what additional hardware, if any, is necessary to support these APIs through the next fiscal year. This needs to be decided by the time the budget is due, Feb 3rd.

Event Timeline

EBernhardson claimed this task.
EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description. (Show Details)
EBernhardson added a project: Discovery.
EBernhardson added subscribers: EBernhardson, dcausse, Joe.
Restricted Application added subscribers: StudiesWorld, Aklapper.Jan 25 2016, 4:18 AM
EBernhardson updated the task description. (Show Details)Jan 25 2016, 4:19 AM
EBernhardson set Security to None.
EBernhardson added a comment.EditedJan 25 2016, 4:55 AM

Over the last year elasticsearch load has increased fairly significantly: enough that the current eqiad hardware was not able to keep up with the work (T124100). We need to determine whether any additional hardware is required to ensure we can absorb organic growth through the next fiscal year without hitting capacity issues.

For reference, graphite reports[1] that the load average for the entire cluster rose from peaks of around 180 a year ago to peaks just over 400 recently. This left our older servers, elastic10{01..16}, at 80%+ CPU usage during peak hours and caused dramatic increases in latency during those periods. Shifting the more like feature to our second cluster in codfw has alleviated the issue for now, but is not a great long term solution. The older hardware is already being replaced in the next six months or so, so additional computational capacity is being added even before any increases we decide on here. But is it enough?

Ideally we need enough capacity in both the eqiad and codfw data centers for either one to handle the full query load should the other become unavailable, along with enough capacity for organic growth through the end of the next fiscal year (~18 months from now). Unfortunately we don't seem to have enough data stored anywhere about query patterns to fully know all the factors that led to the doubling of our server load over the last 12 months, but for capacity planning purposes I think we should plan for another 100% increase in usage over the next 12 months.

We have a couple options:

1. increase the size of one or both existing elasticsearch clusters

Initial load testing of the codfw cluster is looking promising. We are currently seeing slightly more capacity than exists in eqiad, and after some adjustments to the sharding we expect to see close to double the capacity eqiad has maxed out at. When the older servers are maxed out at ~80% CPU usage, the newer servers are in the 35% range. This suggests, but does not guarantee, that the planned replacement of the 16 older servers with machines matching or exceeding the spec of elastic10{17..31} will allow for a 100% increase in capacity over what exists today.

2. have two clusters per datacenter, split by feature

One option we have discussed, but not made any decisions on, is splitting off a second cluster in each datacenter. Specifically, we have discussed a separate cluster for handling the completion suggester: as an in-memory algorithm it has rather different needs from the full text search we provide.

3. have two clusters per datacenter, split by wiki size

We currently have 2718 indices in our elasticsearch cluster. Indices where the primary + replica add up to less than 100MB make up almost 2/3 (1781) of those indices. Indices less than 1GB (again primary + replica) make up just under 87% of the total index count. Elasticsearch is not written in a way that gracefully supports such a large number of indices in a single cluster. We have been running into issues where the master node routinely fails to respond to updates within the default 30s timeout, and sometimes even fails with a 2 minute timeout. To ensure updates get through we have had to, at times, increase the master timeout to 10 minutes. Splitting the cluster into a large cluster containing the busiest wikis and a small cluster serving the smaller wikis may help to alleviate this.
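
For illustration, a minimal sketch of how this size distribution can be bucketed from the _cat/indices API (store.size covers primaries plus replicas); the endpoint here is a placeholder, and the counts quoted above came from the live cluster, not from this script:

```python
import requests
from collections import Counter

ES = "http://localhost:9200"  # placeholder; substitute the real cluster address

# bytes=b makes _cat/indices report sizes as plain byte counts;
# store.size includes primaries plus replicas, matching the figures above.
resp = requests.get(f"{ES}/_cat/indices", params={"h": "index,store.size", "bytes": "b"})

buckets = Counter()
for line in resp.text.splitlines():
    parts = line.split()
    if len(parts) != 2:
        continue  # skip indices with no reported size
    size = int(parts[1])
    if size < 100 * 1024**2:
        buckets["< 100MB"] += 1
    elif size < 1024**3:
        buckets["100MB - 1GB"] += 1
    else:
        buckets[">= 1GB"] += 1

total = sum(buckets.values())
for bucket, count in sorted(buckets.items()):
    print(f"{bucket}: {count} indices ({count / total:.0%})")
```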

Summary

So what is the right option here? From a cluster health perspective, I like option 3: building a second, smaller cluster in each datacenter to handle the multitude of smallish wikis (exact cutoff tbd). I expect this would improve the health of the master nodes while also opening up some room on the servers for data to grow. I suspect that a small 4 or 5 node cluster in each data center could handle this, but that has not been fully planned out. I'm not entirely sure yet if this is worth it though; even with only 4 servers, this new cluster would likely be mostly idle.

I'm guesstimating that the current cluster in codfw and the planned upgrades to eqiad will leave us with the ability to double our usage from current levels, but we may need additional capacity to cover not only 12 months but the full 18 months between now and the end of the next fiscal year. Due to the way sharding works in elasticsearch, we would need to increase server counts to a multiple of the shard count of enwiki (the consumer of >50%, probably >75%, of all search server resources). We can adjust this shard count to 6, 7 or 8. Currently enwiki is being adjusted to 6 shards to fully utilize the 24 nodes in codfw.

Servers involved in serving enwiki by number of shards

based on current cluster sizes:

6 shards: eqiad: 30 codfw: 24
7 shards: eqiad: 28 codfw: 21
8 shards: eqiad: 24 codfw: 24

The current plan to adjust enwiki from 7 shards to 6 allows two more servers in the eqiad cluster to participate in serving enwiki, an increase of ~7%.

Next step up and % increase from current resources (from 6 enwiki shards):

6 shards: eqiad: 36 (30%) codfw: 30 (25%)
7 shards: eqiad: 35 (25%) codfw: 28 (17%)
8 shards: eqiad: 32 (18%) codfw: 32 (33%)
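
For clarity, the arithmetic behind these tables is simply the largest multiple of the enwiki shard count that fits in each cluster (each node holding at most one copy of a given shard). A minimal sketch, using the current node counts of 31 in eqiad and 24 in codfw:

```python
def enwiki_servers(cluster_nodes: int, shards: int) -> int:
    # Largest multiple of the shard count that fits in the cluster, i.e. the
    # number of nodes that can hold an enwiki shard copy when each node holds
    # at most one copy of any given shard.
    return (cluster_nodes // shards) * shards

for shards in (6, 7, 8):
    print(f"{shards} shards: eqiad: {enwiki_servers(31, shards)} codfw: {enwiki_servers(24, shards)}")
# 6 shards: eqiad: 30 codfw: 24
# 7 shards: eqiad: 28 codfw: 21
# 8 shards: eqiad: 24 codfw: 24
```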

Sorry if this is all a bit rambling...just trying to get all the ideas out there so we can discuss possibilities.

[1] http://graphite.wikimedia.org/S/q

EBernhardson renamed this task from Elasticsearch capacity planning FY2016-17 to Elasticsearch health and capacity planning FY2016-17.Jan 25 2016, 4:55 AM
EBernhardson added a subscriber: Tfinc.

I think a reasonable target to set for server load is as follows:

100% increase in 12 months, to match the server load increases in the last 12 months
Eqiad showed marked increases in search latency with servers at ~80% cpu usage, so each datacenter needs to have the ability to serve all expected query traffic 18 months from now at a cpu usage of <= 70%.

EBernhardson updated the task description. (Show Details)Jan 25 2016, 5:08 AM
EBernhardson added a comment.EditedJan 25 2016, 7:46 PM

If my math is right, a 100% increase in 12 months extrapolated to 18 months gives

current capacity = 1
increase in 12 months: 2x
increase in following 6 months: 1.5x
total increase: 3x

My gut says a 3x increase in usage over 18 months is too much, but I'm uncertain.

We should take into account that a notable portion of the increase in resource usage has been due to the reading team's expanded usage of the 'more like' feature. This is currently a beta feature, so if it rolls out to full production we should expect a large increase in usage. This feature is currently uncached, though; we are looking into caching these results for 24 hours (T124216), which is expected to cut the number of more like requests to 1/5 of current usage.

Also coming up are the geodata queries. These aren't quite as expensive as more like, but are much more expensive than typical full text search. They are currently used very little, but the reading team has expressed interest in integrating geodata search as part of a 'nearby' feature.

dcausse added a comment.EditedJan 26 2016, 3:15 PM

Yes, it's extremely hard to guess; the morelike problem makes it hard to evaluate.

Cluster wide:
Without serving morelike queries tp95 starts to move at 1200qps (prefix), and 600qps (fulltext).
Serving morelike on the same cluster tp95 starts to move at 900qps (prefix), and 500qps (fulltext).
tp95 for morelike queries when they are run alone in codfw does not seem to move.
But when they are added to eqiad, all tpXX percentiles move regardless of the qps we serve...

On a per node basis:
For newer nodes (elastic1016+ and elastic2*), since we can't record tp for individual nodes, this is a rough estimation based on avg query time:

  • prefix more than 450qps on codfw nodes => avg time is more than 4ms
  • fulltext more than 150qps
  • morelike: more than 30qps

A 3x increase would mean spikes at 9000qps (6600 prefix)
6600/(replicas+1) = 450
replicas = 6600/450 = 14
With the number of shards set to 7 it's a cluster of 98 machines; 6 shards => 84 machines ...
This is a more than rough estimation with existing morelike; consider a very high margin of error :(

Without morelike I think one node can serve 450*1.33 ~= 600qps prefix with constant tp95 (unverified)
so replicas = 6600/600 => 11
7*11 => 77 machines, 6*11 => 66 machines

If we don't care about tp95 I think a node can serve between 900 and 1000qps prefix before dying.
6600/1000 => 7
7*7 => 49 machines or 6*7 => 42 machines
Without morelike, if we apply the 1.33 ratio we could in theory serve 1300 qps prefix per node:
6600/1300 => 5
7*5 => 35 machines or 6*5 => 30 machines
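
The arithmetic above boils down to: shard copies needed = peak prefix qps / sustainable per-node prefix qps, and machines = shards x copies, with one copy per node. A minimal sketch; the figures above round the intermediate copy counts slightly differently, so totals differ by a few machines:

```python
def machines(target_qps: float, qps_per_node: float, shards: int) -> float:
    copies = target_qps / qps_per_node  # primary + replicas of each shard
    return shards * copies              # one shard copy per node

# 6600 prefix qps peak (a 3x increase) against the per-node limits estimated above.
for qps_per_node, label in [(450, "constant tp95, with morelike"),
                            (600, "constant tp95, without morelike"),
                            (1000, "degraded tp95, with morelike"),
                            (1300, "degraded tp95, without morelike")]:
    counts = ", ".join(f"{shards} shards: ~{machines(6600, qps_per_node, shards):.0f}"
                       for shards in (7, 6))
    print(f"{label}: {counts}")
```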

So yes, a 3x increase seems like something we cannot afford, especially if we care about tp95 (with 7 shards).
I think the answer for 3x increase is something between:

  • bad tp95 with current morelike usage: replace the old nodes (16) and add +11 in eqiad, +18 in codfw:
    • total machines to buy: 45
  • bad tp95 without current morelike usage: replace old nodes and +4 in eqiad, +11 in codfw
    • total machines to buy: 31
  • constant tp95 with current morelike usage: replace old nodes and add +67 in eqiad, +74 in codfw:
    • total machines to buy: 157
  • constant tp95 without current morelike usage: replace old nodes and add +46 in eqiad, +53 in codfw:
    • total machines to buy: 115

I'm not sure these numbers are very helpful; it's extremely complicated to do a precise estimation :(

Side note: I think we can really optimize server usage by splitting the cluster by feature.
From what I understand of this paper[1], parallelization of slow queries can really help; they claim to save 1/3 of their total production servers.
We could work on query parallelization at the elastic level, but I also understand that we could somewhat optimize server utilization by splitting clusters:

  • One cluster for search type ahead (prefix search/completion): expected latency per node: 3-5 ms
  • One cluster for regular fulltext search: expected latency per node: 20-50 ms
  • One cluster for slow queries (morelike, regex): expected latency per node: ~200ms

On the other hand, splitting by feature will add significant overhead to operational maintenance time :(

[1]: http://research.microsoft.com/pubs/225008/paper.pdf

TJones added a subscriber: TJones.Jan 26 2016, 7:14 PM

David mentioned this ticket, and I had to take a peek.

> If my math is right, a 100% increase in 12 months extrapolated to 18 months gives
> current capacity = 1
> increase in 12 months: 2x
> increase in following 6 months: 1.5x
> total increase: 3x
> My gut says a 3x increase in usage over 18 months is too much, but I'm uncertain.

Math Nerd says: 100%/yr is very close to 6%/mo. Extrapolating to 18 mo gives 1.06^18 = ~2.85x. Not quite 3x, but close. OTOH, 5%/mo is ~2.4x at 18 months, and 7% is ~3.4x at 18 months. Estimating this kind of thing is really hard, even with pretty good data.
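
To make the comparison concrete, a quick sketch of both extrapolations (the simple double-then-x1.5 estimate versus compounded monthly growth), plus the 5%/mo and 7%/mo sensitivity bounds mentioned above:

```python
# Simple estimate: 2x over 12 months, then 1.5x over the following 6 months.
linear_18mo = 2 * 1.5                  # 3.0x

# Compounded monthly rate that doubles in 12 months: 2 ** (1/12) ~= 1.0595/mo.
monthly = 2 ** (1 / 12)
compounded_18mo = monthly ** 18        # = 2 ** 1.5 ~= 2.83x

# Sensitivity: 5%/mo and 7%/mo compounded over 18 months.
low, high = 1.05 ** 18, 1.07 ** 18     # ~2.4x and ~3.4x

print(f"linear: {linear_18mo:.2f}x, compounded: {compounded_18mo:.2f}x, "
      f"range: {low:.2f}x - {high:.2f}x")
```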

Two options come to mind: first, see if we have better data, and if so get Oliver or Mikhail to do a proper analysis (may not be data, may not be time); second, split the difference and stick with the 100%/yr estimate and plan for it, but also start a serious optimization thread in our goals and projects over the next 18 months. There seem to be some great ideas here already—like caching and cluster optimization—and we could probably come up with more, including optimizing the most expensive parts of Cirrus/Elastic (and gather the data to figure out what those are). A reasonably successful optimization effort would be justified by the hardware cost savings.

</2¢>

In my previous life managing search clusters we split our corpus by geographic location, then function (full text, prefix, etc), and then partitioned the data to fit into the nodes that we had available (P1, P2, etc). This allowed us to get much more predictable estimates for query patterns relative to year over year traffic trends. Since languages aren't bound tightly to geographic locations, language becomes the next best estimate. Given that, #3 is the most enticing, followed by #2.

Splitting by feature is really interesting, as it would allow us to break out cost by the highest cpu/load features. But that assumes that over the next 18 months we don't change/update/remove any of those features.

That's a high risk to take as we're actively looking at new ways of making search better. We'll have a better understanding of it as the year moves on.

I think either one could work, and this is a great conversation about where we can improve. Hoping we can hear from @Joe before our core budget is submitted.

EBernhardson added a comment.EditedJan 28 2016, 7:04 AM

I've completed load testing of the new codfw cluster, including testing some changes we are planning to make to CirrusSearch. Quick top-level stats as follows:

Cluster has 24 nodes, 768 cores

| condition | peak qps/node | peak cluster qps | peak cluster qps / nodes |
| current | 1.1k-1.2k | 26k | 1k/s |
| more like: opening_text | 1.5k-1.7k | 35k | 1.4k/s |
| more like: 24h cache and opening_text | 1.8k-2k | 39k | 1.6k/s |

We are guesstimating that peak loads 18 months from now might be 9k user queries per second. This estimate might be a bit high, because it includes the load increases from more like, which we are in the process of optimizing. I don't know to what level it should be reduced to take that into account, though, so I'm just going to keep using 9k for the moment. Our current peak average fan-out of user queries is 5, meaning 9k user queries translates into 45k elasticsearch queries. We could use a value of 6 to be safe, which is the number of shards enwiki is running, giving a nice even 50k elasticsearch queries per second.

Server counts here are rounded up to the next multiple of 6 to match enwiki sharding.

| condition | servers necessary per cluster | queries per server |
| current | 48 | 1k/s |
| more like: opening text | 36 | 1.4k/s |
| more like: 24h cache and opening text | 30 | 1.7k/s |
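
For reference, the server counts in this table fall out of taking the ~45k elasticsearch queries/s estimated above (9k user queries/s at a fan-out of 5), dividing by the measured per-node rate, and rounding up to a multiple of 6 to match enwiki sharding. A minimal sketch:

```python
import math

def servers_needed(cluster_qps: float, qps_per_node: float, multiple: int = 6) -> int:
    # Round the raw node count up to a multiple of the enwiki shard count.
    return math.ceil(cluster_qps / qps_per_node / multiple) * multiple

peak_es_qps = 9_000 * 5  # 9k user queries/s at an average fan-out of 5
for label, per_node in [("current", 1_000),
                        ("more like: opening text", 1_400),
                        ("more like: 24h cache and opening text", 1_700)]:
    print(f"{label}: {servers_needed(peak_es_qps, per_node)} servers")
# current: 48, more like: opening text: 36, more like: 24h cache and opening text: 30
```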

With the change to opening text, and if caching of more like works out to the 80% reduction in query load we have estimated, we may be able to get away with 30 servers per cluster. This would keep eqiad the same size as it is currently, and codfw would need to be expanded by 6 servers.

Unfortunately elasticsearch does not report percentiles, so I don't have any information about how the 99th percentile moves in this load test. Being a little more conservative, we could go with 36 servers in each cluster, meaning 5 more in eqiad and 12 more in codfw, for a total of 17 additional servers.

I should also note that extrapolating from this load test assumes the ratio of query types does not change. An increase in geodata usage, or caching of more like queries not working out as well as we have guessed, would change this estimate. The estimate of 9k user queries/s also doesn't take into account reductions due to caching more like, though. Usage of more like could likely be pushed back against with longer cache times, but my gut says increased usage of geodata (nearby) search will be difficult to cache.

We additionally still need to determine what to do about splitting the clusters into two per datacenter. Tomorrow I will try to do some estimates on disk space requirements; splitting by feature would mean we need to keep multiple copies of the same indices in different clusters. It would also mean that indexing costs can't be spread between the clusters; the work would be repeated on both.

@RobH Would you be able to chime in here with a ballpark estimate of per-server costs? Like WDQS, this isn't for immediate purchase; this is to include in the budget for next FY. We estimate nodes would be added in Q2 (November-ish). We are looking at likely either 6 or 17 new nodes (tbd).

Elasticsearch nodes are:

Not sure what kind of server they are
32 cores
128G memory

eqiad has 2x400G SSDs
codfw has 2x800G SSDs

I would prefer to stick with 800G SSDs in any new nodes; the older nodes with 2x400G disks are at ~75% capacity, which doesn't leave a ton of room for growth in data size. In terms of CPUs, codfw currently uses Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz. As our load tests show CPU to be the limiting factor, I would prefer to stay at or above this level.

If we were to split the cluster by wiki, our disk usage should stay pretty consistent with growth over the last year, merely split between clusters. If we were to split by feature (expensive vs cheap), we would end up using a good bit more disk space, as each cluster needs its own replicated copies of the data.

Current elasticsearch disk space usage:
Primary data only: 3TB
1p+2r: 9TB

Current disk space in eqiad: 493GB * 31 servers = 15TB
Current disk space in codfw: 705GB * 24 servers = 17TB

Unfortunately elasticsearch isn't great at balancing this data size across the cluster; disk utilization on a per-node basis varies from 30% to 80%.

If we were to split off a cluster for expensive queries we would want to maintain the same 1p+2r setup, basically keeping 9TB plus room for growth on each side. Because these are SSDs, which perform best when they have some free blocks to shuffle data into when performing writes, we likely want to aim for <= 75% disk utilization, for a per-cluster disk space of 12TB (for elasticsearch; the nodes additionally need some space for the OS).

Indexing load looks pretty low; the codfw cluster, when only performing indexing, sits at around 2-3% CPU usage. When we add the full set of more like queries from production to codfw, CPU utilization pushes up to 11-12% at peaks, suggesting we could serve the current more like load with maybe 8 machines and have enough overhead for a 2-3x increase in queries over the next 18 months (very rough guess). At 8 machines, maintaining a 1p+2r setup would be quite difficult. Bumping to 9 machines would mean only 3 shards, which is still a bit low. We could consider lifting the limit of 1 shard per node to 2 or 3 shards per node; the main issue there, though, is that elasticsearch doesn't do a great job of evenly distributing those shards around the cluster.

Going with the 9 machine idea, we would need around 1.33TB (12TB / 9 nodes) of disk space per node. The older machines in eqiad are running 2x300G SSDs in a RAID0 for an effective 600GB of disk (493G available to elasticsearch). The newer machines in codfw were set up slightly differently; they are using 2x800G SSDs in a RAID1, which gives 705GB per node available to elasticsearch. To get 1.5TB per node (for a hypothetical 9 machine cluster for expensive queries), eqiad machines would need either 4x800G in RAID1 or 2x800G in RAID0. I'm not sure yet which way would be preferable there. In terms of data reliability, RAID1 vs RAID0 isn't a big deal; we keep multiple copies of the data around the cluster to deal with that.

I'm not sure how to estimate the reduced load from moving expensive queries off the main cluster. We've seen in load testing that optimizing more like was able to increase per-node QPS from 1.1-1.2k/node to 1.8-2k/node. It seems plausible that splitting expensive queries off from the main cluster could allow for a 2x increase in queries we can serve from the "fast query" cluster, reducing the size of the cluster serving faster queries by 33%-50% from its current size. We additionally know that, currently, more like queries are ~4% of the total queries served. I'll estimate that our split of the estimated 9k peak user queries per second would be 8k fast and 1k slow. We currently see peaks of ~150 more like QPS, so 1k might be a bit high. These are multiplied by 6 to estimate fan-out to shards. Again the node counts are rounded up to multiples of 6 nodes.

| cluster | nodes | disk total | disk/node | QPS/node |
| "fast" query cluster | 30 | 12TB | 400GB | 1.6k/s |
| "slow" query cluster | 9 | 12TB | 1.33TB | 666/s |

I feel pretty comfortable with the "fast" query cluster, but I'm uncertain about the slow cluster.

dcausse added a comment.EditedFeb 1 2016, 5:58 PM

I agree: splitting by feature is not easy and maybe not appropriate for the moment.

Random thoughts:
In the future I think one strategy could be:
1/ cluster for search type ahead (completion + prefix). Small disks but fast CPUs.
2/ cluster for regular/expensive searches

Index size for search type ahead should be rather small; we could maybe merge the data needed for prefix search into the titlesuggest index. Realtime is not very important for this kind of query, so this cluster could perform at optimal speed (no writes/no merges/only fast queries).

We could maybe re-open this discussion (cluster/feature) if we have problems with tp99 for search type ahead?

Gehel added a subscriber: Gehel.Feb 1 2016, 6:40 PM
EBernhardson moved this task from Needs triage to Ops on the Discovery board.Feb 11 2016, 11:21 PM
EBernhardson closed this task as Resolved.Sep 29 2016, 4:19 PM

Capacity planning is completed. Closing this out.