Status | Assigned | Task |
---|---|---|
Resolved | None | T310990 [1 sprint Timebox] Investigate solution to the ElasticSearch scaling issues |
Resolved | Evelien_WMDE | T327430 Estimate Costs of Scaling ES without optimisation |
Event Timeline
As input for the scaling scenarios, a very rough ballpark just to get an idea:
This year we estimate 1000 Wikibases created on Wikibase Cloud. This is excluding deleted Wikibases.
I assume, considering the nature of the product, we won't go through any sort of skyrocketing growth, but rather steady/stable.
So it would be great to see scaling scenarios of ~1000 by end of 2023, ~1700 by end of 2024, ~2500 by end of 2025.
Status Quo
The production cluster is currently composed of the following set of GCE instances:
No. of instances | Type | Monthly price per instance |
---|---|---|
3 | e2-standard-4 | $126,04764 |
1 | e2-standard-2 | $63,02382 |
2 | e2-medium | $31,51191 |
which totals $504,19 per month for compute.
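As a quick sanity check, the monthly total can be reproduced from the per-instance prices above (dot-decimal notation here, comma-decimal in the table):

```python
# Monthly compute cost of the current production cluster
# (GCE list prices per instance, in USD, as quoted above).
instances = [
    (3, 126.04764),  # e2-standard-4
    (1, 63.02382),   # e2-standard-2
    (2, 31.51191),   # e2-medium
]

total = sum(count * price for count, price in instances)
print(f"${total:.2f} / month")  # -> $504.19 / month
```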
On top of that we pay roughly $100 per month for the GKE service (which is accrued across production and staging), but this position would remain unchanged if we'd scale up the cluster as described below.
ElasticSearch recommendations
The company behind ElasticSearch runs a hosted version of the software as a service[0].
This service can be run on different cloud providers (e.g. AWS, GCP, Azure).
There is documentation on the configuration used[1] which I will be using as a starting point here.
[0]: https://www.elastic.co
[1]: https://www.elastic.co/guide/en/cloud/current/ec-gcp-vm-configurations.html
Considering our use case, a gcp.es.datahot.n2.68x10x45 configuration is currently recommended[2].
This means each ElasticSearch pod should ideally be provided 10 vCPUs and 68GB of RAM (note that this dwarfs our currently available resources) and run on an N2 instance type[3] (which has higher clock speeds).
I would assume this configuration recommendation is slightly overprovisioned in order to make ElasticSearch look snappy, but in the end this is what we also want.
[2]: https://www.elastic.co/guide/en/cloud/current/ec-gcp-configuration-choose.html
[3]: https://cloud.google.com/compute/docs/general-purpose-machines#n2_machines
Applying these recommendations to our cluster
A straightforward way of applying these recommendations to our Kubernetes cluster would be:
- use 3 nodes, each one of them hosting one ElasticSearch master
- all 3 nodes allow for the recommended resource consumption
- the remaining pods are distributed across these 3 nodes by Kubernetes itself
Option 1: The closest matching instance type here would be an n2-standard-16, which offers 16 vCPUs and 64GB of RAM.
This would give us the CPU we need, but we'd need to underprovision RAM in order to be able to run other pods on the same node.
Such an instance currently costs $584,40 a month, meaning we'd pay $1.753,20 in compute costs per month (plus $1.249,01).
Option 2: We use n2-highmem-16 instances, which offer 16 vCPUs and 128GB of RAM.
Applying the recommended configuration values for the ElasticSearch masters, we'd still have 18 vCPUs and 180GB of RAM left, which should be enough to satisfy the requirements of all other running services.
Such an instance currently costs $612,09 a month, meaning we'd pay $1.836,27 in compute costs per month (plus $1.332,08).
Considering the almost negligible difference in price, option 2 should be preferred as it frees us from having to think about RAM allocation all too hard.
Rightsizing our setup
Considering the initial issues and the decision to selectively disable CirrusSearch-based search for some tenants, it is close to impossible (at least with my limited knowledge) to reliably extrapolate today's requirements from what we currently know.
This means it's likely we'll have to gradually converge towards a setup that works well for us.
This can be done using two approaches:
Overprovision infrastructure, then scale down
In this approach, we'd scale up the cluster to be guaranteed to provide more compute resources than needed.
We then observe CPU / RAM / IO patterns and scale down the compute capacity again until we reach a desirable level of resource utilization.
Pros:
- It's relatively easy to determine the resources actually needed once we have an overprovisioned cluster.
- Users would get search from day 1 of the procedure.
Cons:
- This would probably increase costs drastically while in the migration phase.
- Can we even safely say what overprovisioned hardware is without choosing the biggest instance type?
Enable search features gradually while scaling up infrastructure resources
In this approach, we'd gradually scale up the resources available, while re-enabling search for smaller groups of users.
When we reach resource limits again, we further scale up resources.
This cycle will be repeated until all tenants have search enabled.
Pros:
- This is a cost-effective approach
- We'd get thorough insights on ElasticSearch performance characteristics
Cons:
- Might break service for (all) users at some point, or even multiple times
- This is a long running and time intensive procedure
Open questions
Extrapolating current requirements better
I find it very hard to extrapolate resource requirements from:
a. today's usage data
b. data from the incident in May
as there are a lot of moving parts involved and the behavior does not seem to be even close to linear.
Then again, I am fairly new and might be missing even the most simple things.
Is there any obvious path we might want to go down in order to get "some" number?
ElasticSearch cluster topology
I don't know if the 3-master-node setup is really required. Maybe we could get the same performance off a single ElasticSearch instance while losing only data redundancy (which would probably not be a hard requirement for us?).
How does this scale further
What do we do if such a 3 node setup seems to reach limits or seems to be wildly overprovisioned?
Hosted ElasticSearch seems to add / remove nodes from a cluster in order to scale.
How would we model this in our Kubernetes cluster?
Hypothesis 1: Extrapolating resource requirements from the number of shards
Looking into multiple factors that are creating bottlenecks in our usage of ElasticSearch, the high number of shards being created is definitely one of them. Under the hypothesis that this is how to solve the problem (which might not hold true), let's look at what would need to be done.
Observations
From running a few load tests locally and comparing my cluster metrics with production, I am under the impression that our limitations might be caused by the fact that we are creating a lot of shards, which is a hard requirement of our current design (unless we patch CirrusSearch itself and make it use multi-tenant indices).
This means that if we'd follow the ElasticSearch guideline for provisioning RAM per shard, which is:
> Aim for 20 shards or fewer per GB of heap memory
>
> The number of shards a data node can hold is proportional to the node’s heap memory. For example, a node with 30GB of heap memory should have at most 600 shards. The further below this limit you can keep your nodes, the better. If you find your nodes exceeding more than 20 shards per GB, consider adding another node.
our current setup (which is provisioning 4GB of heap) would be limited to hosting 80 shards per node. Currently, there seem to be 315 primary shards in production:
```
➜ ~ curl -sSL "http://localhost:9200/_cluster/health" | jq
{
  "cluster_name": "elasticsearch",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 315,
  "active_shards": 945,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}
```
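A minimal sketch of the guideline arithmetic, using the numbers from the health output above (the 4GB heap figure is our current per-node provisioning):

```python
# "Aim for 20 shards or fewer per GB of heap memory" applied to
# the production cluster health snapshot.
heap_gb_per_node = 4
shard_budget_per_node = 20 * heap_gb_per_node  # -> 80

active_shards = 945  # total shards, including replicas
data_nodes = 3
shards_per_node = active_shards // data_nodes  # -> 315

print(f"{shards_per_node} shards per node vs. a budget of {shard_budget_per_node}")
```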
Assumption: number of shards maps to the number of wikis
To extrapolate requirements, this comment will assume that the number of shards needed in ElasticSearch roughly equals the number of wikis. In case this relation is not correct, but there is still a linear mapping, the calculation can be adjusted.
Requirements
In order to accommodate 1000 wikis, we'd create 1000 shards, thus aiming for 50GB of heap memory. This roughly aligns with the ElasticSearch recommendations from the comment above, meaning that by scaling up to 1000 wikis through beefier hardware only, we might get away with plus $1.332,08 per month.
1700 wikis would require 85GB of heap and the use of n2-highmem-32 nodes (this needs to be fine-tuned further in order not to create excess resources).
2500 wikis would require 125GB of heap, which would also fit on n2-highmem-32 nodes (this needs to be fine-tuned further in order not to create excess resources).
^^^ This picks up and extends on the comment above.
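The heap figures above can be sketched under the stated (and possibly wrong) one-shard-per-wiki assumption:

```python
# Heap extrapolation, assuming the number of shards roughly equals
# the number of wikis and at most 20 shards per GB of heap.
def heap_needed_gb(wikis: int) -> float:
    shards = wikis  # assumption: one shard per wiki
    return shards / 20

for wikis in (1000, 1700, 2500):
    print(f"{wikis} wikis -> {heap_needed_gb(wikis):.0f}GB of heap")
```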
JVM memory considerations
According to this CNCF article, the JVM (Java Virtual Machine, which ElasticSearch is built upon) hits a "magic ceiling" at 64GB of RAM:
> One last thing to know, there is a reason why you can’t have enormous nodes with elastic. ElasticSearch uses the JVM and requires a trick to compress object pointers when heaps are less than around 32Gb. So whatever happens, don’t allocate more than 32Gb (64Gb total) to your nodes
(N.B.: this slightly contradicts the ElasticSearch recommendation of 68Gb, but maybe they plan for 4Gb of OS level overhead here)
For us, this means, that we should not scale beyond n2-highmem-16 instance types (unless we start placing multiple ElasticSearch pods on the same node which sounds like an antipattern [maybe it isn't though]). Once the cluster has more than 1920 shards (3 nodes * 20 shards * 32 GB), the way forward would be to add another n2-highmem-16 instance and place a fourth ElasticSearch master node on it, making room for 2560 shards.
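The scale-out rule above can be sketched as follows (the 32GB heap cap and the 3-node minimum are taken from the text; the function name is made up for illustration):

```python
import math

# Each ES node is capped at 32GB of heap (compressed object
# pointers), i.e. 20 * 32 = 640 shards per node, and we keep at
# least the current 3 master nodes.
def nodes_needed(shards: int, heap_cap_gb: int = 32) -> int:
    shards_per_node = 20 * heap_cap_gb
    return max(3, math.ceil(shards / shards_per_node))

print(nodes_needed(1920))  # -> 3: still fits on the 3-node cluster
print(nodes_needed(1921))  # -> 4: one shard over the limit forces a 4th node
```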
Effect on costs
Mixing this approach with the "number of shards maps to the number of wikis" hypothesis, scaling costs would look like:
# of wikis | cluster composition | costs / month | delta / month |
---|---|---|---|
status quo with working search | 3 x n2-highmem-16 | $1.836,27 | $1.332,08 |
1.000 | 3 x n2-highmem-16 | $1.836,27 | $1.332,08 |
1.700 | 4 x n2-highmem-16 | $2.448,36 | $1.944,17 |
2.500 | 4 x n2-highmem-16 | $2.448,36 | $1.944,17 |
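The cost columns can be reproduced from the quoted per-instance price (dot-decimal notation here, comma-decimal in the table):

```python
# N n2-highmem-16 instances at the quoted $612.09/month each,
# delta vs. the $504.19/month status quo.
PRICE_N2_HIGHMEM_16 = 612.09
STATUS_QUO = 504.19

def monthly_cost(nodes: int) -> tuple[float, float]:
    total = nodes * PRICE_N2_HIGHMEM_16
    return round(total, 2), round(total - STATUS_QUO, 2)

print(monthly_cost(3))  # (1836.27, 1332.08)
print(monthly_cost(4))  # (2448.36, 1944.17)
```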
General approach for scaling the ElasticSearch cluster by means of hardware
Considering the limit of 64Gb of RAM for each node, scaling ElasticSearch from the status quo (3 master ES nodes with 8Gb of RAM each) would need to follow this pattern:
- scale by using beefier instance types until each of the 3 master ES nodes is allocated 64Gb of RAM (at k8s level)
- if we need to scale further, add more master (or data) nodes to the ES cluster, using an instance type that can provision 64Gb of RAM. There is close to no benefit in provisioning even bigger instances when the 64Gb limit has been reached.
Summary for Sprint Review
Scaling cost estimation
Based on the assumption that our main bottleneck with ElasticSearch is the high number of shards being created, we need to make sure ElasticSearch is provisioned a lot of RAM; CPUs with higher clock speeds will potentially also help.
Applying the recommendations of different sources, as well as my own observations to our existing GKE cluster, this would mean:
# of wikis | node composition | compute costs / month | delta / month |
---|---|---|---|
status quo | 3 x e2-standard-4, 1 x e2-standard-2, 2 x e2-medium | $504,19 | $0,00 |
status quo with working search | 3 x n2-highmem-16 | $1.836,27 | $1.332,08 |
1.000 | 3 x n2-highmem-16 | $1.836,27 | $1.332,08 |
1.700 | 4 x n2-highmem-16 | $2.448,36 | $1.944,17 |
2.500 | 4 x n2-highmem-16 | $2.448,36 | $1.944,17 |
See the comments above for an extended discussion of how we get to these numbers. Note that this is an educated guess that might be very wrong.
How do we get to the real number
If we decide to solve the problem by means of hardware, we need to pick an approach for rightsizing the real-world production cluster.
Option 1: Overprovision infrastructure, then scale down
In this approach, we'd scale up the cluster to be guaranteed to provide more compute resources than needed as the first step. We then observe CPU / RAM / IO patterns and scale down the compute capacity again until we reach a desirable level of resource utilization.
Pros:
- It's relatively easy to determine the resources actually needed once we have an overprovisioned cluster.
- Users would get search from day 1 of the procedure.
Cons:
- This would probably increase costs drastically while in the migration phase.
- Can we even safely say what overprovisioned hardware is without choosing the biggest instance type?
Option 2: Enable search features gradually while scaling up infrastructure resources
In this approach, we'd gradually scale up the resources available, while re-enabling search for smaller groups of users. When we reach resource limits again, we further scale up resources. This cycle will be repeated until all tenants have search enabled.
Pros:
- This is a cost-effective approach
- We'd get thorough insights on ElasticSearch performance characteristics
Cons:
- Might break service for (all) users at some point, or even multiple times
- This is a long running and time intensive procedure
Looks like it's a bit late (the end of January was rather busy) but if memory is the most important thing, n2d (AMD Rome/Milan) instances appear to be ~13% cheaper and might be worth considering in the future, depending on whether you're CPU-bound and how well ES performs.