Search Platform currently runs 3 separate OpenSearch clusters, with 2 of the 3 on each host.
We split the production OpenSearch cluster into 3 (ref T193654). While this helped us avoid a scaling issue around the number of shards/indices, it also added complexity to day-to-day management. Puppet code, cookbooks, and load balancer config are just a few of the places where we've been bitten.
While we shouldn't change a fast and stable system just because it's difficult to manage, I do think there's room to discuss changes.
Creating this ticket to:
- Collect ideas on how to reduce complexity.
- Discuss feasibility of each idea and decide whether or not to move forward.
Ideas (feel free to add yours):
- Re-integrate the three clusters into one (it's possible OpenSearch has gotten better at managing large numbers of shards)
- Run all 3 clusters on each host
- Move each cluster to its own discrete VM or container
T192972 and T215969 have tests that we should probably revisit if we do consider re-integrating the clusters.
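As a starting point for the re-integration discussion, it might help to know how many shards a single merged cluster would have to manage. Below is a minimal sketch that totals the shard counts reported by each cluster's `_cluster/health` endpoint. The cluster names and URLs are placeholders, not our real endpoints, and the helper names are made up for illustration.

```python
import json
from urllib.request import urlopen

# Placeholder endpoints (assumption): substitute the real cluster URLs.
CLUSTER_URLS = {
    "cluster-a": "http://localhost:9200",
    "cluster-b": "http://localhost:9400",
    "cluster-c": "http://localhost:9600",
}


def fetch_health(base_url, timeout=10):
    """Return the parsed JSON body of GET <base_url>/_cluster/health."""
    with urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def combined_shard_totals(healths):
    """Sum the shard counts over an iterable of _cluster/health responses."""
    totals = {"active_primary_shards": 0, "active_shards": 0}
    for health in healths:
        for key in totals:
            totals[key] += health[key]
    return totals
```

Calling `combined_shard_totals(fetch_health(url) for url in CLUSTER_URLS.values())` would give a rough figure for the shard count a merged cluster would carry. It says nothing about whether a single cluster handles that load well, which is what the tests in the tickets above would need to show.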