Explore ways to reduce complexity of OpenSearch environment (specific suggestions below)
Open, Medium, Public

Description

Search Platform currently runs 3 separate OpenSearch clusters, with 2 of them running on each host.

We previously split the production OpenSearch cluster into 3 (ref T193654). While this helped us avoid a scaling issue with the number of shards and indices, it also added complexity to day-to-day management. Puppet code, cookbooks, and load balancer config are just a few of the places where we've been bitten.

While we shouldn't change a fast and stable system just because it's difficult to manage, I do think there's room to discuss changes.

Creating this ticket to:

  • Collect ideas on how to reduce complexity.
  • Discuss feasibility of each idea and decide whether or not to move forward.

Ideas (feel free to add yours):

  • Re-integrate the three clusters into one (it's possible OpenSearch has gotten better at managing large numbers of shards)
  • Run all 3 clusters on each host
  • Move each cluster to its own discrete VM or container

T192972 and T215969 have tests that we should probably revisit if we do consider re-integrating the clusters.
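
As one input to the re-integration discussion, it may help to know how many shards a merged cluster would have to manage. The following is a minimal sketch, not an existing tool: it assumes the shard listings have already been fetched from each cluster (for example via `GET _cat/shards?format=json`) and parsed into lists of dicts. The function name `merged_shard_load`, the cluster names, and the sample rows are all hypothetical.

```python
from collections import Counter


def merged_shard_load(rows_by_cluster, node_count):
    """Summarize shard counts if several clusters were merged into one.

    rows_by_cluster: maps a cluster name to rows shaped like the JSON
        output of `_cat/shards?format=json`; only the "state" key is used.
    node_count: assumed number of data nodes in the merged cluster.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")

    per_cluster = Counter()
    for cluster, rows in rows_by_cluster.items():
        # Count only shards that are currently allocated and running.
        per_cluster[cluster] = sum(1 for r in rows if r.get("state") == "STARTED")

    total = sum(per_cluster.values())
    return {
        "per_cluster": dict(per_cluster),
        "total": total,
        "avg_per_node": total / node_count,
    }


# Hypothetical sample data; real input would come from each cluster's API.
sample = {
    "cluster_a": [
        {"index": "idx1", "shard": "0", "prirep": "p", "state": "STARTED"},
        {"index": "idx1", "shard": "0", "prirep": "r", "state": "STARTED"},
        {"index": "idx2", "shard": "0", "prirep": "r", "state": "UNASSIGNED"},
    ],
    "cluster_b": [
        {"index": "idx3", "shard": "0", "prirep": "p", "state": "STARTED"},
        {"index": "idx3", "shard": "1", "prirep": "p", "state": "STARTED"},
    ],
    "cluster_c": [
        {"index": "idx4", "shard": "0", "prirep": "p", "state": "STARTED"},
    ],
}

summary = merged_shard_load(sample, node_count=2)
print(summary)
```

The total and per-node average are only rough indicators; whether a single cluster can handle that load would still need to be confirmed by the tests referenced above.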

Event Timeline

bking renamed this task from "Consider collapsing Cirrus down into a single cluster" to "Explore ways to reduce complexity of OpenSearch environment (specific suggestions below)". May 20 2025, 9:43 PM
bking updated the task description.
bking updated the task description.
bking added subscribers: TJones, Gehel, pfischer and 2 others.
Gehel triaged this task as Medium priority. Jun 20 2025, 8:20 AM
Gehel moved this task from Incoming to Toil / Automation on the Data-Platform-SRE board.