
OpenSearch on K8s: test common operations
Closed, Resolved (Public)

Description

Per parent ticket, we need to have confidence that typical Kubernetes operations work when the OpenSearch operator is used.

Creating this ticket to work through the following operations and record results:

  • Minikube environment:
    • Change source image for OpenSearch cluster
    • Change number of replicas
    • Change resources (requests/limits)
    • Change size of PV
  • WMF Cluster
    • Change source image for OpenSearch cluster
    • Change number of replicas
    • Change resources (requests/limits)
    • Change size of PV
    • Rolling restart completed
    • Drain a Kubernetes node and check that the OS cluster tolerates it nicely
    • Evict the current cluster manager and check that a new one is elected
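
The last two WMF checks (node drain and cluster manager eviction) could be verified with something like the sketch below. It is a hedged illustration, not the procedure used on this ticket: it assumes the responses of OpenSearch's `_cluster/health` and `_cat/cluster_manager?format=json` endpoints have already been fetched and decoded, and the function names and sample values are hypothetical.

```python
from typing import Any


def cluster_is_settled(health: dict[str, Any], expected_nodes: int) -> bool:
    """Return True when a _cluster/health response shows a green cluster
    with every expected node present and no shards moving or unassigned."""
    return (
        health.get("status") == "green"
        and health.get("number_of_nodes") == expected_nodes
        and health.get("relocating_shards", 0) == 0
        and health.get("unassigned_shards", 0) == 0
    )


def manager_was_reelected(
    before: list[dict[str, str]], after: list[dict[str, str]]
) -> bool:
    """Compare two _cat/cluster_manager?format=json responses and report
    whether a different node now holds the cluster manager role."""
    if not before or not after:
        # No elected manager in one of the snapshots: not a clean re-election.
        return False
    return before[0]["id"] != after[0]["id"]


# Made-up sample responses, for illustration only.
health_after_drain = {
    "status": "green",
    "number_of_nodes": 3,
    "relocating_shards": 0,
    "unassigned_shards": 0,
}
manager_before = [{"id": "node-id-a", "node": "example-cluster-0"}]
manager_after = [{"id": "node-id-b", "node": "example-cluster-1"}]

print(cluster_is_settled(health_after_drain, expected_nodes=3))
print(manager_was_reelected(manager_before, manager_after))
```

Comparing node IDs rather than pod names is a deliberate choice in this sketch: a pod may come back under the same name, while the node ID identifies the OpenSearch node that actually holds the role.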

Event Timeline

bking changed the task status from Open to In Progress. Aug 26 2025, 2:12 PM
bking triaged this task as High priority.

Marking this as a subtask of T397246, as we'll need to confirm all of the above operations before we consider the cluster "deployed".

Something else I noticed (which shouldn't be too surprising) is that the persistent volumes are sticking around even when I destroy the cluster. That means that when I redeploy, the new cluster won't bootstrap because it still has the data for the old cluster.

I'm working on a way to wipe out the old PVs next. It's possible we could figure out a way to bring back an existing cluster that's been accidentally destroyed, but OpenSearch on k8s is not to be used as a primary datasource for anything. In that spirit, I'm not going to work on an existing cluster recovery procedure anytime soon.
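
For illustration only, one way such a cleanup might be scripted is to build a `kubectl delete pvc` command scoped by namespace and label selector. This is a minimal sketch, not the approach actually taken here: the function name, the namespace, and the selector value are hypothetical, and it assumes the leftover volumes are bound to PersistentVolumeClaims whose deletion releases them (which depends on the storage class reclaim policy).

```python
def pvc_cleanup_command(
    namespace: str, selector: str, dry_run: bool = True
) -> list[str]:
    """Build a kubectl command that deletes the PVCs matching a label selector.

    The namespace and selector are supplied by the caller; the values used
    below are placeholders, not the real labels on the cluster.
    """
    if not namespace or not selector:
        # Refuse to build an unscoped delete.
        raise ValueError("namespace and selector are both required")
    cmd = [
        "kubectl", "delete", "pvc",
        "--namespace", namespace,
        "--selector", selector,
    ]
    if dry_run:
        # Report what would be deleted without deleting anything.
        cmd.append("--dry-run=client")
    return cmd


# Hypothetical namespace and label, shown as a dry run.
print(" ".join(pvc_cleanup_command("example-namespace", "app=example-opensearch")))
```

Defaulting to a dry run keeps the sketch safe to try; the resulting list can be passed to `subprocess.run` once the selector has been confirmed to match only the old cluster's claims.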

Change #1223228 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] opensearch-ipoid: Expand pod disk size from 30->40 GB

https://gerrit.wikimedia.org/r/1223228

Change #1223228 merged by Bking:

[operations/deployment-charts@master] opensearch-ipoid: Expand pod disk size from 30->40 GB

https://gerrit.wikimedia.org/r/1223228

There are a couple of unfinished operations on this board:

  • Change number of replicas: The current design provisions exactly 3 pods, so we don't really need to test this.
  • Drain a Kubernetes node and check that the OS cluster tolerates it nicely: We have existing workarounds for this if need be (delete a pod; storage is not deleted, so downtime is minimal).

We'll repeat these same tests for the new OpenSearch operator in T414217. For the current OpenSearch operator deploy, we are finished. Closing...