
Ensure Search Platform-owned Elasticsearch cookbooks can handle Opensearch
Open, In Progress, High, Public

Description

We need to ensure our cookbooks can still handle common operations (banning nodes, restarting services, etc.) during and after our migration from Elasticsearch to OpenSearch. As noted in T391151, our current Elastic/OpenSearch cookbooks don't always meet our needs. The elasticsearch-python library dependencies in Spicerack can also affect other teams; see T390860 and this CR for examples.

Creating this ticket to:

  • Examine existing cookbooks
    • ban.py
    • force-shard-allocation.py
    • force-unfreeze.py
    • reset-read-only.py
    • restart-nginx.py
    • rolling-operation.py
  • Identify potential improvements (use other teams' OpenSearch cookbooks? Use python requests instead of ES/OS python libraries? Move the logic out of Spicerack completely?)
  • Implement the improvements
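One of the options above, replacing the ES/OS client libraries with plain HTTP calls, could look roughly like this. This is a sketch, not a proposed implementation: it uses the stdlib urllib to stay dependency-free (a real cookbook would likely use requests), and it relies only on the _cluster/health endpoint, which has the same shape on Elasticsearch and OpenSearch.

```python
import json
from urllib.request import urlopen


def get_cluster_health(base_url: str) -> dict:
    """Fetch /_cluster/health as a dict. The endpoint is identical on
    Elasticsearch and OpenSearch, so no client library is needed."""
    with urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)


def is_green(health: dict) -> bool:
    """Pure predicate on the health document, testable without a live cluster."""
    return health["status"] == "green"
```

Keeping the predicate separate from the HTTP call makes the health logic unit-testable without standing up a cluster.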

Event Timeline

bking updated Other Assignee, added: RKemper.

Per today's conversation with my teammates in Data Platform SRE, we have decided that the rolling-operation cookbook provides enough value in terms of orchestrating the batching and healthchecks that it should be used to carry out the migration.

As such, I'm setting this task as a blocker to the production migration (T388610).

Change #1129373 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] rolling-operation.py: Put back a reference to nodes.start_elasticsearch()

https://gerrit.wikimedia.org/r/1129373

Change #1129373 merged by Ryan Kemper:

[operations/cookbooks@master] rolling-operation.py: Put back a reference to nodes.start_elasticsearch()

https://gerrit.wikimedia.org/r/1129373

Per the above patch, rolling-operation.py works. However, I hit a timeout when I used the --allow-yellow flag: Error while waiting for yellow with no initializing or relocating shards. The cluster was green at the time, which makes me suspect the cookbook checks strictly for 'yellow and no relocating shards'. What we actually want is 'green' (unqualified) OR 'yellow and no relocating shards', so that a green cluster also passes. Will follow up on this hunch tomorrow.
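To make the hunch concrete, a hypothetical version of the check as it should behave (this is illustrative, not the actual Spicerack code; field names follow the _cluster/health response):

```python
def restart_can_proceed(health: dict) -> bool:
    """Hypothetical --allow-yellow check: green is always acceptable,
    and yellow is acceptable only when no shards are initializing or
    relocating. A strict 'yellow only' check would time out on a green
    cluster, which matches the behavior observed above."""
    if health["status"] == "green":
        return True
    return (
        health["status"] == "yellow"
        and health["initializing_shards"] == 0
        and health["relocating_shards"] == 0
    )
```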

Icinga downtime and Alertmanager silence (ID=b3f79c36-c233-46b5-bfaf-6d5a09ed70df) set by bking@cumin2002 for 4:00:00 on 5 host(s) and their services with reason: troubleshooting red status

cloudelastic[1007,1009-1012].eqiad.wmnet

I ran the cookbook without the --allow-yellow flag:

sudo cookbook sre.elasticsearch.rolling-operation cloudelastic "try rolling operation without allow-yellow flag" --restart --task-id T389119 --nodes-per-run 1

It still caused a red status on Cloudelastic. Until we can figure out what's going wrong, we need to either:

  • migrate without the rolling-operation cookbook, as originally planned in T388610
  • hold up the migration until we can fix the issue

Adding a few notes from pairing with @RKemper today:

  • We repeated the rolling operation today with the same arguments as above, and got the exact same results (red status).
  • The operation seems to work OK for the first 2 hosts, then all clusters go red when it gets to the third host.
  • I sent the command 'for n in $(cat /etc/opensearch/instances); do systemctl start $n; done' to all hosts when we were in red status, but I don't think it actually helped anything. I'm pretty sure it recovered on its own.
  • We noticed that there are unit files for both opensearch_1@cloudelastic-eqiad.service and opensearch_1@cloudelastic-chi-eqiad.service. The unit opensearch_1@cloudelastic-eqiad.service was showing in a failed state, and I don't think we want both of these. Update: this was a red herring. You can trigger the same behavior by running systemctl start any-invalid-unit-file-name.service, which will fail and set off alerts. So I probably created this problem myself by trying to systemctl start opensearch_1@cloudelastic-eqiad.service (which doesn't actually have a unit file).

Further investigation is necessary.

Change #1130624 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: DO NOT MERGE (just for PCC)

https://gerrit.wikimedia.org/r/1130624

Change #1130624 abandoned by Bking:

[operations/puppet@production] cloudelastic: DO NOT MERGE (just for PCC)

Reason:

figured out what I needed from this. closing...

https://gerrit.wikimedia.org/r/1130624

Change #1130701 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: replace failed master-eligible host

https://gerrit.wikimedia.org/r/1130701

Change #1130701 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic: replace failed master-eligible host

https://gerrit.wikimedia.org/r/1130701

Change #1132024 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] relforge: enable rack awareness

https://gerrit.wikimedia.org/r/1132024

Change #1132024 merged by Bking:

[operations/puppet@production] relforge: enable rack awareness

https://gerrit.wikimedia.org/r/1132024

Following up, the above issue was due to cluster quorum and didn't have anything to do with the cookbook.

There are a few things we need to address in Spicerack before we can move forward with the migration:

  • --allow-yellow, which enables the check_yellow_w_no_moving_shards function, doesn't quite fit our scenario. We'll need a function that allows the cluster to restart when we have UNASSIGNED ALLOCATION_FAILED replica shards. This is because OpenSearch hosts can't replicate their primary shards to Elasticsearch.

Fixed thanks to @RKemper
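For the record, a hypothetical predicate for that mixed-cluster scenario might look like this. The field names follow the parsed output of GET /_cat/shards?format=json (with the unassigned.reason column requested); this is a sketch of the desired behavior, not the merged fix.

```python
def only_failed_replicas_unassigned(shards: list) -> bool:
    """Allow a rolling restart to proceed only when every UNASSIGNED
    shard is a replica ('r') with reason ALLOCATION_FAILED -- the
    expected state while OpenSearch primaries cannot replicate to
    Elasticsearch nodes. Any other unassigned shard blocks the restart."""
    for shard in shards:
        if shard["state"] != "UNASSIGNED":
            continue
        if shard["prirep"] != "r":
            return False
        if shard.get("unassigned.reason") != "ALLOCATION_FAILED":
            return False
    return True
```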

Lower priority/not a migration blocker:

  • --allow-yellow requires BOTH no relocating shards AND yellow status; it doesn't accept a cluster with no relocating shards AND green status.

Change #1133967 had a related patch set uploaded (by Bking; author: Bking):

[operations/software/spicerack@master] WIP: more fine-grained shard status checks

https://gerrit.wikimedia.org/r/1133967

Change #1134043 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cirrussearch: add puppet 7 hieradata to DC-specific config

https://gerrit.wikimedia.org/r/1134043

Change #1134043 merged by Bking:

[operations/puppet@production] cirrussearch: add puppet 7 hieradata to DC-specific config

https://gerrit.wikimedia.org/r/1134043

Change #1133967 abandoned by Bking:

[operations/software/spicerack@master] WIP: more fine-grained shard status checks

Reason:

not needed after all

https://gerrit.wikimedia.org/r/1133967

Change #1131446 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] elasticsearch rolling-operation: add arguments for rename & reimage cookbooks

https://gerrit.wikimedia.org/r/1131446

Change #1131446 merged by Bking:

[operations/cookbooks@master] elasticsearch rolling-operation: add arguments for rename & reimage cookbooks

https://gerrit.wikimedia.org/r/1131446

Change #1135133 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] sre.elasticsearch.rolling-operation: handle negative caches between rename/reimage

https://gerrit.wikimedia.org/r/1135133

Change #1135133 merged by Ryan Kemper:

[operations/cookbooks@master] sre.elasticsearch.rolling-operation: handle negative caches between rename/reimage

https://gerrit.wikimedia.org/r/1135133

Change #1135826 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] sre.elasticsearch.rolling-operation: don't use http for dhcp for reimage

https://gerrit.wikimedia.org/r/1135826

Change #1135826 merged by Bking:

[operations/cookbooks@master] sre.elasticsearch.rolling-operation: use tftp, run puppet with new hostname

https://gerrit.wikimedia.org/r/1135826

Change #1136796 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] sre.elasticsearch.rolling-operation: refactor external cookbook invocations

https://gerrit.wikimedia.org/r/1136796

Change #1136796 merged by Ryan Kemper:

[operations/cookbooks@master] sre.elasticsearch.rolling-operation: refactor external cookbook invocations

https://gerrit.wikimedia.org/r/1136796

Mentioned in SAL (#wikimedia-operations) [2025-05-28T13:19:38Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002

Mentioned in SAL (#wikimedia-operations) [2025-05-28T13:20:17Z] <bking@cumin2002> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002

Change #1151797 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] sre.elasticsearch: remove unused cookbooks

https://gerrit.wikimedia.org/r/1151797

Mentioned in SAL (#wikimedia-operations) [2025-06-06T17:06:55Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002

Mentioned in SAL (#wikimedia-operations) [2025-06-06T17:08:51Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002

Mentioned in SAL (#wikimedia-operations) [2025-06-06T17:20:21Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002

Mentioned in SAL (#wikimedia-operations) [2025-06-06T19:11:22Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002

Blocking on T390860, as the changes associated with that task will affect all of the above cookbooks.

bking removed bking as the assignee of this task. Jul 8 2025, 2:52 PM
bking updated Other Assignee, removed: RKemper.