Ensure our automation can ban not-yet-existing hosts
Closed, Resolved (Public)

Description

As we migrate from Elasticsearch to OpenSearch, we need to ban the OpenSearch hosts before they join the cluster. That's because primary shards hosted by OpenSearch nodes cannot replicate to Elasticsearch nodes. However, our current ban cookbook can only ban nodes that are already in the cluster.

Creating this ticket to:

  • create automation that can ban hosts not already in the cluster
  • confirm operation

Event Timeline

Gehel triaged this task as High priority. Apr 7 2025, 1:36 PM
Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.
bking changed the task status from Open to In Progress. Apr 7 2025, 2:01 PM
bking claimed this task.

Mentioned in SAL (#wikimedia-operations) [2025-04-07T16:08:19Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1004* for test ban syntax - bking@cumin2002 - T391151

Mentioned in SAL (#wikimedia-operations) [2025-04-07T16:08:22Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1004* for test ban syntax - bking@cumin2002 - T391151

Mentioned in SAL (#wikimedia-operations) [2025-04-07T16:30:10Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1003* for test ban syntax - bking@cumin2002 - T391151

Mentioned in SAL (#wikimedia-operations) [2025-04-07T16:30:12Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1003* for test ban syntax - bking@cumin2002 - T391151

I've written a playbook that can do the banning/unbanning.

We'll need to rewrite the ban cookbook at some point. That will involve refactoring the elasticsearch.py library in Spicerack: that code relies entirely on information from the Elastic API, which only knows about nodes already in the cluster, so it won't work for this use case.
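
For illustration, here is a minimal sketch of a ban that never consults cluster state. It is not the playbook mentioned above. It assumes bans are expressed through the standard cluster-level allocation filter `cluster.routing.allocation.exclude._name`, applied via `PUT /_cluster/settings` as a persistent setting. The helper names (`merge_ban`, `remove_ban`, `build_settings_body`, `apply_settings`) and the base URL passed in are hypothetical. Because the filter value is just a list of name patterns, it can include hosts that have not joined yet.

```python
import json
import urllib.request

# Assumption: bans use the allocation filter on node names; the ticket does
# not say which attribute (_name, _host, _ip) the real playbook uses.
EXCLUDE_KEY = "cluster.routing.allocation.exclude._name"


def merge_ban(current: str, patterns: list[str]) -> str:
    """Add patterns to a comma-separated exclude value, skipping duplicates."""
    existing = [p for p in current.split(",") if p]
    for pattern in patterns:
        if pattern not in existing:
            existing.append(pattern)
    return ",".join(existing)


def remove_ban(current: str, patterns: list[str]) -> str:
    """Drop patterns from a comma-separated exclude value (unban)."""
    return ",".join(p for p in current.split(",") if p and p not in patterns)


def build_settings_body(value: str) -> dict:
    """Build the cluster settings body; an empty value becomes null, which resets the setting."""
    # Assumption: a persistent setting is wanted; "transient" is the alternative.
    return {"persistent": {EXCLUDE_KEY: value or None}}


def apply_settings(base_url: str, body: dict) -> bytes:
    """Send the body to the cluster settings endpoint (base_url is hypothetical)."""
    request = urllib.request.Request(
        f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Nothing here checks whether a pattern matches a known node, which is the point: `merge_ban("", ["relforge1004*"])` yields a value that can be applied before the host exists. The caller is expected to read the current exclude value first so that existing bans are preserved.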

bking renamed this task from Ensure ban.py cookbook can ban not-yet-existing hosts to Ensure our automation can ban not-yet-existing hosts. Apr 15 2025, 7:11 PM
bking updated the task description.

Since we have a good-enough solution for now, I'm going to close this out. Long-term, we need to change our approach, as the current Spicerack code is causing problems, and not just for us. I'll create a separate ticket for that issue. Closing...