
Create tools for banning/unbanning Elastic nodes
Closed, Resolved | Public | 3 Estimated Story Points

Description

While working through T322082, I had to ban and unban Elastic nodes. No tooling exists for this yet, so the process is entirely manual, error-prone, and tedious. This ticket is to:

  • Create a cookbook, script, and/or playbook to quickly and safely ban nodes
  • Create an alert for nodes that remain banned past a certain time threshold
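
As a rough illustration of what the ban/unban step might involve, here is a minimal sketch that builds request bodies for Elasticsearch's `_cluster/settings` API using the `cluster.routing.allocation.exclude._name` allocation filter. This is an assumption about the mechanism, not the cookbook's actual code: the helper names, the choice of `transient` settings, and the node names in the usage note are all hypothetical.

```python
import json

# Assumption: nodes are banned by name through the cluster-level allocation
# filter. A real tool might filter on _host or _ip instead.
EXCLUDE_SETTING = "cluster.routing.allocation.exclude._name"


def build_ban_body(current_banned, to_ban):
    """Return a _cluster/settings body that adds to_ban to the exclusion list."""
    merged = sorted(set(current_banned) | set(to_ban))
    return {"transient": {EXCLUDE_SETTING: ",".join(merged)}}


def build_unban_body(current_banned, to_unban):
    """Return a _cluster/settings body that removes to_unban from the list."""
    remaining = sorted(set(current_banned) - set(to_unban))
    # None serializes to JSON null, which resets the setting to its default.
    value = ",".join(remaining) if remaining else None
    return {"transient": {EXCLUDE_SETTING: value}}


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    print(json.dumps(build_ban_body(["elastic1001"], ["elastic1002"])))
```

The resulting JSON would be sent with a PUT to `_cluster/settings`. Merging with the current list, rather than overwriting it, avoids accidentally unbanning nodes that someone else banned.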

Event Timeline

MPhamWMF moved this task from needs triage to Ops / SRE on the Discovery-Search board.

Change 899789 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] sre.elasticsearch.rolling-operation: modernize log msgs

https://gerrit.wikimedia.org/r/899789

Change 899789 merged by Ryan Kemper:

[operations/cookbooks@master] sre.elasticsearch.rolling-operation: modernize log msgs

https://gerrit.wikimedia.org/r/899789

API call to list each node's row attribute: curl -s http://localhost:9200/_nodes/stats | jq '.nodes[].attributes.row'

Change 901298 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] elastic: [WIP] Add node-banning cookbook

https://gerrit.wikimedia.org/r/901298

Will probably use this line as inspiration for expanding given input into discrete hosts.
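
One way such an expansion could look, as a minimal stdlib-only sketch. The bracketed range syntax and the `expand_hosts` helper are assumptions for illustration and may not match the referenced code or the cookbook's real input format.

```python
import re

# Matches a single numeric range such as "[1053-1055]".
_RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_hosts(expr):
    """Expand the first "[start-end]" range in expr into discrete hostnames.

    An expression without a range is returned unchanged as a one-item list.
    Only the first range is handled; this sketch does not support nesting
    or comma-separated sets.
    """
    match = _RANGE.search(expr)
    if match is None:
        return [expr]
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve zero padding, e.g. "[01-03]"
    prefix, suffix = expr[:match.start()], expr[match.end():]
    return [
        prefix + str(number).zfill(width) + suffix
        for number in range(int(start), int(end) + 1)
    ]
```

For example, `expand_hosts("elastic[1053-1055].eqiad.wmnet")` would yield three hostnames, one per number in the range.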

Change 902502 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] elasticsearch: [WIP] Add node ban logic

https://gerrit.wikimedia.org/r/902502

bking set the point value for this task to 3. Mar 27 2023, 3:27 PM
bking moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.

Quickly find the nodes in a given row (here, row B) using curl/jq: curl -s https://search.svc.eqiad.wmnet:9243/_nodes/stats | jq '.nodes[] | select(.attributes.row == "B") | .name' | sort
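
The same selection could be done in Python on an already-parsed `_nodes/stats` response, which may be handy inside a cookbook. This is a sketch only: `nodes_in_row` is a hypothetical helper, and the sample data in the usage note is made up.

```python
def nodes_in_row(stats, row):
    """Return the sorted names of nodes whose "row" attribute equals row.

    stats is the parsed JSON body of a _nodes/stats response, where "nodes"
    maps node IDs to per-node objects.
    """
    return sorted(
        node["name"]
        for node in stats.get("nodes", {}).values()
        if node.get("attributes", {}).get("row") == row
    )


if __name__ == "__main__":
    # Hypothetical response fragment, for illustration only.
    sample = {
        "nodes": {
            "id1": {"name": "elastic1002", "attributes": {"row": "B"}},
            "id2": {"name": "elastic1001", "attributes": {"row": "B"}},
            "id3": {"name": "elastic1003", "attributes": {"row": "C"}},
        }
    }
    print(nodes_in_row(sample, "B"))
```

Nodes that lack a row attribute are skipped rather than raising an error.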

bking updated Other Assignee, added: RKemper.

Change 901298 abandoned by Ryan Kemper:

[operations/cookbooks@master] elastic: [WIP] Add node-banning cookbook

Reason:

abandoned in favor of https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/902502

https://gerrit.wikimedia.org/r/901298

Change 902502 merged by jenkins-bot:

[operations/cookbooks@master] elasticsearch: Add node ban logic

https://gerrit.wikimedia.org/r/902502

Change 910037 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] elasticsearch: handle cloudelastic URLs

https://gerrit.wikimedia.org/r/910037

Change 910037 merged by jenkins-bot:

[operations/cookbooks@master] elasticsearch: handle cloudelastic URLs

https://gerrit.wikimedia.org/r/910037

Change 917944 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] sre.elasticsearch.ban: Improve error message

https://gerrit.wikimedia.org/r/917944

Change 917944 merged by Bking:

[operations/cookbooks@master] sre.elasticsearch.ban: Improve error message

https://gerrit.wikimedia.org/r/917944