
Decommission elastic2037-2054
Closed, Resolved · Public

Description

Hosts elastic2037-2054 have reached the end of their 5-year lifespan.

Creating this ticket to track their decommissioning.

Event Timeline

Gehel triaged this task as High priority. Mar 5 2024, 10:42 AM

Change #1013398 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic-codfw: Add new master-eligibles

https://gerrit.wikimedia.org/r/1013398

Change #1013398 merged by Bking:

[operations/puppet@production] elastic-codfw: Add new master-eligibles

https://gerrit.wikimedia.org/r/1013398
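
A master-eligible node is one permitted to participate in master elections. As a rough illustration only (the exact keys depend on the Elasticsearch version, and WMF's role assignment is managed through Puppet rather than hand-edited YAML):

```yaml
# elasticsearch.yml — illustrative sketch, not the actual Puppet-managed config
# Elasticsearch 7.9+ style:
node.roles: [ master, data ]
# Older style (pre-7.9):
# node.master: true
# node.data: true
```

Adding replacement master-eligibles before removing the old ones keeps a quorum of masters available throughout the decommission.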

Change #1013401 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: move elastic2037 to insetup

https://gerrit.wikimedia.org/r/1013401

Change #1013401 merged by Bking:

[operations/puppet@production] elastic: move elastic2037 to insetup

https://gerrit.wikimedia.org/r/1013401

Mentioned in SAL (#wikimedia-operations) [2024-03-22T06:22:19Z] <ryankemper> T358882 Updating cross-cluster seeds to bring into concordance with newly added masters: ryankemper@mwmaint1002:~/elastic$ python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9643/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst

Mentioned in SAL (#wikimedia-operations) [2024-03-22T06:33:10Z] <ryankemper> T358882 Also updated cross-cluster seeds for ports 9243 and 9443. Everything should be as expected now.
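
Cross-cluster seeds are registered through the standard Elasticsearch `_cluster/settings` API under `cluster.remote.<alias>.seeds`. The `push_cross_cluster_conf.py` script referenced above is WMF-internal and its implementation is not shown here; the sketch below only illustrates the shape of the payload such a script would PUT (hostnames, ports, and the helper name are hypothetical):

```python
def build_remote_seeds_settings(seeds_by_cluster):
    """Build a _cluster/settings body registering cross-cluster remote
    seed nodes. Hypothetical helper; mirrors the documented
    cluster.remote.<alias>.seeds setting, not the internal script."""
    return {
        "persistent": {
            "cluster": {
                "remote": {
                    alias: {"seeds": seeds}
                    for alias, seeds in seeds_by_cluster.items()
                }
            }
        }
    }

# Example seed host/port — illustrative, not the real master list
payload = build_remote_seeds_settings(
    {"omega": ["elastic2055.codfw.wmnet:9500"]}
)
# payload would be PUT to e.g.
# https://search.svc.codfw.wmnet:9643/_cluster/settings
```

When masters change, these seed lists must be refreshed on every cluster that treats the changed cluster as a remote, which is why separate updates for ports 9243, 9443, and 9643 appear above.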

bking reopened this task as In Progress.
bking claimed this task.
bking lowered the priority of this task from High to Medium.
bking updated Other Assignee, added: RKemper.

Change #1014600 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: replace some masters

https://gerrit.wikimedia.org/r/1014600

Change #1014600 merged by Ryan Kemper:

[operations/puppet@production] elastic: replace some masters

https://gerrit.wikimedia.org/r/1014600

Mentioned in SAL (#wikimedia-operations) [2024-03-26T20:09:20Z] <ryankemper@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: cycle some masters - ryankemper@cumin2002 - T358882

Mentioned in SAL (#wikimedia-operations) [2024-03-26T21:45:32Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: cycle some masters - ryankemper@cumin2002 - T358882

Mentioned in SAL (#wikimedia-operations) [2024-03-27T01:06:46Z] <ryankemper> T358882 Updated remote cluster seeds for new master state

Mentioned in SAL (#wikimedia-operations) [2024-03-27T01:35:23Z] <ryankemper@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2037*,elastic2038*,elastic2041*,elastic2042*,elastic2045*,elastic2046*,elastic2047*,elastic2050*,elastic2051*,elastic2052*,elastic2039*,elastic2040*,elastic2043*,elastic2044*,elastic2048*,elastic2053*,elastic2054* for prepare for decom of hosts - ryankemper@cumin2002 - T358882

Mentioned in SAL (#wikimedia-operations) [2024-03-27T01:35:28Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2037*,elastic2038*,elastic2041*,elastic2042*,elastic2045*,elastic2046*,elastic2047*,elastic2050*,elastic2051*,elastic2052*,elastic2039*,elastic2040*,elastic2043*,elastic2044*,elastic2048*,elastic2053*,elastic2054* for prepare for decom of hosts - ryankemper@cumin2002 - T358882
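
Banning nodes before decommissioning drains their shards so data is rebalanced onto the remaining hosts. The `sre.elasticsearch.ban` cookbook's internals are not shown here; as a sketch of the effect, Elasticsearch's allocation-filtering setting excludes nodes by name pattern (whether the cookbook uses `persistent` or `transient` settings, and `_name` vs. `_host`, is an assumption):

```python
def build_ban_settings(hostname_patterns):
    """Build a _cluster/settings body excluding the named nodes from
    shard allocation, so their shards drain away. Sketch of the
    documented cluster.routing.allocation.exclude setting only."""
    return {
        "persistent": {
            "cluster.routing.allocation.exclude._name": ",".join(hostname_patterns)
        }
    }

body = build_ban_settings(["elastic2037*", "elastic2038*"])
```

Once the cluster reports the banned nodes hold no shards, they can be powered off safely.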

Netbox reports that elastic2037 is no longer in PuppetDB; please either decommission it or shut it down. No host should remain powered on without Puppet running for an extended period of time.

Mentioned in SAL (#wikimedia-operations) [2024-03-27T15:51:40Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 12 days, 0:00:00 on elastic2038.codfw.wmnet with reason: T358882

Mentioned in SAL (#wikimedia-operations) [2024-03-27T15:51:44Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12 days, 0:00:00 on elastic2038.codfw.wmnet with reason: T358882

Change #1015119 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elasticsearch: remove soon-to-be-decommed codfw hosts

https://gerrit.wikimedia.org/r/1015119

Change #1015119 merged by Bking:

[operations/puppet@production] elasticsearch: remove soon-to-be-decommed codfw hosts

https://gerrit.wikimedia.org/r/1015119

Change #1015123 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: decom elastic20[37-54]

https://gerrit.wikimedia.org/r/1015123

Change #1015123 merged by Bking:

[operations/puppet@production] elastic: decom elastic20[37-54]

https://gerrit.wikimedia.org/r/1015123

cookbooks.sre.hosts.decommission executed by ryankemper@cumin2002 for hosts: elastic[2052-2054].codfw.wmnet

  • elastic2052.codfw.wmnet (FAIL)
    • Missing DNSName in Netbox for elastic2052, unable to verify it.
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.193.3.58
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic2053.codfw.wmnet (FAIL)
    • Missing DNSName in Netbox for elastic2053, unable to verify it.
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.193.3.59
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic2054.codfw.wmnet (FAIL)
    • Missing DNSName in Netbox for elastic2054, unable to verify it.
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.193.3.60
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by ryankemper@cumin2002 for hosts: elastic2037.codfw.wmnet

  • elastic2037.codfw.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

With dc-ops having closed out the decom subtask, this should be all done.