decommission elastic10[18-31].eqiad.wmnet
Closed, Resolved; Public; Request

Assigned To

	Jclark-ctr

Authored By

	Gehel
	Dec 4 2019, 2:31 PM

Details

Subject	Repo	Branch	Lines +/-
Removing mgmt dns for asset tags associated w/elastic1020-1031	operations/dns	master	+0 -21
Removing mgmt dns asset tags associated w/elastic1017-1020	operations/dns	master	+1 -8
sre.hosts.decommission: avoid race condition	operations/cookbooks	master	+4 -0
elasticsearch: decommission elastic[1018-1031]	operations/dns	master	+0 -26
elasticsearch: decommission elastic10[18-31]	operations/puppet	production	+0 -77
search: decommission elastic10[18-31].eqiad.wmnet	operations/puppet	production	+15 -52


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Gehel	T221630 [Epic] Search platform - Hardware requests for 2019-2020
		Resolved	Request	Jclark-ctr	T239821 decommission elastic10[18-31].eqiad.wmnet

Event Timeline

Gehel created this task. Dec 4 2019, 2:31 PM

Gehel added a parent task: T221630: [Epic] Search platform - Hardware requests for 2019-2020.

Gehel renamed this task from decommission elastic10[17-31].eqiad.wmnet to decommission elastic10[18-31].eqiad.wmnet. Dec 4 2019, 2:42 PM

Gehel updated the task description.

Change 554517 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] search: decommission elastic10[18-31].eqiad.wmnet

https://gerrit.wikimedia.org/r/554517

gerritbot added a project: Patch-For-Review. Dec 4 2019, 2:47 PM

Change 554517 merged by Gehel:
[operations/puppet@production] search: decommission elastic10[18-31].eqiad.wmnet

https://gerrit.wikimedia.org/r/554517

cookbooks.sre.hosts.decommission executed by gehel@cumin1001 for hosts: elastic1018.eqiad.wmnet

elastic1018.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Downtimed management interface on Icinga
- Wiped bootloaders
- Powered off
- Set Netbox status to Decommissioning
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by gehel@cumin1001 for hosts: elastic[1019-1020,1022-1031].eqiad.wmnet

elastic1019.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Downtimed management interface on Icinga
- Wiped bootloaders
- Powered off
- Set Netbox status to Decommissioning
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
elastic1020.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Downtimed management interface on Icinga
- Wiped bootloaders
- Powered off
- Set Netbox status to Decommissioning
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
elastic1022.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Downtimed management interface on Icinga
- Wiped bootloaders
- Powered off
- Set Netbox status to Decommissioning
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
elastic1023.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Downtimed management interface on Icinga
- Wiped bootloaders
- Powered off
- Set Netbox status to Decommissioning
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
elastic1024.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Downtimed management interface on Icinga
- Wiped bootloaders
- Powered off
- Set Netbox status to Decommissioning
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
elastic1025.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Downtimed management interface on Icinga
- Wiped bootloaders
- Powered off
- Set Netbox status to Decommissioning
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
elastic1026.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Downtimed management interface on Icinga
- Wiped bootloaders
- Powered off
- Set Netbox status to Decommissioning
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
elastic1027.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Downtimed management interface on Icinga
- Wiped bootloaders
- Powered off
- Set Netbox status to Decommissioning
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
elastic1028.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Downtimed management interface on Icinga
- Wiped bootloaders
- Powered off
- Set Netbox status to Decommissioning
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
elastic1029.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Downtimed management interface on Icinga
- Wiped bootloaders
- Powered off
- Set Netbox status to Decommissioning
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
elastic1030.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Downtimed management interface on Icinga
- Wiped bootloaders
- Powered off
- Set Netbox status to Decommissioning
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
elastic1031.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Downtimed management interface on Icinga
- Wiped bootloaders
- Powered off
- Set Netbox status to Decommissioning
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
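
The host expressions in the cookbook invocations above use a bracketed range syntax. As a rough illustration of what such an expression covers, here is a minimal sketch of an expander. It is not Cumin's parser: the function name `expand_hosts` is hypothetical, and it assumes at most one bracket group containing comma-separated numbers or inclusive ranges.

```python
import re
from typing import List


def expand_hosts(expression: str) -> List[str]:
    """Expand a simplified host expression into individual host names.

    Hypothetical helper for illustration only; it handles a single
    bracket group such as 'elastic[1019-1020,1022-1031].eqiad.wmnet'.
    """
    match = re.fullmatch(
        r"(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\](?P<suffix>[^\[\]]*)",
        expression,
    )
    if match is None:
        # No bracket group: treat the expression as a single host name.
        return [expression]

    hosts = []
    for part in match["body"].split(","):
        # 'a-b' is an inclusive range; a bare 'a' is a one-element range.
        start, _, end = part.partition("-")
        end = end or start
        width = len(start)  # preserve zero padding, if any
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{match['prefix']}{number:0{width}d}{match['suffix']}")
    return hosts


if __name__ == "__main__":
    for host in expand_hosts("elastic[1019-1020,1022-1031].eqiad.wmnet"):
        print(host)
```

Under these assumptions, the second invocation's expression expands to 12 hosts and does not include elastic1021, which also has no entry in the log shown here.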

Change 558521 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: decommission elastic10[18-31]

https://gerrit.wikimedia.org/r/558521

Change 558521 merged by Gehel:
[operations/puppet@production] elasticsearch: decommission elastic10[18-31]

https://gerrit.wikimedia.org/r/558521

Maintenance_bot removed a project: Patch-For-Review. Dec 17 2019, 2:10 PM

Change 558525 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/dns@master] elasticsearch: decommission elastic[1018-1031]

https://gerrit.wikimedia.org/r/558525

gerritbot added a project: Patch-For-Review. Dec 17 2019, 2:18 PM

Change 558525 merged by Gehel:
[operations/dns@master] elasticsearch: decommission elastic[1018-1031]

https://gerrit.wikimedia.org/r/558525

Gehel assigned this task to Jclark-ctr. Dec 18 2019, 8:48 AM

Gehel updated the task description.

Gehel moved this task from Incoming to Needs Reporting on the Discovery-Search (Current work) board.

Maintenance_bot removed a project: Patch-For-Review. Dec 18 2019, 9:10 AM

TJones closed this task as Resolved. Dec 18 2019, 4:36 PM

I noticed that elastic1019 is still in PuppetDB (see the cumin query below); maybe the decom cookbook failed there?

jmm@cumin2001:~$ sudo cumin elastic101*
1 hosts will be targeted:
elastic1019.eqiad.wmnet
DRY-RUN mode enabled, aborting

@MoritzMuehlenhoff mmmh, according to T239821#5747654 it all worked fine. LMK if I should investigate.

I had a quick look; it's the rare race we've seen before. The cookbook executed "puppet node deactivate" at 13:51:32, and at 13:51:34 elastic1019 submitted a last store report (which was initiated before the deactivate arrived), which is why the host still shows up in PuppetDB. I'll re-run "puppet node deactivate" manually to clean this up.
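
To make the ordering problem concrete, here is a toy model of the race described above. It is only a sketch: `NodeRecord` is a hypothetical class, not PuppetDB code, and it assumes that any data stored after a deactivation makes the node active again. The date part of the timestamps is a placeholder, since only the times of day are given.

```python
from datetime import datetime
from typing import Optional


class NodeRecord:
    """Toy stand-in for a node entry; not PuppetDB's actual data model."""

    def __init__(self) -> None:
        self.deactivated_at: Optional[datetime] = None

    def deactivate(self, when: datetime) -> None:
        self.deactivated_at = when

    def store_report(self, when: datetime) -> None:
        # Assumption: a submission newer than the deactivation
        # reactivates the node.
        if self.deactivated_at is not None and when > self.deactivated_at:
            self.deactivated_at = None

    @property
    def active(self) -> bool:
        return self.deactivated_at is None


def at(clock: str) -> datetime:
    """Parse an HH:MM:SS time of day (the date defaults to a placeholder)."""
    return datetime.strptime(clock, "%H:%M:%S")


if __name__ == "__main__":
    # Order observed on elastic1019: deactivate first, late report second.
    node = NodeRecord()
    node.deactivate(at("13:51:32"))
    node.store_report(at("13:51:34"))
    print("active after late report:", node.active)
```

In this model the node ends up active again after the late report, which matches the symptom seen with the cumin query; had the report landed before the deactivate, the node would have stayed deactivated.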

Interesting. Given that the new cookbook kills the hosts, that was unexpected, but the cookbook is very quick, so I get why it happens.
My suggestion is to add a small sleep (with a log line to tell the user) before this line: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/hosts/decommission.py#167
A delay of 10 to 30 seconds should probably be enough to let any in-flight action finish before the remaining actions run.

In T239821#5760345, @Volans wrote:

Interesting. Given that the new cookbook kills the hosts, that was unexpected, but the cookbook is very quick, so I get why it happens.
My suggestion is to add a small sleep (with a log line to tell the user) before this line: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/hosts/decommission.py#167

That sounds good to me! 20s is probably a good compromise.
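
The suggested delay could look roughly like the sketch below. This is not the merged patch: `deactivate_after_grace` and its parameters are hypothetical names, the deactivation step is passed in as a callable rather than calling any real cookbook API, and the 20-second default simply follows the value discussed here.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


def deactivate_after_grace(
    deactivate: Callable[[], None],
    grace_seconds: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Wait for in-flight Puppet runs to finish, then deactivate the host.

    Hypothetical helper: `deactivate` stands in for whatever call removes
    the host from the Puppet master and PuppetDB.
    """
    # Tell the user why the cookbook appears to pause.
    logger.info(
        "Sleeping for %s seconds to let in-flight Puppet runs complete "
        "before deactivating the host",
        grace_seconds,
    )
    sleep(grace_seconds)
    deactivate()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Short grace period here so the demonstration finishes quickly.
    deactivate_after_grace(lambda: print("host deactivated"), grace_seconds=1)
```

Passing `sleep` as a parameter is a design choice for testability: a test can substitute a no-op and check the order of calls without actually waiting.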

Change 560381 had a related patch set uploaded (by Volans; owner: Volans):
[operations/cookbooks@master] sre.hosts.decommission: avoid race condition

https://gerrit.wikimedia.org/r/560381

gerritbot added a project: Patch-For-Review. Dec 23 2019, 10:01 AM

Mentioned in SAL (#wikimedia-operations) [2019-12-23T10:06:32Z] <moritzm> removing elastic1019 from puppetdb T239821

Change 560381 merged by jenkins-bot:
[operations/cookbooks@master] sre.hosts.decommission: avoid race condition

https://gerrit.wikimedia.org/r/560381

Maintenance_bot removed a project: Patch-For-Review. Dec 23 2019, 10:10 AM

Gehel mentioned this in T233578: hw troubleshooting: Memory correctable errors -EDAC- for elastic1029.eqiad.wmnet. May 8 2020, 8:04 PM

Change 597890 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt dns asset tags associated w/elastic1017-1020

https://gerrit.wikimedia.org/r/597890

gerritbot added a project: Patch-For-Review. May 21 2020, 11:15 PM

This has not been completed.

Change 597890 merged by Cmjohnson:
[operations/dns@master] Removing mgmt dns asset tags associated w/elastic1017-1020

https://gerrit.wikimedia.org/r/597890

Change 597894 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt dns for asset tags associated w/elastic1020-1031

https://gerrit.wikimedia.org/r/597894

Cmjohnson added a project: ops-eqiad. May 21 2020, 11:36 PM

Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board.

Cmjohnson updated the task description. May 21 2020, 11:44 PM

Change 597894 merged by Cmjohnson:
[operations/dns@master] Removing mgmt dns for asset tags associated w/elastic1020-1031

https://gerrit.wikimedia.org/r/597894

Maintenance_bot removed a project: Patch-For-Review. May 29 2020, 1:10 PM

All of these servers have been removed from the racks and the network switches, and set to offline in Netbox.

Volans mentioned this in rCCKB74301defcd28: sre.hosts.decommission: avoid race condition. Dec 14 2022, 3:25 PM