⚓ T268825 decommission es1017.eqiad.wmnet

	Subject	Repo	Branch	Lines +/-
	mariadb: Decommission es1017	operations/puppet	production	+0 -14
	es1017: Remove from dbctl	operations/puppet	production	+0 -1

• Marostegui assigned this task to LSobanski. Nov 26 2020, 1:43 PM

• Marostegui moved this task from Triage to Ready on the DBA board.

• Marostegui updated the task description. Nov 27 2020, 1:30 PM

I have depooled this host to give it a kernel upgrade for T264154 (I won't repool it anymore).

• Marostegui updated the task description. Dec 1 2020, 6:07 AM

This host was rebooted and, as expected, never came back. The iDRAC also doesn't work...

• Marostegui updated the task description. Dec 1 2020, 6:16 AM

Change 644408 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] es1017: Remove from dbctl

https://gerrit.wikimedia.org/r/644408

Change 644408 merged by Marostegui:
[operations/puppet@production] es1017: Remove from dbctl

https://gerrit.wikimedia.org/r/644408

Mentioned in SAL (#wikimedia-operations) [2020-12-01T06:54:20Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove es1017 from dbctl T268825', diff saved to https://phabricator.wikimedia.org/P13490 and previous config saved to /var/cache/conftool/dbconfig/20201201-065419-marostegui.json

• Marostegui updated the task description. Dec 1 2020, 6:54 AM

Maintenance_bot removed a project: Patch-For-Review. Dec 1 2020, 7:10 AM

• Marostegui moved this task from Backlog to Blocked on Service Owners on the decommission-hardware board. Dec 1 2020, 9:43 AM

• Marostegui claimed this task. Dec 1 2020, 9:46 AM

• Marostegui added a subscriber: LSobanski.

• Marostegui moved this task from Ready to In progress on the DBA board. Dec 2 2020, 6:14 AM

Change 644675 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Decommission es1017

https://gerrit.wikimedia.org/r/644675

gerritbot added a project: Patch-For-Review. Dec 2 2020, 6:16 AM

@Volans this host has the mgmt interface down (and most likely broken), so, as expected, the bootloaders cannot be wiped. How should we proceed with those issues?

The host is off by the way, it never came back from a reboot a few days ago.

cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: es1017.eqiad.wmnet

es1017.eqiad.wmnet (FAIL)
- Failed downtime host on Icinga (likely already removed)
- Found physical host
- Skipped downtime management interface on Icinga (likely already removed)
- Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
- Failed to power off, manual intervention required: Remote IPMI for es1017.mgmt.eqiad.wmnet failed (exit=1): b''
- Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above
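The cookbook summary above marked failed steps in bold, which this plain-text log no longer shows. As a rough aid for reading it, the sketch below groups the step lines by their leading word. The prefix rule (`Failed`/`Unable` means failure, `Skipped` means skipped) and the name `classify_steps` are assumptions made for illustration; they are not part of the decommission cookbook.

```python
# Step lines copied from the cookbook summary in this task.
SUMMARY = """\
- Failed downtime host on Icinga (likely already removed)
- Found physical host
- Skipped downtime management interface on Icinga (likely already removed)
- Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
- Failed to power off, manual intervention required: Remote IPMI for es1017.mgmt.eqiad.wmnet failed (exit=1): b''
- Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
"""

# Assumption: a step counts as failed when its text starts with one of these words.
FAILURE_PREFIXES = ("Failed", "Unable")


def classify_steps(summary: str) -> dict[str, list[str]]:
    """Group '- step' lines into failed / skipped / ok by their first word."""
    result: dict[str, list[str]] = {"failed": [], "skipped": [], "ok": []}
    for raw in summary.splitlines():
        line = raw.strip()
        if not line.startswith("- "):
            continue  # ignore anything that is not a step line
        step = line[2:]
        if step.startswith(FAILURE_PREFIXES):
            result["failed"].append(step)
        elif step.startswith("Skipped"):
            result["skipped"].append(step)
        else:
            result["ok"].append(step)
    return result


if __name__ == "__main__":
    for status, steps in classify_steps(SUMMARY).items():
        print(f"{status}: {len(steps)}")
        for step in steps:
            print(f"  {step}")
```

With this grouping, the two steps that need follow-up on this host (bootloader wipe and power off) appear together under `failed`, next to the Icinga downtime that the log itself describes as likely already removed.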

Mentioned in SAL (#wikimedia-operations) [2020-12-02T06:54:50Z] <marostegui> Remove es1017 from tendril and zarcillo T268825

Change 644675 merged by Marostegui:
[operations/puppet@production] mariadb: Decommission es1017

https://gerrit.wikimedia.org/r/644675

Maintenance_bot removed a project: Patch-For-Review. Dec 2 2020, 7:10 AM

• Marostegui updated the task description. Dec 2 2020, 7:23 AM

@Marostegui currently the wipe of bootloaders is done from the OS, not from the mgmt interface, so it can't be done if the host is already down or broken, but it is not affected by the mgmt console not working. The step is there mainly to make sure the host can't come back and reuse its old configured IPs. So if the host is already broken there is no real risk and it's OK as is. The actual disk wipe is still performed by DC-Ops separately.
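The reasoning in the comment above can be written down as a small decision function. This is only an illustrative sketch: `WipeDecision` and `decide_bootloader_wipe` are hypothetical names, not part of the decommission cookbook, and treating an unreachable OS as "host cannot boot" is an assumption that holds for es1017 but not necessarily for every host.

```python
from enum import Enum


class WipeDecision(Enum):
    WIPE_FROM_OS = "wipe bootloaders from the running OS"
    SKIP_LOW_RISK = "skip: host is down, so it cannot come back with its old IPs"


def decide_bootloader_wipe(os_reachable: bool, mgmt_reachable: bool) -> WipeDecision:
    """Mirror the explanation in the comment: only OS reachability matters.

    mgmt_reachable is accepted only to make explicit that the state of the
    management console does not change the outcome.
    """
    if os_reachable:
        return WipeDecision.WIPE_FROM_OS
    # Assumption: the OS is unreachable because the host is actually broken,
    # as was the case for es1017.
    return WipeDecision.SKIP_LOW_RISK


if __name__ == "__main__":
    # es1017: OS down, mgmt down
    print(decide_bootloader_wipe(os_reachable=False, mgmt_reachable=False).value)
```

Either way, the full disk wipe remains a separate DC-Ops step, as noted in the comment.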

Ah, excellent @Volans - thanks for clarifying that.

• Marostegui updated the task description. Dec 2 2020, 9:03 AM

Ready for DC-Ops

Restricted Application added a project: SRE. Dec 2 2020, 9:06 AM

For the record, the homer run:

# homer asw2-c-eqiad* commit "T268825"
INFO:homer.devices:Initialized 35 devices
INFO:homer:Committing config for query asw2-c-eqiad* with message: T268825
INFO:homer:Gathering global Netbox data
INFO:homer.devices:Matched 1 device(s) for query 'asw2-c-eqiad*'
INFO:homer:Generating configuration for asw2-c-eqiad.mgmt.eqiad.wmnet
Configuration diff for asw2-c-eqiad.mgmt.eqiad.wmnet:

[edit interfaces interface-range disabled]
     member ge-3/0/13 { ... }
+    member ge-3/0/18;
     member ge-3/0/22 { ... }
[edit interfaces interface-range vlan-private1-c-eqiad]
-    member ge-3/0/18;
[edit interfaces]
-   ge-3/0/18 {
-       description "es1017:eno1 {#}";
-   }

Type "yes" to commit, "no" to abort.
> yes
INFO:homer.transports.junos:Committing the configuration on asw2-c-eqiad.mgmt.eqiad.wmnet
INFO:homer:Homer run completed successfully on 1 devices: ['asw2-c-eqiad.mgmt.eqiad.wmnet']
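To make the switch change in the homer diff easier to read, here is a minimal sketch that extracts which ports were added to or removed from each `interface-range`. The function name `range_membership_changes` and the regular expressions are assumptions written against the diff text shown in this task; this is not a homer or Junos API.

```python
import re

# Configuration diff copied from the homer run in this task.
HOMER_DIFF = """\
[edit interfaces interface-range disabled]
     member ge-3/0/13 { ... }
+    member ge-3/0/18;
     member ge-3/0/22 { ... }
[edit interfaces interface-range vlan-private1-c-eqiad]
-    member ge-3/0/18;
[edit interfaces]
-   ge-3/0/18 {
-       description "es1017:eno1 {#}";
-   }
"""

RANGE_HEADER = re.compile(r"\[edit interfaces interface-range (\S+)\]")
MEMBER_CHANGE = re.compile(r"([+-])\s+member (\S+);")


def range_membership_changes(diff: str) -> tuple[dict[str, list[str]], dict[str, list[str]]]:
    """Return (added, removed): interface-range name -> list of member ports."""
    added: dict[str, list[str]] = {}
    removed: dict[str, list[str]] = {}
    current = None  # interface-range currently being edited, if any
    for line in diff.splitlines():
        header = RANGE_HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        if line.startswith("["):
            current = None  # some other [edit ...] section
            continue
        change = MEMBER_CHANGE.match(line)
        if change and current:
            target = added if change.group(1) == "+" else removed
            target.setdefault(current, []).append(change.group(2))
    return added, removed


if __name__ == "__main__":
    added, removed = range_membership_changes(HOMER_DIFF)
    print("added:", added)
    print("removed:", removed)
```

For this diff the result shows `ge-3/0/18` leaving `vlan-private1-c-eqiad` and joining `disabled`, which matches the port that carried the `es1017:eno1` description.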

• Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board. Dec 2 2020, 2:26 PM

wiki_willy reassigned this task from wiki_willy to Cmjohnson. Dec 2 2020, 4:25 PM

wiki_willy subscribed.

Removed from rack, updated Netbox, and ran the script; confirmed network ports were already removed.

decommission es1017.eqiad.wmnet
Closed, Resolved · Public · Request