T316285 decommission cloudservices1003.wikimedia.org

	Subject	Repo	Branch	Lines +/-
	Remove refs to cloudservices1003	operations/puppet	production	+1 -5

Andrew created this task. Aug 25 2022, 7:23 PM

Restricted Application added a project: cloud-services-team (Kanban). Aug 25 2022, 7:23 PM

Andrew added a parent task: T304888: Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts. Aug 25 2022, 7:24 PM

Change 826645 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Remove refs to cloudservices1003

https://gerrit.wikimedia.org/r/826645

gerritbot added a project: Patch-For-Review. Aug 25 2022, 7:27 PM

Change 826645 merged by Andrew Bogott:

[operations/puppet@production] Remove refs to cloudservices1003

https://gerrit.wikimedia.org/r/826645

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudservices1003

cloudservices1003 (FAIL)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above
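The bolding that the error line refers to is lost when the summary is pasted as plain text. As a rough sketch (not part of the cookbook itself), the summary can be parsed back into hosts, statuses and steps. The `HostResult` class and both function names are made up for this example, and treating steps that start with "Unable to" as the likely failures is only a heuristic assumption.

```python
import re
from dataclasses import dataclass, field

# Matches a host header such as "cloudservices1003 (FAIL)".
HOST_LINE = re.compile(r"^(?P<host>\S+) \((?P<status>[A-Z]+)\)$")


@dataclass
class HostResult:
    """Hypothetical container for one host's section of the summary."""
    host: str
    status: str
    steps: list = field(default_factory=list)


def parse_summary(text):
    """Parse a pasted cookbook summary into one HostResult per host."""
    results = []
    for raw in text.splitlines():
        line = raw.strip()
        match = HOST_LINE.match(line)
        if match:
            results.append(HostResult(match["host"], match["status"]))
        elif line.startswith("- ") and results:
            # Step lines belong to the most recent host header.
            results[-1].steps.append(line[2:])
    return results


def suspect_steps(result):
    """Heuristic only: return steps whose text starts with 'Unable to'."""
    return [step for step in result.steps if step.startswith("Unable to")]
```

For the run above, this would single out the "Unable to connect to the host" step as the one to look at.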

Maintenance_bot removed a project: Patch-For-Review. Aug 25 2022, 8:31 PM

@Cmjohnson, this is another host that will need its drives wiped, as the cookbook seems to be bad at that lately. Thanks!

Andrew mentioned this in T316292: decom cookbook often fails to wipe drives in HP systems. Aug 25 2022, 8:51 PM

Maintenance_bot added a project: SRE. Aug 25 2022, 9:29 PM

In T316285#8187029, @Andrew wrote:

@Cmjohnson, this is another host that will need its drives wiped, as the cookbook seems to be bad at that lately. Thanks!

The host was not reachable by cumin at the time of decom, see T316292#8189015.

It seems that the runbook either did not clean up puppetdb or the host was repopulated right after, as it still shows up there:

https://debmonitor.wikimedia.org/hosts/cloudservices1003.wikimedia.org
https://puppetboard.wikimedia.org/node/cloudservices1003.wikimedia.org

You can see that the timestamps there fall within the decom run.

This is triggering some alerts:
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=netbox1002&service=Netbox+report+puppetdb_physical
Details of the report:
https://netbox.wikimedia.org/extras/reports/results/3712993/

So we should figure out how to clean those up (thanks @ayounsi for the notice on IRC).

The decom cookbook is meant to be idempotent, so you can safely re-run it. That said, I can look next week at the logs of the previous run to check if something went obviously wrong. For example, if the host wasn't powered off, that could explain its re-appearance in puppetdb and debmonitor.
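To illustrate what idempotent means here, below is a toy model and not the cookbook's actual code. Each "inventory" is just an in-memory set standing in for a system such as puppetdb or debmonitor, and the function names are hypothetical. Removing a host that is already gone is a no-op rather than an error, so a second run leaves the state unchanged.

```python
def remove_host(inventory, host):
    """Remove host from one inventory; return True if something changed."""
    if host in inventory:
        inventory.discard(host)
        return True
    return False  # already gone: a re-run is a no-op, not an error


def decommission(host, inventories):
    """Run every removal step and report which ones changed state."""
    return {name: remove_host(hosts, host) for name, hosts in inventories.items()}
```

In this model, a host that re-registers itself between two runs (for example because it was never powered off) would simply be removed again by the second run.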

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudservices1003

cloudservices1003 (FAIL)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- No DNS record found for the mgmt interface cloudservices1003.mgmt.eqiad.wmnet, trying the asset tag one: wmf7225.mgmt.eqiad.wmnet
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above
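The "No DNS record found for the mgmt interface" step in this run shows a fallback from the hostname-based management name to the asset-tag one. A minimal sketch of that kind of ordered fallback is below. It is an assumption about the general technique, not the cookbook's implementation; `resolve_mgmt` and its `resolver` parameter are names invented for this example.

```python
import socket


def resolve_mgmt(candidates, resolver=socket.gethostbyname):
    """Return (fqdn, ip) for the first candidate name that resolves."""
    for fqdn in candidates:
        try:
            return fqdn, resolver(fqdn)
        except OSError:  # socket.gaierror is a subclass of OSError
            continue
    raise LookupError(f"none of {candidates} resolved")
```

It would be called with the names in order of preference, for example `resolve_mgmt(["cloudservices1003.mgmt.eqiad.wmnet", "wmf7225.mgmt.eqiad.wmnet"])`.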

It looks like the decom script can't find it in puppetdb even though the alert says it is in puppetdb.

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudservices1003.wikimedia.org

cloudservices1003.wikimedia.org (FAIL)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- No DNS record found for the mgmt interface cloudservices1003.mgmt.eqiad.wmnet, trying the asset tag one: wmf7225.mgmt.eqiad.wmnet
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudservices1003.wikimedia.org

cloudservices1003.wikimedia.org (FAIL)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- No DNS record found for the mgmt interface cloudservices1003.mgmt.eqiad.wmnet, trying the asset tag one: wmf7225.mgmt.eqiad.wmnet
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

I am now officially out of ideas :) Over to you, @Volans!

@Andrew what is the issue that you're still seeing? It looks good to me.

I see that the host is correctly powered off:

>>> i.power_status()
INFO:spicerack.ipmi:Running IPMI command: ipmitool -I lanplus -H wmf7225.mgmt.eqiad.wmnet -U root -E chassis power status
'off'
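For anyone without a spicerack shell at hand, the same check can be approximated by calling the ipmitool command from the log line directly. This is a sketch under assumptions and not how spicerack implements it. The function names are hypothetical, and it relies on ipmitool printing a line of the form "Chassis Power is off" and on `-E` reading the password from the `IPMI_PASSWORD` environment variable.

```python
import subprocess


def power_status_cmd(mgmt_fqdn, user="root"):
    """Build the argv for the ipmitool call shown in the log line."""
    return ["ipmitool", "-I", "lanplus", "-H", mgmt_fqdn,
            "-U", user, "-E", "chassis", "power", "status"]


def parse_power_status(output):
    """Map ipmitool's 'Chassis Power is on/off' line to 'on' or 'off'."""
    state = output.strip().rsplit(" ", 1)[-1].lower()
    if state not in ("on", "off"):
        raise ValueError(f"unexpected ipmitool output: {output!r}")
    return state


def power_status(mgmt_fqdn):
    """Run ipmitool (needs IPMI_PASSWORD set) and return 'on' or 'off'."""
    proc = subprocess.run(power_status_cmd(mgmt_fqdn),
                          capture_output=True, text=True, check=True)
    return parse_power_status(proc.stdout)
```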

It's not present, as expected, in:

Puppetboard (hence puppetdb): https://puppetboard.wikimedia.org/node/cloudservices1003.wikimedia.org
Debmonitor: https://debmonitor.wikimedia.org/search?q=cloudservices1003
Puppet certs (from sudo puppet cert list --all | grep cloudservices1003)
Icinga: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=cloudservices1003

and it's marked as decommissioning in Netbox: https://netbox.wikimedia.org/dcim/devices/1596/
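The `| grep cloudservices1003` style of check used for the Puppet certs can be repeated against any text listing (a cert list, an exported host list, and so on). A small sketch is below; the helper names are hypothetical, and it does plain substring matching just like the grep above.

```python
def matching_lines(listing, hostname):
    """Return the lines of a text listing that mention the hostname."""
    return [line for line in listing.splitlines() if hostname in line]


def is_absent(listing, hostname):
    """True when no line of the listing mentions the hostname."""
    return not matching_lines(listing, hostname)
```

As with grep, a substring match would also hit a name like cloudservices10030, so an unexpected match is worth reading before acting on it.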

Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board. Sep 14 2022, 4:12 PM

wiki_willy assigned this task to Cmjohnson. Sep 28 2022, 1:10 AM

Finished Decom process

Status	Subtype	Assigned	Task
			Unknown Object (Task)
Resolved		• Cmjohnson	T304888 Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts
Resolved	Request	Jclark-ctr	T316285 decommission cloudservices1003.wikimedia.org
