decom 8 codfw appservers purchased on 2016-06-02
Closed, ResolvedPublic

Description

After T277119 this is the next decom ticket for servers purchased on 2016-06-02 in T134272.

https://netbox.wikimedia.org/dcim/devices/?q=mw2&mac_address=&has_primary_ip=&local_context_data=&virtual_chassis_member=&console_ports=&console_server_ports=&power_ports=&power_outlets=&interfaces=&pass_through_ports=&cf_purchase_date=2016-06-02

  • mw2243 jobrunner (new jobrunner: mw2379)
  • mw2244 canary API replaced with mw2251
  • mw2245 canary API replaced with mw2252
  • mw2246 jobrunner (new jobrunner: mw2380)
  • mw2247 jobrunner (new jobrunner: mw2381)
  • mw2248 jobrunner (new jobrunner: mw2382)
  • mw2249 canary jobrunner (new canary: mw2278)
  • mw2250 canary jobrunner (new canary: mw2279)
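
The replacement plan above can be written down as a simple mapping, which makes it easy to check that no old host is left without a successor before it is depooled. This is only a minimal sketch: the names `REPLACEMENTS` and `unreplaced` are illustrative and are not part of any Wikimedia tooling.

```python
# Old host -> replacement host, copied from the list in the task description.
REPLACEMENTS = {
    "mw2243": "mw2379",  # jobrunner
    "mw2244": "mw2251",  # canary API
    "mw2245": "mw2252",  # canary API
    "mw2246": "mw2380",  # jobrunner
    "mw2247": "mw2381",  # jobrunner
    "mw2248": "mw2382",  # jobrunner
    "mw2249": "mw2278",  # canary jobrunner
    "mw2250": "mw2279",  # canary jobrunner
}


def unreplaced(old_hosts, replacements=REPLACEMENTS):
    """Return the old hosts that have no replacement recorded."""
    return sorted(h for h in old_hosts if not replacements.get(h))


# The eight hosts covered by this task: mw2243 through mw2250.
old = [f"mw{n}" for n in range(2243, 2251)]
print(unreplaced(old))
```

An empty result means every host in the batch has a named successor.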

Event Timeline

Dzahn renamed this task from decom codfw appservers purchased on 2016-06-02 to decom 7 codfw appservers purchased on 2016-06-02 (Mar 18 2021, 5:12 PM).
Dzahn updated the task description.
Dzahn added a project: ops-codfw.

Change 673367 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/conftool-data: turn mw2251,mw2252 into canaries

https://gerrit.wikimedia.org/r/673367

Change 673368 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/conftool-data: decom mw2244,mw2245, former canary servers

https://gerrit.wikimedia.org/r/673368

Dzahn renamed this task from decom 7 codfw appservers purchased on 2016-06-02 to decom 8 codfw appservers purchased on 2016-06-02 (Mar 18 2021, 10:52 PM).
Dzahn updated the task description.

6 out of 8 are jobrunners. Maybe best to wait for T274171 to have started and turn some new servers in A3 into jobrunners, then remove these in A4 afterwards.

Dzahn changed the task status from Open to Stalled (Mar 18 2021, 10:54 PM).
Dzahn triaged this task as High priority.

Change 673367 merged by Dzahn:
[operations/puppet@production] site/conftool-data: turn mw2251,mw2252 into canaries

https://gerrit.wikimedia.org/r/673367

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw2244.codfw.wmnet

  • mw2244.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw2245.codfw.wmnet

  • mw2245.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 673368 merged by Dzahn:
[operations/puppet@production] site/conftool-data: decom mw2244,mw2245, former canary servers

https://gerrit.wikimedia.org/r/673368

Change 673630 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/conftool-data: turn mw2278,mw2279 into canary jobrunners

https://gerrit.wikimedia.org/r/673630

@Papaul FYI, this one is separate from T277119. I had to separate them somehow, so this one is grouped by purchase date instead of by rack. You will see, though, that this covers just A4; A3 is already covered.

Change 673630 merged by Dzahn:
[operations/puppet@production] site/conftool-data: turn mw2278,mw2279 into canary jobrunners

https://gerrit.wikimedia.org/r/673630

Change 674137 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/conftool-data: decom mw2249,mw2250 jobrunner canaries

https://gerrit.wikimedia.org/r/674137

@Dzahn thanks for the update. I am planning on racking mw2401 to mw2411 in A5 and not in A4. A4 is a 10G rack, and I would like to keep it for 10G servers only rather than put a 1G server in it. Let me know if this is a problem.

Thanks

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw2249.codfw.wmnet

  • mw2249.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

@Papaul We can do that, it isn't a problem. We can use A3 and A5. Thank you

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw2250.codfw.wmnet

  • mw2250.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 674137 merged by Dzahn:
[operations/puppet@production] site/conftool-data: decom mw2249,mw2250 jobrunner canaries

https://gerrit.wikimedia.org/r/674137

Change 674727 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] site/conftool-data: turn new servers mw2377,mw2378 into jobrunners

https://gerrit.wikimedia.org/r/674727

Change 674727 merged by Dzahn:
[operations/puppet@production] site/conftool-data: turn new servers mw2377,mw2378 into jobrunners

https://gerrit.wikimedia.org/r/674727

Change 674736 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] site/conftool-data: decom jobrunners mw2243,mw2246,mw2247,mw2248

https://gerrit.wikimedia.org/r/674736

Change 676437 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/conftool-data: turn 4 more new servers into jobrunners

https://gerrit.wikimedia.org/r/676437

Change 676437 merged by Dzahn:

[operations/puppet@production] site/conftool-data: turn 4 more new servers into jobrunners

https://gerrit.wikimedia.org/r/676437

Dzahn changed the task status from Stalled to Open (Apr 1 2021, 8:40 PM).

4 new jobrunners have been created. This can now continue.

Mentioned in SAL (#wikimedia-operations) [2021-04-01T20:42:39Z] <mutante> mw2243, mw2246, mw2247, mw2248 - depooled - replaced by mw2379, mw2380, mw2381, mw2382 (T277780)

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw2243.codfw.wmnet

  • mw2243.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw2246.codfw.wmnet

  • mw2246.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw2247.codfw.wmnet

  • mw2247.codfw.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Failed to power off, manual intervention required: Remote IPMI for mw2247.mgmt.codfw.wmnet failed (exit=1): b''
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above
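
The bold highlighting that the error message refers to does not survive in plain text, so the failing step has to be found by reading each line. A small parser can do that instead. The sketch below assumes the summary layout shown in this task (a host line ending in `(PASS)` or `(FAIL)`, followed by indented step lines) and treats any step beginning with "Failed" or "Host steps raised" as a failure. That prefix rule is a heuristic inferred from the output in this task, not something the cookbook documents, and `failed_steps` is a hypothetical helper.

```python
import re

# A host line looks like "  • mw2247.codfw.wmnet (FAIL)".
HOST_RE = re.compile(r"^\s*•\s+(\S+)\s+\((PASS|FAIL)\)\s*$")
# Any other bullet is treated as a step line.
STEP_RE = re.compile(r"^\s*•\s+(.*\S)\s*$")
# Heuristic: prefixes that mark a failed step in the output seen here.
FAILURE_PREFIXES = ("Failed", "Host steps raised")


def failed_steps(log_text):
    """Map each FAIL host to the step lines that look like failures."""
    result = {}
    current = None  # host whose steps are being read, if it failed
    for line in log_text.splitlines():
        host = HOST_RE.match(line)
        if host:
            current = host.group(1) if host.group(2) == "FAIL" else None
            if current:
                result[current] = []
            continue
        step = STEP_RE.match(line)
        if step and current and step.group(1).startswith(FAILURE_PREFIXES):
            result[current].append(step.group(1))
    return result


sample = """\
  • mw2247.codfw.wmnet (FAIL)
    • Downtimed host on Icinga
    • Failed to power off, manual intervention required: Remote IPMI for mw2247.mgmt.codfw.wmnet failed (exit=1): b''
    • Removed from DebMonitor
"""
print(failed_steps(sample))
```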

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw2248.codfw.wmnet

  • mw2248.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 674736 merged by Dzahn:

[operations/puppet@production] site/conftool-data: decom jobrunners mw2243,mw2246,mw2247,mw2248

https://gerrit.wikimedia.org/r/674736

Dzahn reopened this task as Open.
Dzahn updated the task description.

@Papaul These were old servers in rack A4. They are also ready to go now.

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw2247.codfw.wmnet

  • mw2247.codfw.wmnet (FAIL)
    • Downtimed host on Icinga

Not sure why, but the Icinga downtime on this actually failed. I just set it manually.

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw2247.codfw.wmnet

  • mw2247.codfw.wmnet (FAIL)
    • Downtimed host on Icinga
    • Host steps raised exception: Invalid management FQDN mw2247.mgmt.codfw.wmnet for mw2247.codfw.wmnet

ERROR: some step on some host failed, check the bolded items above

Thank you! The original reason was:

Failed to power off, manual intervention required: Remote IPMI for mw2247.mgmt.codfw.wmnet failed (exit=1): b''

and then it got into a zombie state: the host and its mgmt interface were gone from DNS, but the host was still in PuppetDB, which meant it was also still in Icinga.

This was surprising because the logs for the first cookbook run explicitly include the "removed from PuppetDB" step and show that it ran 'puppet node clean/deactivate'.

Repeating the decom cookbook did not fix that situation.
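
The zombie state described here comes down to a set difference: hosts that PuppetDB still lists but that DNS no longer knows. A minimal sketch of that check, assuming both inventories have already been fetched as plain lists of host names; `find_zombies` and the sample inputs are hypothetical, and no real PuppetDB or DNS API is called.

```python
def find_zombies(puppetdb_hosts, dns_hosts):
    """Return hosts still present in PuppetDB but missing from DNS."""
    return sorted(set(puppetdb_hosts) - set(dns_hosts))


# Illustrative inputs only: mw2247 was half-removed, mw2381 is healthy.
puppetdb = ["mw2247.codfw.wmnet", "mw2381.codfw.wmnet"]
dns = ["mw2381.codfw.wmnet"]
print(find_zombies(puppetdb, dns))
```

Any host returned by such a check would still generate Icinga config from PuppetDB while being unreachable by name.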

Then I manually ran puppet node deactivate mw2247.codfw.wmnet on puppetmaster1001, followed by a puppet agent run on alert1001, and this solved it. The Icinga config snippets got removed and the host is gone from the web UI now.

Another side effect of this was that running mcrouter_generate_certs failed when trying to add new hosts, because it looks up hosts from PuppetDB, found mw2247, and then failed to look up its host name.

Thanks to @RLazarus for adding a debug line that told us which host name lookup actually failed. Originally you could not tell which of the hosts was causing the issue.
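
The general idea behind such a debug line is to catch the lookup error per host and put the host name into the message. The sketch below illustrates that pattern only; it is not the actual mcrouter_generate_certs code, and `resolve_hosts`, `fake_resolver`, and the placeholder address are hypothetical. The resolver is injectable so the example runs without network access.

```python
import socket


def resolve_hosts(hostnames, resolver=socket.gethostbyname):
    """Resolve each host name; on failure, raise an error naming the host."""
    addresses = {}
    for name in hostnames:
        try:
            addresses[name] = resolver(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            # Name the failing host, which a bare lookup error does not.
            raise RuntimeError(f"lookup failed for {name}: {exc}") from exc
    return addresses


def fake_resolver(name):
    """Stand-in for DNS: knows one host, using a documentation-range address."""
    table = {"mw2381.codfw.wmnet": "192.0.2.10"}  # placeholder address
    if name not in table:
        raise socket.gaierror(f"no such host: {name}")
    return table[name]


try:
    resolve_hosts(["mw2381.codfw.wmnet", "mw2247.codfw.wmnet"], fake_resolver)
except RuntimeError as err:
    print(err)
```

Stopping at the first failure keeps the behaviour of a hard error while making the culprit obvious; collecting all failures before raising would be a reasonable alternative.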

Mentioned in SAL (#wikimedia-operations) [2021-04-06T11:43:51Z] <moritzm> removed mw2247 from debmonitor T277780