
codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers
Closed, Resolved, Public

Description

Per T241852#5945821, 15 new appservers in codfw can't be racked until at least 15 old servers have been removed.

Therefore rack C3 should take priority over other "decom old servers" tasks, and we should start taking servers out of it to make room.

Event Timeline

Change 579073 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] decom 15 codfw appservers

https://gerrit.wikimedia.org/r/579073

Mentioned in SAL (#wikimedia-operations) [2020-03-11T22:28:02Z] <mutante> depooled mw2167 through mw2172 - rack C3 (T247018)

mw2158 through mw2172 are permanently depooled (state=inactive) now. That's exactly 15 servers from the middle of C3. Also set their status to "decommissioning" in Netbox.
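
As a rough model of what "permanently depooled (state=inactive)" means, here is a small sketch assuming conftool-style semantics, where a host's pooled value is `yes`, `no` or `inactive`: `no` keeps the host listed in the load-balancer pool but disabled, while `inactive` leaves it out altogether. This is an illustrative model only, not conftool's or PyBal's code, and the host states in the example are hypothetical.

```python
from enum import Enum


class Pooled(Enum):
    YES = "yes"            # receiving traffic
    NO = "no"              # temporarily depooled, still listed in the pool
    INACTIVE = "inactive"  # permanently depooled: left out of the pool entirely


def pool_members(hosts: dict[str, Pooled]) -> dict[str, bool]:
    """Map each listed host to enabled/disabled, omitting inactive hosts."""
    return {
        host: state is Pooled.YES
        for host, state in hosts.items()
        if state is not Pooled.INACTIVE
    }


# Hypothetical states, for illustration only.
states = {
    "mw2157.codfw.wmnet": Pooled.YES,
    "mw2158.codfw.wmnet": Pooled.INACTIVE,
    "mw2173.codfw.wmnet": Pooled.NO,
}
print(pool_members(states))
```

The point of the distinction: a host set to `no` can be repooled with a single state change, while an `inactive` host no longer appears in the generated pool at all, which matches the intent of a permanent depool ahead of decommissioning.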

Papaul renamed this task from "decom at least 15 appservers in codfw rack C3 to make room for new servers" to "codfw: decom at least 15 appservers(mw2158 through mw2172) in codfw rack C3 to make room for new servers" (Mar 12 2020, 11:53 PM).

Icinga downtime for 12:00:00 set by dzahn@cumin1001 on 15 host(s) and their services with reason: decom

mw[2158-2172].codfw.wmnet
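
Host sets here are written in the bracket-range notation (`mw[2158-2172].codfw.wmnet`) that Cumin accepts via ClusterShell. To illustrate what such a pattern resolves to, here is a minimal hand-rolled expander. It is a sketch that assumes a single `[start-end]` range with no comma lists, and it is not ClusterShell's `NodeSet` implementation.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one numeric bracket range, e.g. 'mw[2158-2172].codfw.wmnet'.

    Simplified sketch: supports a single [start-end] range only and keeps
    the zero-padding width of the start value.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]  # no range: a single literal hostname
    prefix, start, end, suffix = match.groups()
    width = len(start)
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(int(start), int(end) + 1)]


hosts = expand_hosts("mw[2158-2172].codfw.wmnet")
print(len(hosts), hosts[0], hosts[-1])
```

For the pattern above this yields the 15 hosts mw2158.codfw.wmnet through mw2172.codfw.wmnet, matching the "15 host(s)" in the downtime message.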

Reverted the depool of mw2158 through mw2172. They are back in the pool for now: this task stalled for a week or two, so we may as well keep using them in the meantime.

Dzahn changed the task status from Open to Stalled (Apr 3 2020, 9:38 AM).
Dzahn changed the task status from Stalled to Open (May 21 2020, 11:42 AM).

Mentioned in SAL (#wikimedia-operations) [2020-05-21T12:13:19Z] <mutante> depooled mw2158 through mw2172 to make room again in C3 as planned (T247018)

Change 597771 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: remove 13 old jobrunners from codfw rack C3

https://gerrit.wikimedia.org/r/597771

Icinga downtime for 4 days, 0:00:00 set by dzahn@cumin1001 on 15 host(s) and their services with reason: decom

mw[2158-2172].codfw.wmnet

Change 597771 merged by Dzahn:
[operations/puppet@production] site: remove 13 old jobrunners from codfw rack C3

https://gerrit.wikimedia.org/r/597771

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw2150.codfw.wmnet

  • mw2150.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
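
Each cookbook run reports the same ordered list of steps per host. The sketch below models that report as a simple checklist runner. The step names are copied from the output above, but the `run_step` callback and the stop-at-first-failure behaviour are assumptions made for illustration; this is not how `cookbooks.sre.hosts.decommission` is actually implemented.

```python
from typing import Callable

# Step names copied from the cookbook report; the order matters
# (for example, bootloaders are wiped before the host is powered off).
STEPS: list[str] = [
    "Downtimed host on Icinga",
    "Found physical host",
    "Downtimed management interface on Icinga",
    "Wiped bootloaders",
    "Powered off",
    "Set Netbox status to Decommissioning",
    "Removed from DebMonitor",
    "Removed from Puppet master and PuppetDB",
]


def decommission(host: str, run_step: Callable[[str, str], bool]) -> tuple[str, list[str]]:
    """Run each step in order for one host; stop at the first failing step."""
    done: list[str] = []
    for step in STEPS:
        if not run_step(host, step):
            return "FAIL", done
        done.append(step)
    return "PASS", done


def report(hosts: list[str], run_step: Callable[[str, str], bool]) -> str:
    """Format results like the cookbook output: host (STATUS), then its completed steps."""
    lines: list[str] = []
    for host in hosts:
        status, done = decommission(host, run_step)
        lines.append(f"{host} ({status})")
        lines.extend(f"    {step}" for step in done)
    return "\n".join(lines)


# Hypothetical run in which every step succeeds.
print(report(["mw2150.codfw.wmnet"], lambda host, step: True))
```

Stopping at the first failure is a design choice of this sketch: it avoids, say, removing a host from PuppetDB when it could not be powered off.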

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[2151-2155].codfw.wmnet

  • mw2151.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2152.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2153.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2154.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2155.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[2156-2159].codfw.wmnet

  • mw2156.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2157.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2158.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2159.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[2160-2162].codfw.wmnet

  • mw2160.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2161.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2162.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Dzahn renamed this task from "codfw: decom at least 15 appservers(mw2158 through mw2172) in codfw rack C3 to make room for new servers" to "codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers" (May 22 2020, 11:14 AM).

Change 598025 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: decom mw2163 through mw2169 appservers

https://gerrit.wikimedia.org/r/598025

Change 598025 merged by Dzahn:
[operations/puppet@production] site: decom mw2163 through mw2169 appservers

https://gerrit.wikimedia.org/r/598025

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[2163-2166].codfw.wmnet

  • mw2163.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2164.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2165.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2166.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[2167-2169].codfw.wmnet

  • mw2167.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2168.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2169.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

@Papaul 23 servers from rack C3 have been decom'ed: mw2150 through mw2172 (the lower part of the rack).

You can:

  • remove these physically from the rack
  • use the space for your planned test of new cabling schema
  • rack new servers in their place

Next week I plan to remove even more, so we might be able to empty out almost all of C3.

Change 598036 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: decom mw2170 - mw2172

https://gerrit.wikimedia.org/r/598036

Change 598039 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: remove recently decom'ed codfw appservers

https://gerrit.wikimedia.org/r/598039

Change 598040 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mcrouter: replace proxy in codfw row C

https://gerrit.wikimedia.org/r/598040

Change 598040 merged by Dzahn:
[operations/puppet@production] mcrouter: replace proxy in codfw row C

https://gerrit.wikimedia.org/r/598040

Change 598036 merged by Dzahn:
[operations/puppet@production] site: decom mw2170 - mw2172

https://gerrit.wikimedia.org/r/598036

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[2170-2172].codfw.wmnet

  • mw2170.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2171.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2172.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 598039 merged by Dzahn:
[operations/puppet@production] DHCP: remove recently decom'ed codfw appservers

https://gerrit.wikimedia.org/r/598039

Technically this is resolved: we made more than enough room for the 5 new servers (no longer 15, since 10 were used for T252185).

This unblocked T241852.

I'll keep the task open, though, to do more decoms next week and maybe clear out all of C3 so dcops can use it for the new cabling.
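
The "more than enough room" arithmetic can be spelled out as below. This assumes, as a simplification, that each removed old appserver frees room for one new server; the task does not give rack-unit sizes.

```python
def hosts_in_range(first: int, last: int) -> int:
    """Count hosts in an inclusive numeric range, e.g. mw2150..mw2172."""
    return last - first + 1


freed = hosts_in_range(2150, 2172)  # decom'ed so far: mw2150 through mw2172
still_needed = 15 - 10              # 15 planned, 10 of them used for T252185 instead
print(freed, still_needed, freed >= still_needed)
```

This gives 23 freed slots against 5 servers still waiting to be racked, so the blocker on T241852 is cleared with room to spare.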

@Dzahn Please do not resolve yet. I still have mgmt DNS and switch ports to remove.

Thanks

Change 599603 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: decom mw2173 through mw2179

https://gerrit.wikimedia.org/r/599603

Change 599606 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: decom mw2180 through mw2186

https://gerrit.wikimedia.org/r/599606

Change 579073 abandoned by Dzahn:
decom 15 codfw appservers from rack C3

Reason:
duplicate, already done in other changes

https://gerrit.wikimedia.org/r/579073

Change 599610 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove production IPs of mw2163 through mw2172

https://gerrit.wikimedia.org/r/599610

Change 599610 merged by Dzahn:
[operations/dns@master] remove production IPs of mw2163 through mw2172

https://gerrit.wikimedia.org/r/599610

Change 599614 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove production IPs of mw2150 through mw2162

https://gerrit.wikimedia.org/r/599614

Change 599614 merged by Dzahn:
[operations/dns@master] remove production IPs of mw2150 through mw2162

https://gerrit.wikimedia.org/r/599614
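
Removing production IPs amounts to dropping the records of a contiguous host range from the zone data. A minimal sketch of that filtering follows. It assumes a simplified forward zone with one record per line and the owner name first, and it ignores reverse (PTR) zones, `$ORIGIN` handling and continuation lines. The zone lines and addresses are made up, and this is not how the operations/dns repository is actually structured or edited.

```python
import re


def drop_host_records(zone_lines: list[str], first: int, last: int, prefix: str = "mw") -> list[str]:
    """Remove lines whose owner name is <prefix><N> with first <= N <= last."""
    owner = re.compile(rf"^{re.escape(prefix)}(\d+)\b")
    kept: list[str] = []
    for line in zone_lines:
        match = owner.match(line)
        if match and first <= int(match.group(1)) <= last:
            continue  # record belongs to a decom'ed host: drop it
        kept.append(line)
    return kept


# Hypothetical forward-zone lines with made-up addresses.
zone = [
    "mw2149 1H IN A 10.0.0.49",
    "mw2150 1H IN A 10.0.0.50",
    "mw2162 1H IN A 10.0.0.62",
    "mw2163 1H IN A 10.0.0.63",
]
print(drop_host_records(zone, 2150, 2162))
```

Only mw2149 and mw2163 survive, mirroring how this change covered exactly mw2150 through mw2162 while change 599610 handled mw2163 through mw2172.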


@Papaul I also still had to remove production IPs. Done for mw2150 through mw2172.

I also uploaded new changes to decom mw2173 through mw2186, but they are not merged yet. Once they are, this rack will be empty of mw servers (apart from the special cases that are not mw hosts).

I see you have started racking 4 of the new servers and uploaded https://gerrit.wikimedia.org/r/c/operations/puppet/+/599749 for them (for now).

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw2173.codfw.wmnet

  • mw2173.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[2174-2177].codfw.wmnet

  • mw2174.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2175.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2176.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2177.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[2178-2179].codfw.wmnet

  • mw2178.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2179.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 599603 merged by Dzahn:
[operations/puppet@production] site: decom mw2173 through mw2179

https://gerrit.wikimedia.org/r/599603

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[2180-2183].codfw.wmnet

  • mw2180.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2181.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2182.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2183.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[2184-2186].codfw.wmnet

  • mw2184.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2185.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2186.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 599606 merged by Dzahn:
[operations/puppet@production] site: decom mw2180 through mw2186

https://gerrit.wikimedia.org/r/599606

Change 602012 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove production IPs for mw2173 through mw2186

https://gerrit.wikimedia.org/r/602012

Change 602012 merged by Dzahn:
[operations/dns@master] remove production IPs for mw2173 through mw2186

https://gerrit.wikimedia.org/r/602012

Change 602013 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove mgmt IPs for mw2150 through mw2186

https://gerrit.wikimedia.org/r/602013

@Papaul All remaining old mw servers in rack C3 (mw2154 through mw2186) are now decom'ed as well: removed from site and their production IPs removed. Only the removal of their mgmt IPs is still pending (https://gerrit.wikimedia.org/r/c/operations/dns/+/602013).

Feel free to remove them, replace them with new servers, and use C3 for the new cabling (except for the thumbor/db servers). Let me know when we are ready to merge the change above that removes the mgmt IPs.

I am assigning this to you for now because you said "don't resolve yet, I still have switch ports to do", etc. Please assign it back to me once that is done.
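
The decommission cookbook was run in twelve batches over the course of this task. A quick sketch to confirm that those batches together cover mw2150 through mw2186 exactly once (37 hosts, the same range as the mgmt IP change) might look like this. The batch list is transcribed from the cookbook runs logged in this task; the `check_coverage` helper is hypothetical.

```python
from collections import Counter

# (first, last) host numbers of each sre.hosts.decommission run in this task.
BATCHES = [
    (2150, 2150), (2151, 2155), (2156, 2159), (2160, 2162),
    (2163, 2166), (2167, 2169), (2170, 2172), (2173, 2173),
    (2174, 2177), (2178, 2179), (2180, 2183), (2184, 2186),
]


def check_coverage(batches: list[tuple[int, int]], first: int, last: int) -> tuple[int, list[int], list[int]]:
    """Return (hosts decom'ed, numbers never covered, numbers covered more than once)."""
    counts = Counter(n for start, end in batches for n in range(start, end + 1))
    missing = [n for n in range(first, last + 1) if counts[n] == 0]
    repeated = sorted(n for n, c in counts.items() if c > 1)
    return sum(counts.values()), missing, repeated


total, missing, repeated = check_coverage(BATCHES, 2150, 2186)
print(total, missing, repeated)
```

An empty `missing` list and an empty `repeated` list mean no host in the range was skipped or decommissioned twice.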

Dzahn lowered the priority of this task from High to Medium.

switch ports removed for mw2154 through mw2186


@Papaul Thanks! Does that mean https://gerrit.wikimedia.org/r/c/operations/dns/+/602013 can be merged?

Change 602013 merged by Dzahn:
[operations/dns@master] remove mgmt IPs for mw2150 through mw2186

https://gerrit.wikimedia.org/r/602013

@Papaul done! Do you still have anything to do here on your side?

@Dzahn Yes, I still have to set all the decom'ed servers to offline.