decom old appservers in eqiad
Closed, ResolvedPublic

Description

We need to decom old appservers in eqiad now that we have racked new servers in T241849.

We need to reduce power usage and make space for T245161 and T245099.

  • mw1221 - mw1226 (rack D4, 6 servers)
  • mw1227 - mw1258 (rack D5, 30 servers)

Event Timeline

Dzahn added a parent task: Unknown Object (Task).
Dzahn updated the task description.

Change 580101 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/conftool: remove mw1221 through mw1226

https://gerrit.wikimedia.org/r/580101

Change 580105 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: remove mw1221 through mw1226

https://gerrit.wikimedia.org/r/580105

Change 580107 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove production IPs of mw1221 through mw1226

https://gerrit.wikimedia.org/r/580107

Mentioned in SAL (#wikimedia-operations) [2020-03-16T20:04:38Z] <mutante> depool (yes->no) mw1221 - mw1226 (T247780)

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw1221.eqiad.wmnet

  • mw1221.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[1222-1226].eqiad.wmnet

  • mw1222.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1223.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1224.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Failed to wipe bootloaders, manual intervention required to make it unbootable: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1225.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1226.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above
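The cookbook report refers to bolded items, but the bold formatting is not visible in this plain-text rendering. As a rough aid, the following sketch pulls the failing hosts and steps out of pasted report text. It is not part of the cookbook; the function name `failed_steps` and the keyword heuristic for spotting a failed step are assumptions made for illustration.

```python
import re

# Matches a host header such as "  • mw1224.eqiad.wmnet (FAIL)".
HOST_RE = re.compile(r"^\s*•\s+(?P<host>\S+)\s+\((?P<status>PASS|FAIL)\)\s*$")
# Matches any other bullet line, e.g. "    • Powered off".
STEP_RE = re.compile(r"^\s*•\s+(?P<step>.+)$")
# Heuristic (an assumption): failing steps contain one of these words.
FAILURE_WORDS = re.compile(r"fail|unable|exception", re.IGNORECASE)


def failed_steps(report: str) -> dict[str, list[str]]:
    """Map each host marked FAIL to the step lines that look like failures."""
    failures: dict[str, list[str]] = {}
    current = None  # FAIL host whose steps are currently being read
    for line in report.splitlines():
        host_match = HOST_RE.match(line)
        if host_match:
            is_fail = host_match["status"] == "FAIL"
            current = host_match["host"] if is_fail else None
            if current:
                failures[current] = []
            continue
        step_match = STEP_RE.match(line)
        if current and step_match and FAILURE_WORDS.search(step_match["step"]):
            failures[current].append(step_match["step"])
    return failures


# Excerpt copied from the report above.
SAMPLE = """\
  • mw1223.eqiad.wmnet (PASS)
    • Wiped bootloaders
    • Powered off
  • mw1224.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Failed to wipe bootloaders, manual intervention required to make it unbootable: Cumin execution failed (exit_code=2)
    • Powered off
"""

for host, steps in failed_steps(SAMPLE).items():
    print(host, steps)
```

On this excerpt the only host reported is mw1224.eqiad.wmnet, with the bootloader wipe as the single failing step.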

Change 580101 merged by Dzahn:
[operations/puppet@production] site/conftool: remove mw1221 through mw1226

https://gerrit.wikimedia.org/r/580101

Change 580105 merged by Dzahn:
[operations/puppet@production] DHCP: remove mw1221 through mw1226

https://gerrit.wikimedia.org/r/580105

Change 580384 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/conftool: remove mw1238 through mw1243

https://gerrit.wikimedia.org/r/580384

Change 580384 merged by Dzahn:
[operations/puppet@production] site/conftool: remove mw1238 through mw1243

https://gerrit.wikimedia.org/r/580384

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[1238-1239].eqiad.wmnet

  • mw1238.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1239.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Mentioned in SAL (#wikimedia-operations) [2020-03-17T18:39:32Z] <mutante> removing mw1238 through mw1243 - decom with cookbook (T247780 T245099)

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[1240-1243].eqiad.wmnet

  • mw1240.eqiad.wmnet (FAIL)
    • Host steps raised exception: Empty Management Password
  • mw1241.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1242.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1243.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw1240.eqiad.wmnet

  • mw1240.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 580417 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: remove mw1238 through mw1243

https://gerrit.wikimedia.org/r/580417

Change 580107 merged by Dzahn:
[operations/dns@master] remove production IPs of mw1221 through mw1226

https://gerrit.wikimedia.org/r/580107

Change 580417 merged by Dzahn:
[operations/puppet@production] DHCP: remove mw1238 through mw1243

https://gerrit.wikimedia.org/r/580417

Change 580418 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove production IPs of mw1238 through mw1243

https://gerrit.wikimedia.org/r/580418

Change 580418 merged by Dzahn:
[operations/dns@master] remove production IPs of mw1238 through mw1243

https://gerrit.wikimedia.org/r/580418

Change 582160 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/conftool: decom mw1244-mw1249 and mw1227-mw1231

https://gerrit.wikimedia.org/r/582160

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 3 host(s) and their services with reason: decom

mw[1227-1229].eqiad.wmnet

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 2 host(s) and their services with reason: decom

mw[1230-1231].eqiad.wmnet

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 6 host(s) and their services with reason: decom

mw[1244-1249].eqiad.wmnet

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[1227-1229].eqiad.wmnet

  • mw1227.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1228.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1229.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[1230-1231].eqiad.wmnet

  • mw1230.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1231.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[1244-1247].eqiad.wmnet

  • mw1244.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1245.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1246.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1247.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[1248-1249].eqiad.wmnet

  • mw1248.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1249.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 582160 merged by Dzahn:
[operations/puppet@production] site/conftool: decom mw1244-mw1249 and mw1227-mw1231

https://gerrit.wikimedia.org/r/582160

Change 583114 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: decom mw125[0-3] and mw123[2-5]

https://gerrit.wikimedia.org/r/583114

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 4 host(s) and their services with reason: decom

mw[1232-1235].eqiad.wmnet

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 4 host(s) and their services with reason: decom

mw[1250-1253].eqiad.wmnet

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[1232-1235].eqiad.wmnet

  • mw1232.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1233.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1234.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1235.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[1250-1253].eqiad.wmnet

  • mw1250.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1251.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1252.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1253.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 583114 merged by Dzahn:
[operations/puppet@production] site: decom mw125[0-3] and mw123[2-5]

https://gerrit.wikimedia.org/r/583114

Change 583313 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: remove decom'ed appservers from rack D5

https://gerrit.wikimedia.org/r/583313

Change 583313 merged by Dzahn:
[operations/puppet@production] DHCP: remove decom'ed appservers from rack D5

https://gerrit.wikimedia.org/r/583313

Change 583377 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove IPs of recently decom'ed appservers in eqiad D5

https://gerrit.wikimedia.org/r/583377

Change 583575 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] decom mw1254 through mw1258, remaining rack D5 appservers

https://gerrit.wikimedia.org/r/583575

@Jclark-ctr @Cmjohnson @wiki_willy We (serviceops) are aware that currently there won't be onsite work except for emergencies. Additionally, we wanted to clarify that in the case of these old appservers we _do not actually want them to be deracked yet_. So please do nothing here for now and all is good. Thanks!

Dzahn changed the task status from Open to Stalled. Mar 26 2020, 2:28 PM

Setting to stalled. We are waiting at least until Monday before removing the remaining 5 servers in rack D5.

Thanks for the heads up, @Dzahn. @Jclark-ctr has been working on some of the other decom tasks this past week, but as long as this one doesn't show up on the eqiad workboard (project tagged with ops-eqiad), we should be fine. Also, the team is currently still available onsite approximately 4-8x per month, but visits are definitely less frequent now due to various restrictions Equinix has put in place. Thanks, Willy

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: decom

mw[1254-1258].eqiad.wmnet

Icinga downtime for 1 day, 0:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: decom

mw[1254-1258].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2020-03-31T15:35:34Z] <mutante> decom mw1254 through mw1258 (last remaining old servers in rack D5, depooled a while ago and average response time is again under 200ms) T247780

Dzahn changed the task status from Stalled to Open. Mar 31 2020, 3:36 PM

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw[1254-1258].eqiad.wmnet

  • mw1254.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1255.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1256.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1257.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1258.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 583575 merged by Dzahn:
[operations/puppet@production] decom mw1254 through mw1258, remaining rack D5 appservers

https://gerrit.wikimedia.org/r/583575

All mw servers in rack D5 are now decom'ed. A few non-mw servers in that rack were unaffected, but besides those D5 is mostly deactivated now.

Change 585185 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: remove mw1254-mw1258

https://gerrit.wikimedia.org/r/585185

@Cmjohnson @RobH

36 servers have been decom'ed: 30 in D5 and 6 in D4.

But the original procurement ticket https://rt.wikimedia.org/Ticket/Display.html?id=8786 and installation ticket https://rt.wikimedia.org/Ticket/Display.html?id=8862 claim there were 38 servers.

I wonder where the 2 missing ones are. Are they maybe thumbor1003 and thumbor1004 in rack D5, renamed from mw servers?

The installation ticket above said "32 of the 38 servers have been racked in D5. The remaining 6 will go somewhere else. Most likely D2 while unconventional row D has 3 10G racks which don't allow for 2 apache racks." but then went to 'resolved' without further comment.

Yes, it's thumbor1003 and thumbor1004, they are from the same procurement RT ticket.

36 of the 38 old servers from RT8786 have been decom'ed and the 2 thumbor servers are separate in T216815 or T233196.

There are a total of 187 mw1* servers. Of those 151 are in state "active" and 36 are in state "decommissioning".
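The count of 36 can be cross-checked against the cookbook runs recorded in this task. The sketch below expands the bracketed host expressions as they appear in the log and counts the distinct hostnames. The helper `expand` is a hypothetical stand-in written for this check, not the Cumin parser, and it only handles a single `[start-end]` range per expression.

```python
import re

# Matches the bracketed form used in the log, e.g. "mw[1238-1243].eqiad.wmnet".
RANGE_RE = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>.*)$"
)


def expand(expr: str) -> list[str]:
    """Expand one 'prefix[start-end]suffix' expression into hostnames."""
    match = RANGE_RE.match(expr)
    if match is None:
        return [expr]  # plain hostname, nothing to expand
    start, end = int(match["start"]), int(match["end"])
    return [f"{match['prefix']}{n}{match['suffix']}" for n in range(start, end + 1)]


# Host expressions copied from the cookbook runs in this task.
BATCHES = [
    "mw1221.eqiad.wmnet", "mw[1222-1226].eqiad.wmnet",      # rack D4
    "mw[1227-1229].eqiad.wmnet", "mw[1230-1231].eqiad.wmnet",
    "mw[1232-1235].eqiad.wmnet",
    "mw[1238-1239].eqiad.wmnet", "mw[1240-1243].eqiad.wmnet",
    "mw[1244-1247].eqiad.wmnet", "mw[1248-1249].eqiad.wmnet",
    "mw[1250-1253].eqiad.wmnet", "mw[1254-1258].eqiad.wmnet",
]

# A set removes hosts that were run more than once (mw1240, mw1253).
decommed = {host for expr in BATCHES for host in expand(expr)}
print(len(decommed))

# Names in the D5 range from the description that no cookbook run covered.
d5_range = set(expand("mw[1227-1258].eqiad.wmnet"))
print(sorted(d5_range - decommed))
```

Under these assumptions the distinct hosts add up to 36, and the only names in the mw1227 - mw1258 range without a cookbook run in this task are mw1236 and mw1237.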

All the ones from RT8786 are decom'ed. This completes the ticket.

Dzahn mentioned this in Unknown Object (Task). Apr 2 2020, 7:31 PM

Change 585185 merged by Dzahn:
[operations/puppet@production] DHCP: remove mw1254-mw1258

https://gerrit.wikimedia.org/r/585185

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw1253.eqiad.wmnet

  • mw1253.eqiad.wmnet (FAIL)
    • Host steps raised exception: Empty Management Password

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw1253.eqiad.wmnet

  • mw1253.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Found physical host
    • Skipped downtime management interface on Icinga (likely already removed)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Change 583377 merged by Dzahn:
[operations/dns@master] remove IPs of recently decom'ed appservers in eqiad D5

https://gerrit.wikimedia.org/r/583377

Change 595876 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove mw1254 - mw1258, they have been decom'ed

https://gerrit.wikimedia.org/r/595876

Change 595876 merged by Dzahn:
[operations/dns@master] remove mw1254 - mw1258, they have been decom'ed

https://gerrit.wikimedia.org/r/595876