decommission mw2075-2089 to make room for new systems
Closed, Resolved, Public

Description

10 new systems were ordered on parent task T154698. Those 10 systems will replace the following systems:

My suggestion would be that these 10 new systems should replace mw2075 - mw2089 functionally, and specifically:

  • 3 servers to replace the 5 API appservers mw2075-79 so in row A
  • 4 servers to replace 6 jobrunners mw2080-2085 so in row B
  • 3 servers to replace 4 imagescalers mw2086-2090 so in row B

Please note that while @Joe's range runs up to mw2090, mw2090 is not included here, as it isn't an imagescaler. (He also lists 4 imagescalers to be decommissioned, but the range includes 5 systems: 4 imagescalers and 1 general.)

After further discussion on that task, @Joe pointed out that the old systems can come offline in advance of the new ones. I (@RobH) will be pulling them for decommission today/tomorrow. Then @Papaul will take over for the on-site steps of wiping the disks and unracking the systems.

Please note when these systems are unracked, @Papaul may want to leave all cables in place for the new systems that will eventually go in those spots.

Servers for decommission: mw2075-2079 (api), mw2080-2085 (job), mw2086-2089 (image)

Steps for each system in decommissioning:

  • - disable all service level checks in icinga for hosts (done for all hosts mw2075-2089, set in maint/downtime)
  • - depool from pybal (sudo -i confctl select name=<fqdn hostname> set/pooled=no) (done for all hosts)
  • - disable puppet on hosts
  • - remove from puppet, includes: conftool-data, install_server, hiera - https://gerrit.wikimedia.org/r/#/c/330621/
  • - system shutdown
  • - pull production dns entries
  • - disable network port
  • - puppet node clean and deactivate
  • - salt key revoked
  • - hand off system to @Papaul for disk wipe.
  • - disk wiped
  • - systems unracked, racktables updated

Please note that the mgmt dns entries and the network port description are not removed until AFTER the system is unracked.

  • - remove mgmt dns entries
  • - remove description from switch port config
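
The per-host commands in the checklist above can be generated rather than typed by hand. The following is only a minimal sketch: the helper names `expand_range` and `decom_commands` are hypothetical, the `codfw.wmnet` domain is taken from the FQDN used elsewhere on this task, and the script only prints commands. It does not run them, and the puppet commands are assumed to be run wherever they are normally run for a decom.

```python
import re


def expand_range(spec, domain="codfw.wmnet"):
    """Expand a range such as 'mw2075-2089' or 'mw2075-79' into FQDNs.

    Hypothetical helper; the default domain is an assumption based on
    the mw2075.codfw.wmnet name used on this task.
    """
    match = re.fullmatch(r"([a-z]+)(\d+)-(\d+)", spec)
    if match is None:
        raise ValueError(f"unrecognised host range: {spec!r}")
    prefix, start, end = match.groups()
    # Allow the abbreviated form, e.g. 'mw2075-79' means mw2075..mw2079.
    if len(end) < len(start):
        end = start[: len(start) - len(end)] + end
    first, last = int(start), int(end)
    if last < first:
        raise ValueError(f"range end precedes start: {spec!r}")
    width = len(start)
    return [f"{prefix}{n:0{width}d}.{domain}" for n in range(first, last + 1)]


def decom_commands(fqdn):
    """Return the command strings from the checklist for a single host."""
    return [
        f"sudo -i confctl select name={fqdn} set/pooled=no",  # depool from pybal
        f"puppet node clean {fqdn}",
        f"puppet node deactivate {fqdn}",
    ]


if __name__ == "__main__":
    for host in expand_range("mw2075-2089"):
        for command in decom_commands(host):
            print(command)
```

Expanding the range also makes the count discrepancy noted above easy to check: `expand_range("mw2086-2090")` yields five hosts, not four.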
RobH created this task. Jan 4 2017, 10:31 PM

Mentioned in SAL (#wikimedia-operations) [2017-01-04T22:52:15Z] <robh> all my server depools and decoms for the mw range are on T154621

RobH added a comment. Jan 4 2017, 10:53 PM

depooled mw2075-2079, will get to the rest post-meeting.

RobH updated the task description. Jan 5 2017, 1:01 AM
RobH updated the task description. Jan 5 2017, 1:12 AM
RobH updated the task description. Jan 5 2017, 1:20 AM
RobH added a comment. Jan 5 2017, 1:22 AM

All systems have been depooled from pybal and should stop receiving load. I'll disable and shut down the systems tomorrow, and then continue with the decommissioning.

This will leave them in the state that they can be wiped by midday tomorrow (once I finish the other steps before wipe.)

RobH updated the task description. Jan 5 2017, 5:38 PM

Mentioned in SAL (#wikimedia-operations) [2017-01-05T17:57:42Z] <robh> shutting down mw2075-2089 for decom per T154621

RobH updated the task description. Jan 5 2017, 6:14 PM
RobH added a comment. Edited Jan 5 2017, 6:20 PM

The port info for mw2079 onwards is missing from the switch stacks, so I'm not sure which exact network ports to disable.

@Papaul:

Please list off the network ports for the following systems so I can disable them:

mw2079
mw2080
mw2081
mw2082
mw2083
mw2084
mw2085
mw2086
mw2087
mw2088
mw2089

I need to disable their ports, so if they power up, they won't call back in and start trying to operate/call to puppet.

Once the network ports are disabled, they can be powered up and wiped.

This morning a deployment by Ariel of a mw-config throttling change failed since scap tried to connect to mw2080-mw2085, which have been powered down. They're still listed in conftool-data/eqiad.yaml (sic) and need to be removed as well. There's several further codfw hosts listed in eqiad.yaml; I'm not sure if that's for a technical reason, but otherwise we should move them to codfw.yaml to reduce confusion.

Papaul added a comment. Jan 6 2017, 3:07 PM

mw2079 ge-4/0/38
mw2080 ge-3/0/0
mw2081 ge-3/0/1
mw2082 ge-3/0/2
mw2083 ge-3/0/3
mw2084 ge-3/0/4
mw2085 ge-3/0/5
mw2086 ge-3/0/6
mw2087 ge-3/0/7
mw2088 ge-3/0/8
mw2089 ge-3/0/9
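
For reference, the port list above can be turned into per-port disable statements. This is a sketch under assumptions: the `ge-x/y/z` names suggest Juniper-style interfaces, so it emits Junos `set interfaces <port> disable` statements, and `disable_statements` is a hypothetical helper. It leaves port descriptions alone, since those stay in place until the systems are unracked.

```python
# Host-to-port mapping copied from the list above.
PORTS = {
    "mw2079": "ge-4/0/38",
    "mw2080": "ge-3/0/0",
    "mw2081": "ge-3/0/1",
    "mw2082": "ge-3/0/2",
    "mw2083": "ge-3/0/3",
    "mw2084": "ge-3/0/4",
    "mw2085": "ge-3/0/5",
    "mw2086": "ge-3/0/6",
    "mw2087": "ge-3/0/7",
    "mw2088": "ge-3/0/8",
    "mw2089": "ge-3/0/9",
}


def disable_statements(ports):
    """Return (host, statement) pairs that disable each host's switch port.

    Assumes Junos configuration syntax; adjust for other switch vendors.
    """
    return [(host, f"set interfaces {port} disable") for host, port in ports.items()]


if __name__ == "__main__":
    for host, statement in disable_statements(PORTS):
        print(f"{host}: {statement}")
```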

RobH added a comment. Jan 6 2017, 4:35 PM

This morning a deployment by Ariel of a mw-config throttling change failed since scap tried to connect to mw2080-mw2085, which have been powered down. They're still listed in conftool-data/eqiad.yaml (sic) and need to be removed as well. There's several further codfw hosts listed in eqiad.yaml; I'm not sure if that's for a technical reason, but otherwise we should move them to codfw.yaml to reduce confusion.

Sorry about that, fixed!

RobH updated the task description. Jan 6 2017, 4:47 PM
RobH renamed this task from "decommission old mw appservers to make room for new systems" to "decommission mw2075-2089 to make room for new systems". Jan 6 2017, 5:05 PM
RobH reassigned this task from RobH to Papaul.
RobH updated the task description.

This task is now assigned to @Papaul for the disk wipes. Once the disks are wiped and the systems are pulled from the racks, I'll remove their network port entries and the mgmt entries can be removed.

Please update task when systems have been unracked and assign back to me for the above. Thanks!

RobH updated the task description. Jan 6 2017, 5:06 PM

@RobH Joe mentioned "3 servers to replace 4 imagescalers mw2086-2090 so in row B" but we didn't decommission mw2090

RobH added a comment. Edited Jan 10 2017, 8:25 PM

correct, that seems to be a typo on his part. He mentioned imagescalers, and 4 of them, but that is 5 systems listed. mw2090 is not an image scaler, and should be left in service.

RobH updated the task description. Jan 10 2017, 8:33 PM
Papaul updated the task description. Jan 12 2017, 1:59 AM
Papaul reassigned this task from Papaul to RobH.
RobH closed this task as Resolved. Jan 18 2017, 6:31 PM
RobH updated the task description.

mw2075 was still shown in servermon, I ran "puppet node clean mw2075.codfw.wmnet" and "puppet node deactivate mw2075.codfw.wmnet" to fix it.