decommission mw2097-mw2134
Closed, ResolvedPublic

Description

All these systems are in row B, and removing them should allow better distributing servers in the future.

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp (replace with role(spare::system) if system isn't shut down immediately during this process.)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (including role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked. (these never had the description set, see task comment history)
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)

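The task title uses the range shorthand mw2097-mw2134. As a minimal sketch (not part of the actual decom tooling), the Python below expands that shorthand into individual hostnames and builds the `puppet node clean` / `puppet node deactivate` commands named in the checklist. The helper names `expand_range` and `puppet_cleanup_commands`, and the `codfw.wmnet` domain suffix, are assumptions made for illustration.

```python
import re


def expand_range(spec):
    """Expand 'mw2097-mw2134' (or 'mw2097-2134') into a list of hostnames."""
    # The prefix on the right-hand side is optional but must match the left one.
    m = re.fullmatch(r"([a-z]+)(\d+)-(?:\1)?(\d+)", spec)
    if m is None:
        raise ValueError(f"unrecognised host range: {spec!r}")
    prefix, first, last = m.group(1), int(m.group(2)), int(m.group(3))
    if last < first:
        raise ValueError(f"range end before start: {spec!r}")
    width = len(m.group(2))  # keep any zero padding used in the spec
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]


def puppet_cleanup_commands(hosts, domain="codfw.wmnet"):
    """Return the per-host puppet cleanup commands from the checklist.

    The domain suffix is an assumption; adjust it to the real FQDNs.
    """
    commands = []
    for host in hosts:
        fqdn = f"{host}.{domain}"
        commands.append(f"puppet node clean {fqdn}")
        commands.append(f"puppet node deactivate {fqdn}")
    return commands


hosts = expand_range("mw2097-mw2134")
commands = puppet_cleanup_commands(hosts)
```

Printing `commands` one per line gives a list that can be reviewed before anything is run by hand; the sketch deliberately does not execute anything itself.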
Event Timeline

Joe triaged this task as Normal priority.Mar 7 2018, 12:08 PM
Joe created this task.
Papaul reassigned this task from Papaul to Joe.Mar 8 2018, 2:44 AM

@Joe this needs to be assigned first to someone with root access to do the first 2 steps. When complete, assign to me.

Thanks

RobH updated the task description.Mar 8 2018, 4:25 PM
Joe added a comment.Mar 12 2018, 9:17 AM

@Papaul thanks, doing it now!

Joe moved this task from Backlog to Doing on the User-Joe board.Mar 12 2018, 9:19 AM

Change 418874 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] codfw: decommission mw2097-mw2134

https://gerrit.wikimedia.org/r/418874

Mentioned in SAL (#wikimedia-operations) [2018-03-12T09:31:35Z] <_joe_> decommission mw2097-mw2134 from conftool T189111

Change 418874 merged by Giuseppe Lavagetto:
[operations/puppet@production] codfw: decommission mw2097-mw2134

https://gerrit.wikimedia.org/r/418874

Change 418880 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] codfw: remove stale references to mw2118-9

https://gerrit.wikimedia.org/r/418880

Change 418880 merged by Giuseppe Lavagetto:
[operations/puppet@production] codfw: remove stale references to mw2118-9

https://gerrit.wikimedia.org/r/418880

Mentioned in SAL (#wikimedia-operations) [2018-03-12T10:39:12Z] <_joe_> running decommission_appserver on mw2097-2134 T189111

Joe updated the task description.Mar 12 2018, 10:40 AM
Joe added a subscriber: Cmjohnson.Mar 12 2018, 10:49 AM

@Papaul I did the part I can do myself. I was told in the past not to do things in step 2 without coordination with dc-ops, as that messes up your process. Probably either @RobH or @Cmjohnson can help there?

Assigning to @RobH in order to get clarity on how to proceed.

Joe reassigned this task from Joe to RobH.Mar 12 2018, 10:50 AM

Mentioned in SAL (#wikimedia-operations) [2018-03-12T17:27:12Z] <_joe_> poweroff mw2097-2134, T189111

Change 418958 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] puppet: remove all references to mw2097-2134

https://gerrit.wikimedia.org/r/418958

Change 418958 merged by Giuseppe Lavagetto:
[operations/puppet@production] puppet: remove all references to mw2097-2134

https://gerrit.wikimedia.org/r/418958

Change 418960 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/dns@master] Decommission mw2097-mw2134

https://gerrit.wikimedia.org/r/418960

Change 418960 merged by Giuseppe Lavagetto:
[operations/dns@master] Decommission mw2097-mw2134

https://gerrit.wikimedia.org/r/418960

Joe updated the task description.Mar 12 2018, 6:09 PM
RobH added a comment.Mar 12 2018, 6:17 PM

So I reviewed this with @Joe, and we had to make the decision to skip some of the decom steps. Specifically, these systems don't have their network ports labeled on the switch. Additionally, though we have their MAC addresses, they aren't showing in the switch stack's ethernet switching table.

Example:

mw2097 is in rack b3. It has a mac of 90:B1:1C:25:95:57, but:

robh@asw-b-codfw> show ethernet-switching table 90:B1:1C:25:95:57

{master:2}
robh@asw-b-codfw>

So @Joe went ahead and killed all of these hosts by stopping puppet, preventing all services from starting automatically on reboot, and powering them down. We cannot really disable their switch ports without having @Papaul trace them, and it seems silly to add another step to this decom batch just for that.

Normally we'd insist on tracing each and disabling their ports, but most of the codfw mw systems were racked without the ports being labeled properly during the build out, and now we're paying for it.

I'll make another task to ensure all codfw switch ports are audited for their descriptions against their actual use.
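The MAC lookup in this comment could be repeated for every host with something like the rough sketch below: given text captured from `show ethernet-switching table`, it reports whether a host's MAC appears anywhere in it. The function names are hypothetical, and the comparison ignores case and separators as an assumption, since the switch may not print the address the way it is written here.

```python
import re


def normalize_mac(mac):
    """Return the MAC as 12 lowercase hex digits, dropping any separators."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def mac_in_output(mac, output):
    """True if `mac` appears anywhere in the captured switch output."""
    target = normalize_mac(mac)
    # Pick out anything shaped like xx:xx:xx:xx:xx:xx (or with dashes).
    candidates = re.findall(r"(?:[0-9a-fA-F]{2}[:-]){5}[0-9a-fA-F]{2}", output)
    return any(normalize_mac(c) == target for c in candidates)


# Output as pasted in this comment: the table came back empty.
empty_output = "{master:2}\nrobh@asw-b-codfw>"
found = mac_in_output("90:B1:1C:25:95:57", empty_output)
```

With the empty result shown above, `found` is `False`, which matches what was seen for mw2097.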

RobH reassigned this task from RobH to Papaul.Mar 12 2018, 6:18 PM

@Papaul: You can take this over for onsite disk wipes at this time.

RobH added a comment.Mar 12 2018, 7:12 PM

So a bunch of these just alerted in icinga:

11:59 < icinga-wm>  :  PROBLEM - Host mw2111 is DOWN: PING CRITICAL - Packet loss = 100%
11:59 < icinga-wm>  :  PROBLEM - Host mw2112 is DOWN: PING CRITICAL - Packet loss = 100%
11:59 < icinga-wm>  :  PROBLEM - Host mw2113 is DOWN: PING CRITICAL - Packet loss = 100%
11:59 < icinga-wm>  :  PROBLEM - Host mw2114 is DOWN: PING CRITICAL - Packet loss = 100%
11:59 < icinga-wm>  :  PROBLEM - Host mw2115 is DOWN: PING CRITICAL - Packet loss = 100%
11:59 < icinga-wm>  :  PROBLEM - Host mw2116 is DOWN: PING CRITICAL - Packet loss = 100%
11:59 < icinga-wm>  :  PROBLEM - Host mw2117 is DOWN: PING CRITICAL - Packet loss = 100%
11:59 < icinga-wm>  :  PROBLEM - Host mw2118 is DOWN: PING CRITICAL - Packet loss = 100%
11:59 < icinga-wm>  :  PROBLEM - Host mw2119 is DOWN: PING CRITICAL - Packet loss = 100%
11:59 < icinga-wm>  :  PROBLEM - Host mw2120 is DOWN: PING CRITICAL - Packet loss = 100%
11:59 < icinga-wm>  :  PROBLEM - Host mw2121 is DOWN: PING CRITICAL - Packet loss = 100%
11:59 < icinga-wm>  :  PROBLEM - Host mw2122 is DOWN: PING CRITICAL - Packet loss = 100%

Seems these still are in monitoring, and likely also in puppet still? (checking)
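To see at a glance which hosts are alerting, the pasted lines can be parsed with a short sketch like the one below. It only assumes the `PROBLEM - Host <name> is DOWN` wording visible in the lines above; the function name `down_hosts` is hypothetical.

```python
import re


def down_hosts(log_text):
    """Return the host names reported as DOWN, in order of first appearance."""
    seen = []
    for match in re.finditer(r"PROBLEM - Host (\S+) is DOWN", log_text):
        host = match.group(1)
        if host not in seen:
            seen.append(host)
    return seen


sample = """\
11:59 < icinga-wm>  :  PROBLEM - Host mw2111 is DOWN: PING CRITICAL - Packet loss = 100%
11:59 < icinga-wm>  :  PROBLEM - Host mw2112 is DOWN: PING CRITICAL - Packet loss = 100%
"""
alerting = down_hosts(sample)
```

Comparing `alerting` against the list of hosts being decommissioned shows which alerts are expected and can be acked against this task.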

BBlack added a subscriber: BBlack.Mar 12 2018, 7:32 PM

(I acked those with a ref to this ticket for now, to reduce overall icinga redness)

Joe moved this task from Doing to Blocked on others on the User-Joe board.Mar 13 2018, 5:14 PM
RobH moved this task from Backlog to Decommission on the ops-codfw board.Mar 15 2018, 5:13 PM
elukey moved this task from Backlog to Keep an eye on it on the User-Elukey board.Mar 16 2018, 2:46 PM
Papaul updated the task description.Mar 26 2018, 4:27 PM

@rob
on asw-b3-codfw any port from ge-3/0/20 up needs to be removed and disabled from the switch.
on asw-b4-codfw any port from ge-4/0/0 to ge-4/0/15 needs to be removed and disabled from the switch as well.

Papaul updated the task description.Mar 26 2018, 4:35 PM
Papaul removed Papaul as the assignee of this task.Mar 26 2018, 7:53 PM

@RobH if you have time, can you do the switch port section? When finished, assign back to me so I can finish the mgmt part.

Thanks.

Papaul assigned this task to RobH.Mar 26 2018, 7:53 PM
RobH reassigned this task from RobH to Papaul.Mar 26 2018, 8:02 PM

So none of those interfaces had descriptions set. I had to add them all into the config (though they were in use by mw systems, they were just not properly set up with port descriptions or individually enabled).

+   ge-3/0/20 {
+       disable;
+   }
+   ge-3/0/21 {
+       disable;
+   }
+   ge-3/0/22 {
+       disable;
+   }
+   ge-3/0/23 {
+       disable;
+   }
+   ge-3/0/24 {
+       disable;
+   }
+   ge-3/0/25 {
+       disable;
+   }
+   ge-3/0/26 {
+       disable;
+   }
+   ge-3/0/27 {
+       disable;
+   }
+   ge-3/0/28 {
+       disable;
+   }
+   ge-3/0/29 {
+       disable;
+   }
+   ge-3/0/30 {
+       disable;
+   }
+   ge-3/0/31 {
+       disable;
+   }
+   ge-3/0/32 {
+       disable;
+   }
+   ge-3/0/33 {
+       disable;
+   }
+   ge-3/0/34 {
+       disable;
+   }
+   ge-3/0/35 {
+       disable;
+   }
+   ge-3/0/36 {
+       disable;
+   }
+   ge-3/0/37 {
+       disable;
+   }
+   ge-3/0/38 {
+       disable;
+   }
+   ge-3/0/40 {
+       disable;
+   }
+   ge-3/0/41 {
+       disable;
+   }
+   ge-3/0/42 {
+       disable;
+   }
+   ge-4/0/0 {
+       disable;
+   }
+   ge-4/0/1 {
+       disable;
+   }
+   ge-4/0/2 {
+       disable;
+   }
+   ge-4/0/3 {
+       disable;
+   }
+   ge-4/0/4 {
+       disable;
+   }
+   ge-4/0/5 {
+       disable;
+   }
+   ge-4/0/6 {
+       disable;
+   }
+   ge-4/0/7 {
+       disable;
+   }
+   ge-4/0/8 {
+       disable;
+   }
+   ge-4/0/9 {
+       disable;
+   }
+   ge-4/0/10 {
+       disable;
+   }
+   ge-4/0/11 {
+       disable;
+   }
+   ge-4/0/12 {
+       disable;
+   }
+   ge-4/0/13 {
+       disable;
+   }
+   ge-4/0/14 {
+       disable;
+   }
+   ge-4/0/15 {
+       disable;
+   }

Since they had no descriptions, there is nothing to remove when the systems are unracked.
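Writing 38 near-identical stanzas by hand is error-prone, so here is a minimal sketch that generates text in the same shape as the diff above. The port lists are copied from that diff (ge-3/0/39 does not appear there, so it is left out here too); the function name `disable_stanzas` and the fixed PIC number 0 are assumptions for illustration, and the output would still need to be reviewed before being loaded on a switch.

```python
def disable_stanzas(fpc, ports, pic=0):
    """Build 'ge-<fpc>/<pic>/<port> { disable; }' stanzas, one per port."""
    lines = []
    for port in ports:
        lines.append(f"ge-{fpc}/{pic}/{port} {{")
        lines.append("    disable;")
        lines.append("}")
    return "\n".join(lines)


# Port lists as they appear in the diff above.
fpc3_ports = list(range(20, 39)) + [40, 41, 42]  # ge-3/0/20..38, 40..42
fpc4_ports = list(range(0, 16))                  # ge-4/0/0..15

config = disable_stanzas(3, fpc3_ports) + "\n" + disable_stanzas(4, fpc4_ports)
```

The two lists add up to 38 ports, one per host in mw2097-mw2134.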

Assigning back to @Papaul per his request!

RobH updated the task description.Mar 26 2018, 8:02 PM
RobH updated the task description.

Change 422054 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove mgmt DNS entries for mw2097-mw2134

https://gerrit.wikimedia.org/r/422054

Papaul updated the task description.Mar 26 2018, 8:33 PM

Change 422054 merged by Dzahn:
[operations/dns@master] DNS: Remove mgmt DNS entries for mw2097-mw2134

https://gerrit.wikimedia.org/r/422054

Papaul closed this task as Resolved.Mar 27 2018, 3:21 PM

Complete