
decom wtp1001-wtp1024
Closed, Resolved · Public

Description

wtp1001 to wtp1024 are old and have already been refreshed. wtp1025 to wtp1048 are the new machines and have been up and running successfully for days, so we can proceed with decommissioning wtp1001-wtp1024.

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp (replace with role::spare if system isn't shut down immediately during this process.)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked.
  • - mgmt dns entries removed.
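
The checklist above can be sketched in code to make the ordering constraint explicit. This is only an illustration, not tooling used for this task: the `Step` type, `expand_hosts`, and `can_pause` are hypothetical names, and the step descriptions are paraphrased from the list. It expands the host range and reports at which points work may be paused without stopping in the middle of the non-interruptible block.

```python
from dataclasses import dataclass


def expand_hosts(prefix: str, first: int, last: int) -> list[str]:
    """Return hostnames for an inclusive numeric range, e.g. wtp1001..wtp1024."""
    return [f"{prefix}{n}" for n in range(first, last + 1)]


@dataclass(frozen=True)
class Step:
    description: str
    interruptible: bool = True


# Paraphrased from the checklist; the Step type itself is hypothetical.
CHECKLIST = [
    Step("confirm services are out of production use"),
    Step("set icinga checks to maintenance/disabled"),
    Step("remove from lvs/pybal configuration"),
    Step("remove service group puppet/hiera/dsh config"),
    Step("remove from site.pp or replace with role::spare"),
    Step("disable puppet on host", interruptible=False),
    Step("remove remaining puppet references", interruptible=False),
    Step("power down host", interruptible=False),
    Step("disable switch port", interruptible=False),
    Step("remove production dns entries", interruptible=False),
    Step("puppet node clean and deactivate", interruptible=False),
    Step("wipe system disks (onsite)"),
    Step("unrack and update racktables (onsite)"),
    Step("remove switch port configuration"),
    Step("remove mgmt dns entries"),
]


def can_pause(steps: list[Step], done: int) -> bool:
    """True if work may stop after `done` completed steps.

    Pausing is refused only when both the last completed step and the
    next pending step belong to the non-interruptible block.
    """
    if done <= 0 or done >= len(steps):
        return True
    return steps[done - 1].interruptible or steps[done].interruptible


if __name__ == "__main__":
    hosts = expand_hosts("wtp", 1001, 1024)
    print(len(hosts), hosts[0], hosts[-1])
    for done in range(len(CHECKLIST) + 1):
        print(done, "pause ok" if can_pause(CHECKLIST, done) else "keep going")
```

With this model, pausing is allowed just before "disable puppet on host" and again after "puppet node clean and deactivate", but not in between.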

Event Timeline

akosiaris created this task. Oct 4 2017, 8:45 AM
Restricted Application added a subscriber: Aklapper. Oct 4 2017, 8:45 AM

The boxes are old enough (Jan 2013, soon to be 5 years old) to warrant full removal.

Change 382135 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Disable notifications for wtp1001-wtp1024

https://gerrit.wikimedia.org/r/382135

Change 382136 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] decom wtp1001-wtp1024

https://gerrit.wikimedia.org/r/382136

Change 382135 merged by Alexandros Kosiaris:
[operations/puppet@production] Disable notifications for wtp1001-wtp1024

https://gerrit.wikimedia.org/r/382135

Mentioned in SAL (#wikimedia-operations) [2017-10-04T10:18:51Z] <akosiaris> T177374, fully depool wtp1001-wtp1024

Change 382136 merged by Alexandros Kosiaris:
[operations/puppet@production] decom wtp1001-wtp1024

https://gerrit.wikimedia.org/r/382136

akosiaris updated the task description. Oct 4 2017, 10:44 AM
akosiaris updated the task description.

Change 382152 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Remove DNS entries for wtp1001-wtp1024

https://gerrit.wikimedia.org/r/382152

akosiaris updated the task description. Oct 4 2017, 11:10 AM

Change 382152 merged by Alexandros Kosiaris:
[operations/dns@master] Remove DNS entries for wtp1001-wtp1024

https://gerrit.wikimedia.org/r/382152

akosiaris reassigned this task from akosiaris to Cmjohnson. Oct 4 2017, 12:39 PM
Arlolra added a subscriber: Arlolra. Oct 4 2017, 8:41 PM

Parsing-Team could have used a ping here, since our canaries are still hardcoded to the decommissioned nodes.
https://github.com/wikimedia/mediawiki-services-parsoid-deploy/blob/master/scap/target-canary#L5-L6

Change 382255 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid/deploy@master] Update eqiad canaries to wtp1025 and wtp1026

https://gerrit.wikimedia.org/r/382255

Change 382255 merged by jenkins-bot:
[mediawiki/services/parsoid/deploy@master] Update eqiad canaries to wtp1025 and wtp1026

https://gerrit.wikimedia.org/r/382255

Change 382416 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] scap::dsh: Create the parsoid-canaries group

https://gerrit.wikimedia.org/r/382416

Change 382418 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[mediawiki/services/parsoid/deploy@master] Use parsoid-canaries dsh group

https://gerrit.wikimedia.org/r/382418

> Parsing-Team could have used a ping here, since our canaries are still hardcoded to the decommissioned nodes.
> https://github.com/wikimedia/mediawiki-services-parsoid-deploy/blob/master/scap/target-canary#L5-L6

Indeed I should have pinged. I completely forgot about the canaries and thought it would have been a noop.

Arguably, that information should not be in the parsoid deploy repo at all, because there is no reason for the software to have any direct information about where it is deployed. I've uploaded https://gerrit.wikimedia.org/r/382416, which creates a parsoid-canaries dsh group, mimicking the mediawiki-api-canaries group. The relevant parsoid deploy repo change is https://gerrit.wikimedia.org/r/382418. Unless you object, I'll go ahead and migrate to that scheme in the next few days.
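
As an illustration of the failure mode discussed here, the following is a minimal sketch of a check that flags canary entries still pointing at the decommissioned range. It is not part of either change; the function names and the example canary lists are hypothetical and do not reproduce the contents of target-canary or of the dsh group.

```python
import re

# Range decommissioned in this task: wtp1001 through wtp1024.
DECOM_FIRST, DECOM_LAST = 1001, 1024


def is_decommissioned(host: str) -> bool:
    """True if the short hostname falls inside the decommissioned wtp range."""
    short_name = host.split(".")[0]  # tolerate fully qualified names
    match = re.fullmatch(r"wtp(\d{4})", short_name)
    return bool(match) and DECOM_FIRST <= int(match.group(1)) <= DECOM_LAST


def stale_canaries(canaries: list[str]) -> list[str]:
    """Return the canary entries that refer to decommissioned hosts."""
    return [host for host in canaries if is_decommissioned(host)]


if __name__ == "__main__":
    # Hypothetical lists, for illustration only.
    print(stale_canaries(["wtp1001", "wtp1025"]))
    print(stale_canaries(["wtp1025", "wtp1026"]))
```

A check like this could run against whichever list holds the canaries, whether that list lives in the deploy repo or in a puppet-managed dsh group.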

Change 382416 merged by Alexandros Kosiaris:
[operations/puppet@production] scap::dsh: Create the parsoid-canaries group

https://gerrit.wikimedia.org/r/382416

Change 382418 merged by jenkins-bot:
[mediawiki/services/parsoid/deploy@master] Use parsoid-canaries dsh group

https://gerrit.wikimedia.org/r/382418

Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board. Oct 8 2017, 2:44 PM

FWIW, two days ago three hosts that were decommissioned as part of this task showed up in icinga (ack'd the alerts now):

wtp1018.mgmt  DOWN  2018-01-08 11:12:05  2d 15h 14m 11s  1/2  PING CRITICAL - Packet loss = 100%
wtp1018       DOWN  2018-01-08 11:11:36  2d 15h 0m 12s   1/2  PING CRITICAL - Packet loss = 100%
wtp1016       DOWN  2018-01-08 11:11:36  2d 15h 0m 2s    1/2  PING CRITICAL - Packet loss = 100%
wtp1015       DOWN  2018-01-08 11:11:36  2d 15h 0m 2s    1/2  PING CRITICAL - Packet loss = 100%

Mentioned in SAL (#wikimedia-operations) [2018-01-08T11:28:34Z] <godog> puppet node deactivate wtp10[568] - T177374

This was me last week. These servers have not gone through the decom steps yet and still have puppet running.

Yeah, this has uncovered an unfortunate issue in our decommissioning/reimaging process. See T184444 for more info.

ssastry moved this task from Backlog to Non-Parsoid Tasks on the Parsoid board. Jan 10 2018, 10:28 PM
RobH triaged this task as Normal priority. Feb 8 2018, 7:10 PM

Change 416707 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt dns wtp1001-1024

https://gerrit.wikimedia.org/r/416707

Change 420838 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt dns wtp1001-1024

https://gerrit.wikimedia.org/r/420838

Change 416707 abandoned by Cmjohnson:
Removing mgmt dns wtp1001-1024

Reason:
Duplicated

https://gerrit.wikimedia.org/r/416707

Change 420838 merged by Cmjohnson:
[operations/dns@master] Removing mgmt dns wtp1001-1024

https://gerrit.wikimedia.org/r/420838

Cmjohnson closed this task as Resolved. Mar 20 2018, 9:00 PM
Cmjohnson updated the task description.