
Decommission analytics100[1,2]
Open, Medium, Public

Description

analytics1001:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - set role(spare::system)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

analytics1002:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - set role(spare::system)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
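
The debmonitor removal step in the checklists above has to be repeated once per host. As a minimal sketch (not the tooling actually used on this task), the following Python expands the `analytics100[1,2]` shorthand and prints the corresponding curl command for each host. The helper names are hypothetical, and the `eqiad.wmnet` domain is assumed from the FQDNs that appear later in this task; nothing is executed.

```python
import re

# URL and certificate paths are copied from the checklist step above.
DEBMONITOR_URL = "https://debmonitor.discovery.wmnet/hosts/{fqdn}"
CERT = "/etc/debmonitor/ssl/cert.pem"
KEY = "/etc/debmonitor/ssl/server.key"


def expand_hosts(spec, domain):
    """Expand a spec like 'analytics100[1,2]' into a list of FQDNs.

    Hypothetical helper: only handles a single trailing [a,b,...] group.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([\d,]+)\]", spec)
    if match is None:
        return [f"{spec}.{domain}"]
    prefix, suffixes = match.group(1), match.group(2).split(",")
    return [f"{prefix}{suffix}.{domain}" for suffix in suffixes]


def debmonitor_delete_command(fqdn):
    """Build (but do not run) the curl argv for one host."""
    return [
        "sudo", "curl", "-X", "DELETE",
        DEBMONITOR_URL.format(fqdn=fqdn),
        "--cert", CERT,
        "--key", KEY,
    ]


# Domain is an assumption taken from the hostnames logged later in the task.
for fqdn in expand_hosts("analytics100[1,2]", "eqiad.wmnet"):
    print(" ".join(debmonitor_delete_command(fqdn)))
```

Building the argument list separately from running it keeps the sketch safe to try anywhere; the printed commands would still need to be run on a host that holds the debmonitor certificate and key.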

Details

Related Gerrit Patches:
operations/dns : master : decom analytics100[12] production dns
operations/puppet : production : decom analytics100[12]
operations/puppet : production : network::constants: remove analytics100[1,2]

Event Timeline

elukey triaged this task as Medium priority. Sep 26 2018, 6:51 AM
elukey created this task.
elukey added projects: ops-eqiad, Operations.

Change 462857 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] network::constants: remove analytics100[1,2]

https://gerrit.wikimedia.org/r/462857

Change 462857 merged by Elukey:
[operations/puppet@production] network::constants: remove analytics100[1,2]

https://gerrit.wikimedia.org/r/462857

elukey updated the task description. (Show Details) Sep 26 2018, 6:58 AM
elukey moved this task from Backlog to Keep an eye on it on the User-Elukey board. Sep 26 2018, 7:02 AM
Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board. Sep 27 2018, 3:15 PM
fdans moved this task from Incoming to Radar on the Analytics board. Oct 1 2018, 3:57 PM
elukey reassigned this task from elukey to RobH. Jan 7 2019, 3:20 PM
elukey changed the task status from Open to Stalled. Jan 7 2019, 3:25 PM
elukey claimed this task.
elukey added subscribers: faidon, RobH.

Didn't realize that the task was still assigned to me, apologies :)

This is a good thing though, since Analytics needs to test some very important Hadoop security settings on a testing cluster this quarter. We reached an agreement with @faidon to keep analytics1028->41 around for a couple more months (instead of decomming them) so that we can test a new configuration for the Hadoop Analytics cluster on them before applying it to the real one. Testing in cloud/labs is extremely difficult and wouldn't allow us to test all our use cases.

Would it be feasible to keep these two hosts around too to re-use them in the same testing scenario?

RobH added a subscriber: mark. Jan 7 2019, 3:48 PM

Didn't realize that the task was still assigned to me, apologies :)
Would it be feasible to keep these two hosts around too to re-use them in the same testing scenario?

These two systems will become 5 years old in March of 2019. I would NOT advise using hardware this old for anything, particularly since this was already slated for decommission.

Overall though, this is a decision for the SRE Directors: whether they want to allow 5+ year old hardware to stay in rotation. So you want to keep not just analytics1028->41, but analytics100[12] as well. We should get this approved by either @faidon or @mark, as it's using 5 year old hardware.

elukey changed the task status from Stalled to Open. Jan 7 2019, 3:53 PM
elukey reassigned this task from elukey to RobH.

Never mind then. I can easily use only analytics1028->41, so we are good to decom. Thanks!

RobH added a comment. Mar 6 2019, 11:13 PM

analytics1001:asw2-c-eqiad:ge-4/0/16
analytics1002:asw2-d-eqiad:ge-8/0/7

RobH added a comment. Mar 6 2019, 11:19 PM
[edit interfaces interface-range vlan-analytics1-d-eqiad]
     member xe-7/0/3 { ... }
+    member ge-9/0/4;
+    member ge-9/0/5;
+    member ge-9/0/6;
+    member ge-9/0/8;
+    member ge-9/0/9;
+    member ge-9/0/10;
+    member ge-9/0/11;
-    member "ge-8/0/[4-11]";
[edit interfaces interface-range disabled]
     member xe-7/0/47 { ... }
+    member ge-8/0/7;

wmf-decommission-host was executed by robh for analytics1001.eqiad.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for analytics1002.eqiad.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor
RobH removed RobH as the assignee of this task. Mar 6 2019, 11:21 PM
RobH removed a project: Patch-For-Review.
RobH updated the task description. (Show Details)

Change 494856 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom analytics100[12]

https://gerrit.wikimedia.org/r/494856

Change 494856 merged by RobH:
[operations/puppet@production] decom analytics100[12]

https://gerrit.wikimedia.org/r/494856

Change 494857 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom analytics100[12] production dns

https://gerrit.wikimedia.org/r/494857

elukey added a comment. Edited Mar 7 2019, 8:52 AM

Proposal for fix:

elukey@asw2-d-eqiad# show |compare
[edit interfaces interface-range vlan-analytics1-d-eqiad]
     member xe-7/0/3 { ... }
+    member ge-8/0/4;
+    member ge-8/0/5;
+    member ge-8/0/6;
+    member ge-8/0/8;
+    member ge-8/0/9;
+    member ge-8/0/10;
+    member ge-8/0/11;
-    member ge-9/0/4;
-    member ge-9/0/5;
-    member ge-9/0/6;
-    member ge-9/0/8;
-    member ge-9/0/9;
-    member ge-9/0/10;
-    member ge-9/0/11;
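
The corrected member list above is what you get by expanding the original `ge-8/0/[4-11]` range and dropping the one port that was moved to the `disabled` range (`ge-8/0/7`). A minimal Python sketch of that derivation follows, as a way to sanity-check such a change before committing it. The function names are hypothetical and the parser only handles a single trailing `[low-high]` group, which is an assumption rather than full Junos range syntax.

```python
import re


def expand_members(pattern):
    """Expand 'ge-8/0/[4-11]' into individual interface names.

    Patterns without a trailing [low-high] group are returned unchanged.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\]", pattern)
    if match is None:
        return [pattern]
    prefix = match.group(1)
    low, high = int(match.group(2)), int(match.group(3))
    return [f"{prefix}{n}" for n in range(low, high + 1)]


def members_after_disable(pattern, disabled):
    """Expand the range, then drop any port that is being disabled."""
    return [port for port in expand_members(pattern) if port not in disabled]


# The original range and the port disabled for analytics1002.
members = members_after_disable("ge-8/0/[4-11]", {"ge-8/0/7"})
for port in members:
    print(f"member {port};")
```

Comparing the printed members against the `show | compare` output would have flagged the `ge-9/0/*` entries, since the FPC number in the prefix never changes during expansion.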

Mentioned in SAL (#wikimedia-operations) [2019-03-07T09:15:10Z] <elukey> fixed vlan-analytics1-d-eqiad members on asw2-d-eqiad - T205507

Change 494857 merged by RobH:
[operations/dns@master] decom analytics100[12] production dns

https://gerrit.wikimedia.org/r/494857