Decommission lvs1007-1012
Open, Normal, Public

Description

The following hosts have been reimaged as spares and need to be decommissioned:

lvs1010
lvs1011
lvs1012

lvs1007-1009 are currently unreachable but still present in site.pp. Have they just been unplugged from the network or are they decommissioned already?

decom checklist

lvs1007:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's role from site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for follow-up on below steps.

Steps for DC-Ops:
The following steps must not be interrupted, as doing so will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host - please note this host is unreachable via production or mgmt network, and this step cannot be confirmed
  • - power down host - please note this host is unreachable via production or mgmt network, and this step cannot be confirmed
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)

lvs1008:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's role from site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for follow-up on below steps.

Steps for DC-Ops:
The following steps must not be interrupted, as doing so will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host - please note this host is unreachable via production or mgmt network, and this step cannot be confirmed
  • - power down host - please note this host is unreachable via production or mgmt network, and this step cannot be confirmed
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)

lvs1009:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's role from site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for follow-up on below steps.

Steps for DC-Ops:
The following steps must not be interrupted, as doing so will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host - please note this host is unreachable via production or mgmt network, and this step cannot be confirmed
  • - power down host - please note this host is unreachable via production or mgmt network, and this step cannot be confirmed
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)

lvs1010:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's role from site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for follow-up on below steps.

Steps for DC-Ops:
The following steps must not be interrupted, as doing so will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)

lvs1011:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's role from site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for follow-up on below steps.

Steps for DC-Ops:
The following steps must not be interrupted, as doing so will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)

lvs1012:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's role from site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for follow-up on below steps.

Steps for DC-Ops:
The following steps must not be interrupted, as doing so will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)
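
The DebMonitor removal step in the checklists above is the same command for every host, with only the FQDN changing. As a minimal sketch (not how wmf-decommission-host is implemented), the Python below builds that curl invocation for each of the six hosts and prints it without running anything. The helper name `debmonitor_delete_argv` is hypothetical, and the `eqiad.wmnet` domain is an assumption based on the host FQDNs recorded on this task.

```python
import shlex

DEBMONITOR_HOSTS_URL = "https://debmonitor.discovery.wmnet/hosts/"
CERT = "/etc/debmonitor/ssl/cert.pem"
KEY = "/etc/debmonitor/ssl/server.key"


def debmonitor_delete_argv(fqdn):
    """Return the argv list for the checklist's curl DELETE call (hypothetical helper)."""
    return [
        "sudo", "curl", "-X", "DELETE",
        DEBMONITOR_HOSTS_URL + fqdn,
        "--cert", CERT,
        "--key", KEY,
    ]


# lvs1007 through lvs1012; the eqiad.wmnet domain is an assumption.
hosts = [f"lvs{n}.eqiad.wmnet" for n in range(1007, 1013)]

for fqdn in hosts:
    # Print only; nothing is sent to DebMonitor here.
    print(shlex.join(debmonitor_delete_argv(fqdn)))
```

Building an argv list rather than a single string keeps the host name from being reinterpreted by a shell if the list is later passed to `subprocess.run`.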

Event Timeline

ema created this task. Nov 2 2018, 1:40 PM
Restricted Application added a project: Operations. Nov 2 2018, 1:40 PM
Restricted Application added a subscriber: Aklapper.
ema triaged this task as Normal priority. Nov 2 2018, 1:44 PM
ema moved this task from Triage to Hardware on the Traffic board.
ema added projects: decommission, ops-eqiad.
ayounsi added a subscriber: ayounsi. Nov 5 2018, 3:36 PM
Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board. Nov 7 2018, 1:49 PM
RobH updated the task description. Mar 6 2019, 7:25 PM
RobH added a subscriber: RobH.
RobH claimed this task. Mar 6 2019, 7:33 PM
RobH updated the task description.
RobH added a comment (edited). Mar 6 2019, 7:37 PM

lvs101[012] all exist on asw-c-eqiad (but have ports also reserved on asw2-c-eqiad):

robh@asw-c-eqiad> show interfaces descriptions | grep lvs1010 
xe-8/0/23       up    up   lvs1010
robh@asw-c-eqiad> show interfaces descriptions | grep lvs1011    
xe-8/0/24       up    up   lvs1011
robh@asw-c-eqiad> show interfaces descriptions | grep lvs1012    
xe-8/0/25       up    up   lvs1012
robh@asw2-c-eqiad> show interfaces descriptions | grep lvs1010 
xe-8/0/23                  lvs1010
robh@asw2-c-eqiad> show interfaces descriptions | grep lvs1011    
xe-8/0/24                  lvs1011
robh@asw2-c-eqiad> show interfaces descriptions | grep lvs1012    
xe-8/0/25                  lvs1012
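
The transcript above shows two row shapes: ports on asw-c-eqiad carry admin and link state, while the reserved ports on asw2-c-eqiad show only a description. A rough Python sketch for pulling those rows into records follows. It assumes the usual Interface / Admin / Link / Description column layout; the function name `parse_descriptions` is made up for illustration.

```python
import re

# Interface name, optional admin/link state columns, then the description.
LINE_RE = re.compile(
    r"^(?P<iface>\S+)\s+(?:(?P<admin>up|down)\s+(?P<link>up|down)\s+)?(?P<descr>.+?)\s*$"
)


def parse_descriptions(output):
    """Parse interface/description rows, skipping prompts and blank lines."""
    rows = []
    for line in output.splitlines():
        if not line.strip() or ">" in line.split(" ", 1)[0]:
            continue  # blank line or a "user@switch>" prompt line
        match = LINE_RE.match(line)
        if match:
            rows.append(match.groupdict())
    return rows


sample = """\
robh@asw-c-eqiad> show interfaces descriptions | grep lvs1010
xe-8/0/23       up    up   lvs1010
robh@asw2-c-eqiad> show interfaces descriptions | grep lvs1010
xe-8/0/23                  lvs1010
"""

for row in parse_descriptions(sample):
    # admin/link are None for rows that have no state columns.
    print(row["iface"], row["admin"], row["link"], row["descr"])
```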

Had to remove the LVS-balancer interface-range entirely, as these ports were its last members on the old switch stack.

[edit interfaces interface-range disabled]
     member xe-8/0/10 { ... }
+    member xe-8/0/23;
+    member xe-8/0/24;
+    member xe-8/0/25;
[edit interfaces]
-   interface-range LVS-balancer {
-       member-range xe-8/0/23 to xe-8/0/25;
-       unit 0 {
-           family ethernet-switching {
-               port-mode trunk;
-               vlan {
-                   members public1-c-eqiad;
-               }
-               native-vlan-id private1-c-eqiad;
-           }
-       }
-   }
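
One way to sanity-check a change like the one above is to confirm that every port dropped from a `member-range` reappears as a `member` of the `disabled` range. The sketch below does that for the ports in this diff. The helper `expand_member_range` is hypothetical and only handles ranges where the final port number varies.

```python
import re


def expand_member_range(spec):
    """Expand 'xe-8/0/23 to xe-8/0/25' into individual port names."""
    match = re.fullmatch(r"(\S+/)(\d+) to (\S+/)(\d+)", spec.strip())
    if match is None:
        raise ValueError(f"unrecognised member-range: {spec!r}")
    prefix, first, end_prefix, last = match.groups()
    if prefix != end_prefix:
        # Assumption: both ends share the same type-FPC/PIC prefix.
        raise ValueError("sketch only handles ranges that vary in the port number")
    return [f"{prefix}{port}" for port in range(int(first), int(last) + 1)]


# Range removed with LVS-balancer, and the members added to "disabled".
removed = expand_member_range("xe-8/0/23 to xe-8/0/25")
added_to_disabled = ["xe-8/0/23", "xe-8/0/24", "xe-8/0/25"]

print(removed == added_to_disabled)
```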
RobH added a comment. Mar 6 2019, 7:49 PM

Since this is lvs, they are on every switch stack =P

Row A: lvs101[012] don't show on either asw-a-eqiad or asw2-a-eqiad.

Row B: they don't show on asw2-b-eqiad; asw-b-eqiad is decommissioned.

Row D (asw2-d-eqiad, as asw-d-eqiad is fully decommissioned)

robh@asw2-d-eqiad> show interfaces descriptions | grep lvs1010 
xe-7/0/45       up    down lvs1010:eth3 {#3916}

{master:2}
robh@asw2-d-eqiad> show interfaces descriptions | grep lvs1011    
xe-7/0/46       up    down lvs1011:eth3 {#3917}

{master:2}
robh@asw2-d-eqiad> show interfaces descriptions | grep lvs1012    
xe-7/0/47       up    down lvs1012:eth3 {#3918}

robh@asw2-d-eqiad# show | compare 
[edit interfaces interface-range LVS-cross-row]
-    member-range xe-7/0/45 to xe-7/0/47;
[edit interfaces interface-range disabled]
     member ge-1/0/7 { ... }
+    member xe-7/0/45;
+    member xe-7/0/46;
+    member xe-7/0/47;
RobH updated the task description. Mar 7 2019, 5:01 PM
RobH added a comment. Mar 7 2019, 5:07 PM

lvs100[789] network port disabling:

robh@asw-c-eqiad# show | compare 
[edit interfaces interface-range LVS-cross-row]
-    member-range xe-8/0/26 to xe-8/0/28;
[edit interfaces interface-range disabled]
     member xe-8/0/25 { ... }
+    member xe-8/0/26;
+    member xe-8/0/27;
+    member xe-8/0/28;
robh@asw2-d-eqiad# show | compare 
[edit interfaces interface-range LVS-cross-row]
-    member-range xe-2/0/45 to xe-2/0/47;
[edit interfaces interface-range disabled]
     member ge-8/0/7 { ... }
+    member xe-2/0/45;
+    member xe-2/0/46;
+    member xe-2/0/47;
RobH updated the task description. Mar 7 2019, 5:12 PM

Change 494985 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] lvs1007-lvs1012 decommission

https://gerrit.wikimedia.org/r/494985

Change 494985 merged by RobH:
[operations/puppet@production] lvs1007-lvs1012 decommission

https://gerrit.wikimedia.org/r/494985

wmf-decommission-host was executed by robh for lvs1007.eqiad.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Skipped downtime host on Icinga (likely already removed)
  • Skipped downtime mgmt interface on Icinga (likely already removed)
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for lvs1008.eqiad.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Skipped downtime host on Icinga (likely already removed)
  • Skipped downtime mgmt interface on Icinga (likely already removed)
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for lvs1009.eqiad.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Skipped downtime host on Icinga (likely already removed)
  • Skipped downtime mgmt interface on Icinga (likely already removed)
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for lvs1010.eqiad.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for lvs1011.eqiad.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for lvs1012.eqiad.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

Change 494990 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom lvs1007-lvs1012 production dns entries

https://gerrit.wikimedia.org/r/494990

Change 494990 merged by RobH:
[operations/dns@master] decom lvs1007-lvs1012 production dns entries

https://gerrit.wikimedia.org/r/494990

RobH reassigned this task from RobH to Cmjohnson. Mar 7 2019, 5:33 PM
RobH removed a project: Patch-For-Review.
RobH updated the task description.