
Decommission old eqiad logstash hardware hosts logstash100[456]
Open, Normal, Public

Description

Services have been migrated away from logstash100[456] and the hosts have been transitioned to role spare::system.

Ready for the hardware to be turned down and decommissioned per the standard process.

logstash1004:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted, as it will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) - asw-a-eqiad:ge-4/0/3
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update netbox with result
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.

logstash1005:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted, as it will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) - asw2-b-eqiad:ge-4/0/1
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update netbox with result
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.

logstash1006:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted, as it will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) - asw2-d-eqiad:ge-3/0/25
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update netbox with result
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.
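
The DebMonitor step in the checklists above is written for a single ${HOST_FQDN}, while this task covers three hosts named with the shorthand logstash100[456]. As a rough sketch only, the snippet below expands that shorthand into FQDNs and builds the matching DELETE URLs. The helper names (expand_hosts, debmonitor_delete_urls) are hypothetical, the eqiad.wmnet domain is taken from the hostnames that appear in this task, and only single-character bracket classes are handled (not ranges such as [4-6]). It makes no HTTP requests; per the checklist, wmf-decommission-host handles the real removal.

```python
import re
from itertools import product

# Base URL copied from the checklist step above.
DEBMONITOR_BASE = "https://debmonitor.discovery.wmnet/hosts/"


def expand_hosts(pattern, domain="eqiad.wmnet"):
    """Expand bracket classes such as 'logstash100[456]' into FQDNs.

    Hypothetical helper: each character inside [...] is one alternative.
    Ranges like [4-6] are NOT interpreted.
    """
    # re.split with a capture group alternates: literal, class, literal, ...
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = [list(p) if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) + "." + domain for combo in product(*choices)]


def debmonitor_delete_urls(pattern):
    """Build one DebMonitor host URL per expanded FQDN (no request is sent)."""
    return [DEBMONITOR_BASE + fqdn for fqdn in expand_hosts(pattern)]


if __name__ == "__main__":
    for url in debmonitor_delete_urls("logstash100[456]"):
        print(url)
```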

Event Timeline

herron triaged this task as Normal priority.Mar 4 2019, 2:56 PM
herron created this task.
RobH claimed this task.Mar 7 2019, 9:40 PM
RobH updated the task description. (Show Details)
RobH added subscribers: RobH, mark, faidon.

Decision on reclaim or decommission: These hosts were purchased on April 13, 2015, and support expired in April 2018. The systems are just shy of 4 years old. At 5 years, we would simply decommission when they go spare. A decision will need to be attached to this task from @faidon or @mark regarding whether to unrack these and dispose of them, or reclaim them as out-of-warranty spares.

RobH moved this task from Backlog to Decommission on the ops-eqiad board.Mar 7 2019, 9:41 PM
RobH updated the task description. (Show Details)
RobH added a comment.Mar 7 2019, 9:45 PM

Chatted with @faidon about this over IRC: we can dispose of these rather than reclaim them as spares. So they'll get added to the decom tracking sheets and unracked.

RobH updated the task description. (Show Details)Mar 7 2019, 9:46 PM

wmf-decommission-host was executed by robh for logstash1004.eqiad.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for logstash1005.eqiad.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for logstash1006.eqiad.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

Change 495142 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] logstash100[456] decommission

https://gerrit.wikimedia.org/r/495142

Change 495143 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom logstash100[456] prod dns

https://gerrit.wikimedia.org/r/495143

Change 495143 merged by RobH:
[operations/dns@master] decom logstash100[456] prod dns

https://gerrit.wikimedia.org/r/495143

Change 495142 merged by RobH:
[operations/puppet@production] logstash100[456] decommission

https://gerrit.wikimedia.org/r/495142

RobH reassigned this task from RobH to Cmjohnson.Mar 7 2019, 10:09 PM
RobH removed a project: Patch-For-Review.
RobH updated the task description. (Show Details)
Dzahn added a subscriber: Dzahn.Mar 27 2019, 10:03 AM

BEWARE. These hosts have not been removed from all places in puppet yet, though they are already gone from DNS. This caused issues on all logstash hosts today: when puppet reloaded the ferm rules due to an unrelated change, ferm failed to restart because it could no longer look up logstash1004 in DNS.

~/puppet$ grep -r logstash1004 *
hieradata/role/common/logstash.yaml:      - logstash1004.eqiad.wmnet
hieradata/role/common/logstash.yaml:      - logstash1004.eqiad.wmnet
hieradata/role/common/logstash/elasticsearch.yaml:      - logstash1004.eqiad.wmnet
hieradata/role/common/logstash/elasticsearch.yaml:      - logstash1004.eqiad.wmnet
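
The grep above could be turned into a check that runs before the DNS change is merged. The following is a minimal sketch under stated assumptions, not an existing tool: find_stale_references is a hypothetical helper that walks a local checkout of the puppet repo and reports every line that still mentions one of the hosts being decommissioned.

```python
import os


def find_stale_references(repo_root, hostnames):
    """Return sorted (relative_path, line_number, hostname) tuples.

    Hypothetical helper: plain substring matching, like `grep -r`.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Skip VCS metadata; in-place edit prunes the walk.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        for host in hostnames:
                            if host in line:
                                rel = os.path.relpath(path, repo_root)
                                hits.append((rel, lineno, host))
            except OSError:
                # Unreadable file (permissions, broken symlink): skip it.
                continue
    return sorted(hits)


if __name__ == "__main__":
    # Run from the root of a checkout; hostnames are the ones in this task.
    decom = ["logstash1004", "logstash1005", "logstash1006"]
    for rel, lineno, host in find_stale_references(".", decom):
        print(f"{rel}:{lineno}: still references {host}")
```

An empty result would suggest it is safe to remove the production DNS entries; any hit means a config consumer such as ferm may still try to resolve the name.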

Change 499433 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] logstash: remove logstash1004,1005,1006 from Hiera

https://gerrit.wikimedia.org/r/499433

Dzahn reassigned this task from Cmjohnson to herron.Mar 27 2019, 10:17 AM
Dzahn added a subscriber: Cmjohnson.

Change 499433 merged by Dzahn:
[operations/puppet@production] logstash: remove logstash1004,1005,1006 from Hiera

https://gerrit.wikimedia.org/r/499433

2019-03-27

    11:04 mutante: re-enabled puppet on logstash1007 through 1011 - then on logstash*
    10:53 mutante: enabling and running puppet on logstash1007
    10:49 mutante: disabling puppet on logstash* via cumin
06:54 <+icinga-wm> RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational
07:00 <+icinga-wm> RECOVERY - Check systemd state on logstash1008 is OK: OK - running: The system is fully operational
07:00 <+icinga-wm> RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational
07:02 <+icinga-wm> RECOVERY - Check systemd state on logstash1010 is OK: OK - running: The system is fully operational
07:04 <+icinga-wm> RECOVERY - Check systemd state on logstash1011 is OK: OK - running: The system is fully operational
Dzahn reassigned this task from herron to Cmjohnson.Mar 27 2019, 11:25 AM

now it should be ok to continue. at least i don't see the hosts in puppet repo anymore and the issue on logstash has been resolved
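
The failure mode reported earlier in this task (ferm could not restart because a hostname in its rules no longer resolved) can be checked for directly. This is only an illustrative sketch: unresolvable_hosts is a hypothetical helper, and the resolver is passed in as a parameter so the check can be exercised without touching real DNS. By default it uses the standard library's socket.gethostbyname.

```python
import socket


def unresolvable_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the subset of hostnames that fail to resolve.

    Hypothetical helper. `resolve` is any callable that raises OSError
    (socket.gaierror is a subclass) when a name cannot be looked up.
    """
    missing = []
    for host in hostnames:
        try:
            resolve(host)
        except OSError:
            missing.append(host)
    return missing


if __name__ == "__main__":
    # Hostnames as they appeared in the Hiera files quoted in this task.
    referenced = [
        "logstash1004.eqiad.wmnet",
        "logstash1005.eqiad.wmnet",
        "logstash1006.eqiad.wmnet",
    ]
    for host in unresolvable_hosts(referenced):
        print(f"{host} is referenced in config but does not resolve")
```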

RobH added a comment.Jul 24 2019, 7:16 PM

wipe is running on all 4 internal disks for T217556 and on the external usb disk for T212457.

RobH mentioned this in Unknown Object (Task).Jul 24 2019, 7:16 PM
Cmjohnson reassigned this task from Cmjohnson to Jclark-ctr.Aug 8 2019, 3:07 PM
Cmjohnson added a subscriber: Jclark-ctr.

@Jclark-ctr Please wipe logstash1004 and 1005 and then remove from rack and update netbox and the google tracking sheet.
https://docs.google.com/spreadsheets/d/1JhjeV3cXfIzIyekJrnA2nNFFDGTT4SeLmyAFvDa4HmA/edit#gid=2026042311

Cmjohnson updated the task description. (Show Details)Aug 8 2019, 3:07 PM
Dzahn removed a subscriber: Dzahn.Aug 8 2019, 10:23 PM
Jclark-ctr updated the task description. (Show Details)Aug 9 2019, 9:34 PM

@Jclark-ctr has this been done? We need the space in rack B2, so please make this a priority item. Thanks!

Jclark-ctr added a comment.EditedAug 20 2019, 6:45 PM

@Cmjohnson Finished wiping; I will be removing them from the rack shortly.

Change 531293 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt dns entries for logstash1000[4-6]

https://gerrit.wikimedia.org/r/531293

Change 531293 merged by Cmjohnson:
[operations/dns@master] Removing mgmt dns entries for logstash1000[4-6]

https://gerrit.wikimedia.org/r/531293

Jclark-ctr updated the task description. (Show Details)
Jclark-ctr updated the task description. (Show Details)Fri, Oct 11, 11:06 PM