decom einsteinium
Closed, ResolvedPublic

Description

einsteinium used to be the active Icinga server but has now been replaced by icinga1001. Start to decom it (next week).

https://netbox.wikimedia.org/dcim/devices/1592/


This task will track the decommission-hardware of server einsteinium.wikimedia.org

The first 5 steps should be completed by the service owner that is returning the server to DC-ops (for reclaim to spare or decommissioning, dependent on server configuration and age.)

Steps for service owner:

  • all system services confirmed offline from production use
  • set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove site.pp, replace with role(spare::system)
  • remove einsteinium IP from the AQL whitelist
  • unassign any owner from this task, check off completed steps.

Steps for DC-Ops:
The following steps must not be interrupted, as doing so would leave the system in an unfinished state.
Start non-interrupt steps:

  • disable puppet on host
  • power down host
  • update status in netbox (inventory for decom, planned for spare)
  • disable switch port
  • switch port assignment noted on this task (for later removal) - asw2-d-eqiad:ge-3/0/7
  • remove all remaining puppet references (including role::spare)
  • remove production dns entries
  • puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • remove DebMonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.
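
The DebMonitor cleanup in the non-interrupt steps is a single HTTP DELETE authenticated with a client certificate. As a minimal sketch, assuming the same URL and certificate paths as the curl command in the checklist, the request could be built in Python roughly like this. The function names `build_debmonitor_delete` and `send` are hypothetical and not part of any existing tool; in practice wmf-decommission-host handles this step.

```python
import ssl
import urllib.request

# Values copied from the curl command in the checklist.
DEBMONITOR_BASE = "https://debmonitor.discovery.wmnet"
CERT_PATH = "/etc/debmonitor/ssl/cert.pem"
KEY_PATH = "/etc/debmonitor/ssl/server.key"


def build_debmonitor_delete(host_fqdn: str) -> urllib.request.Request:
    """Build (but do not send) the DELETE request for one host (hypothetical helper)."""
    if not host_fqdn or "/" in host_fqdn:
        raise ValueError(f"unexpected host name: {host_fqdn!r}")
    return urllib.request.Request(
        f"{DEBMONITOR_BASE}/hosts/{host_fqdn}", method="DELETE"
    )


def send(request: urllib.request.Request) -> int:
    """Send the request with the client certificate and return the HTTP status."""
    context = ssl.create_default_context()
    context.load_cert_chain(certfile=CERT_PATH, keyfile=KEY_PATH)
    with urllib.request.urlopen(request, context=context) as response:
        return response.status
```

Building the request separately from sending it makes it possible to inspect the target URL before anything is deleted, which matters for a step that must not be interrupted halfway.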

  • system disks wiped (by onsite)
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox status with result
  • IF DECOM: switch port configuration removed from switch once system is unracked.
  • IF DECOM: add system to decommission tracking google sheet
  • IF DECOM: mgmt dns entries removed.

Event Timeline

Dzahn changed the task status from Open to Stalled. (Nov 16 2018, 11:57 PM)
Dzahn triaged this task as Medium priority.
Dzahn updated the task description.

changed netbox status from Active to Staged

https://netbox.wikimedia.org/dcim/devices/1592/

Change 473278 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: remove einsteinium as an alerting_host

https://gerrit.wikimedia.org/r/473278

Change 473276 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: remove jessie support

https://gerrit.wikimedia.org/r/473276

Change 474390 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] decom einsteinium remove from netboot and DHCP

https://gerrit.wikimedia.org/r/474390

Change 474392 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove icinga-old.wikimedia.org

https://gerrit.wikimedia.org/r/474392

Change 474392 merged by Dzahn:
[operations/dns@master] remove icinga-old.wikimedia.org

https://gerrit.wikimedia.org/r/474392

  • removed alerting_host role from einsteinium
  • removed einsteinium from network::constants
  • removed from whitelisted hosts on lists server
  • removed from nagios-nrpe-server (NRPE) allowed hosts
  • removed as "icinga partner" in Hiera

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/473278/

Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-operations) [2018-11-26T18:46:54Z] <mutante> removed allowed sender addresses from AQL (mail2SMS gateway) portal: @einsteinium @tegmen addresses T208824 T209738

Mentioned in SAL (#wikimedia-operations) [2018-11-26T18:50:46Z] <mutante> stopping icinga service on einsteinium, is a role(spare) now T209738

Change 475876 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] profile::icinga: stop using mysql module, rm jessie support

https://gerrit.wikimedia.org/r/475876

Change 475876 merged by Dzahn:
[operations/puppet@production] profile::icinga: stop using mysql module, rm jessie support

https://gerrit.wikimedia.org/r/475876

Change 473276 merged by Dzahn:
[operations/puppet@production] icinga: remove jessie support

https://gerrit.wikimedia.org/r/473276

Change 474390 merged by Dzahn:
[operations/puppet@production] decom einsteinium remove from netboot and DHCP

https://gerrit.wikimedia.org/r/474390

Dzahn changed the task status from Stalled to Open. (Nov 27 2018, 12:58 AM)
Dzahn removed Dzahn as the assignee of this task.
Dzahn removed a project: Patch-For-Review.
Dzahn updated the task description.

Just a heads up that einsteinium is still running Icinga and contacting servers:

Nov 27 13:25:11 labstore1004 nrpe[18896]: Host 208.80.155.119 is not allowed to talk to us!
Nov 27 13:25:24 labstore1004 nrpe[18984]: Host 208.80.155.119 is not allowed to talk to us!
Nov 27 13:25:24 labstore1004 nrpe[18986]: Host 208.80.155.119 is not allowed to talk to us!
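
To see which hosts are still being rejected by NRPE, log lines like the ones above can be tallied per source IP. This is only a sketch: the regular expression is written against the three sample lines in this comment, and the function name `rejected_hosts` is made up for illustration.

```python
import re
from collections import Counter

# Matches the NRPE rejection message as it appears in the log excerpt above.
REJECTED = re.compile(r"nrpe\[\d+\]: Host (?P<ip>\S+) is not allowed to talk to us!")


def rejected_hosts(lines):
    """Count NRPE rejections per source IP (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = REJECTED.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


sample = [
    "Nov 27 13:25:11 labstore1004 nrpe[18896]: Host 208.80.155.119 is not allowed to talk to us!",
    "Nov 27 13:25:24 labstore1004 nrpe[18984]: Host 208.80.155.119 is not allowed to talk to us!",
    "Nov 27 13:25:24 labstore1004 nrpe[18986]: Host 208.80.155.119 is not allowed to talk to us!",
]

for ip, count in rejected_hosts(sample).items():
    print(ip, count)
```

Any IP that keeps showing up after a host was removed from the NRPE allowed hosts points at a monitoring process that is still running on that host.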

Mentioned in SAL (#wikimedia-operations) [2018-11-27T15:35:17Z] <mutante> einsteinium - stopped icinga, stopped nsca, stopped rsyncd, killall -u icinga, killall -u nagios ... T209738

@GTirloni Thanks! I was pretty sure I had already told it to stop the service, but it wouldn't listen. I had to insist and killed everything ^

Mentioned in SAL (#wikimedia-operations) [2018-11-27T16:17:14Z] <mutante> einsteinium - apt-get remove --purge icinga nsca; apt-get autoremove ; apt-get remove --purge icinga-doc icinga-common icinga-cgi-bin icinga-cgi; apt-get remove --purge monitoring-plugin* ; rm /etc/rsync.d/frag-icinga* T209738

Change 479857 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom einsteinium production dns entries

https://gerrit.wikimedia.org/r/479857

Change 479858 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom einsteinium

https://gerrit.wikimedia.org/r/479858

wmf-decommission-host was executed by robh for einsteinium.wikimedia.org and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

RobH removed a project: Patch-For-Review.
RobH updated the task description.
RobH updated the task description.
RobH added a project: ops-eqiad.
RobH added a subscriber: RobH.

ready for disk wipe and remainder of steps

Change 479857 merged by RobH:
[operations/dns@master] decom einsteinium production dns entries

https://gerrit.wikimedia.org/r/479857

Change 479858 merged by RobH:
[operations/puppet@production] decom einsteinium

https://gerrit.wikimedia.org/r/479858

papaul@asw2-d-eqiad# show | compare 
[edit interfaces]
-   ge-3/0/7 {
-       description einsteinium;
-       enable;
-   }
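
The diff above can be checked against the switch port noted in the task description (asw2-d-eqiad:ge-3/0/7). A rough sketch, assuming the simple stanza layout shown here and only the `ge`/`xe`/`et` interface prefixes; `removed_interfaces` is a hypothetical helper, not an existing tool:

```python
import re

# An interface stanza removed in `show | compare` output starts with "-   <name> {".
REMOVED_STANZA = re.compile(r"^-\s+((?:ge|xe|et)-\d+/\d+/\d+) \{", re.MULTILINE)


def removed_interfaces(diff_text):
    """Return the interface names whose stanza is removed in the diff."""
    return REMOVED_STANZA.findall(diff_text)


diff = """[edit interfaces]
-   ge-3/0/7 {
-       description einsteinium;
-       enable;
-   }
"""

expected_port = "ge-3/0/7"  # from the task description
if expected_port in removed_interfaces(diff):
    print("port config removed:", expected_port)
```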

Change 549898 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove mgmt entries for einsteinium

https://gerrit.wikimedia.org/r/549898

Change 549898 merged by Dzahn:
[operations/dns@master] remove mgmt entries for einsteinium

https://gerrit.wikimedia.org/r/549898