decom einsteinium
Open, Normal, Public

Description

einsteinium used to be the active Icinga server but has now been replaced by icinga1001. Start to decom it (next week).

https://netbox.wikimedia.org/dcim/devices/1592/


This task will track the decommission of server einsteinium.wikimedia.org

The steps in the first list below should be completed by the service owner who is returning the server to DC-ops (for reclaim to spare or decommissioning, depending on server configuration and age).

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's role in site.pp, replace with role(spare::system)
  • - remove einsteinium IP from the AQL whitelist
  • - unassign any owner from this task, check off completed steps.
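Before handing over, it can help to confirm that no puppet/hiera/dsh config still mentions the host. A minimal sketch, assuming a local checkout of the relevant repository; `find_refs` is a hypothetical helper name, not an existing tool:

```shell
# find_refs HOST DIR: list files under DIR that still mention HOST.
# Hypothetical helper; prints nothing when no references are left.
find_refs() {
    # -r recurse, -l print file names only, -F treat HOST as a fixed string
    grep -rlF -- "$1" "$2" || true
}
```

For example, `find_refs einsteinium /path/to/puppet-checkout` (the path is a placeholder) should print nothing once all references are gone.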

Steps for DC-Ops:

The following steps must not be interrupted, as stopping partway would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (including role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key

End non-interrupt steps.
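The puppet and debmonitor cleanup from the list above can be sketched as one helper. This is only an illustration: `run`, `decom_cleanup` and the `DRY_RUN` switch are hypothetical names, and which host each command has to be run on is not covered here. The curl invocation is the one given in the step above.

```shell
# DRY_RUN defaults to 1, so commands are only printed unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# decom_cleanup FQDN: puppet node cleanup, then remove the debmonitor entry.
decom_cleanup() {
    run puppet node clean "$1"
    run puppet node deactivate "$1"
    run sudo curl -X DELETE "https://debmonitor.discovery.wmnet/hosts/$1" \
        --cert /etc/debmonitor/ssl/cert.pem \
        --key /etc/debmonitor/ssl/server.key
}
```

Calling `decom_cleanup einsteinium.wikimedia.org` with the default `DRY_RUN=1` only lists the three commands, which allows a review before anything is changed.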

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)
Dzahn created this task. Fri, Nov 16, 11:54 PM
Restricted Application added a subscriber: Aklapper. Fri, Nov 16, 11:54 PM
Dzahn updated the task description. Fri, Nov 16, 11:57 PM
Dzahn triaged this task as Normal priority.
Dzahn changed the task status from Open to Stalled.

changed netbox status from Active to Staged

https://netbox.wikimedia.org/dcim/devices/1592/

Change 473278 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: remove einsteinium as an alerting_host

https://gerrit.wikimedia.org/r/473278

Change 473276 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: remove jessie support

https://gerrit.wikimedia.org/r/473276

Change 474390 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] decom einsteinium remove from netboot and DHCP

https://gerrit.wikimedia.org/r/474390

Change 474392 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove icinga-old.wikimedia.org

https://gerrit.wikimedia.org/r/474392

Dzahn moved this task from Backlog to In progress on the monitoring board. Mon, Nov 26, 4:01 PM

Change 474392 merged by Dzahn:
[operations/dns@master] remove icinga-old.wikimedia.org

https://gerrit.wikimedia.org/r/474392

Dzahn added a comment. Mon, Nov 26, 6:24 PM
  • removed alerting_host role from einsteinium
  • removed einsteinium from network::constants
  • removed from whitelisted hosts on lists server
  • removed from nagios-nrpe-server (NRPE) allowed hosts
  • removed as "icinga partner" in Hiera

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/473278/

Dzahn updated the task description. Mon, Nov 26, 6:30 PM
Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-operations) [2018-11-26T18:46:54Z] <mutante> removed allowed sender addresses from AQL (mail2SMS gateway) portal: @einsteinium @tegmen addresses T208824 T209738

Mentioned in SAL (#wikimedia-operations) [2018-11-26T18:50:46Z] <mutante> stopping icinga service on einsteinium, is a role(spare) now T209738

Change 475876 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] profile::icinga: stop using mysql module, rm jessie support

https://gerrit.wikimedia.org/r/475876

Change 475876 merged by Dzahn:
[operations/puppet@production] profile::icinga: stop using mysql module, rm jessie support

https://gerrit.wikimedia.org/r/475876

Change 473276 merged by Dzahn:
[operations/puppet@production] icinga: remove jessie support

https://gerrit.wikimedia.org/r/473276

Change 474390 merged by Dzahn:
[operations/puppet@production] decom einsteinium remove from netboot and DHCP

https://gerrit.wikimedia.org/r/474390

Dzahn removed Dzahn as the assignee of this task. Tue, Nov 27, 12:58 AM
Dzahn removed a project: Patch-For-Review.
Dzahn updated the task description.
Dzahn changed the task status from Stalled to Open.

Just a heads up that einsteinium is still running Icinga and contacting servers:

Nov 27 13:25:11 labstore1004 nrpe[18896]: Host 208.80.155.119 is not allowed to talk to us!
Nov 27 13:25:24 labstore1004 nrpe[18984]: Host 208.80.155.119 is not allowed to talk to us!
Nov 27 13:25:24 labstore1004 nrpe[18986]: Host 208.80.155.119 is not allowed to talk to us!
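To see which hosts are still being turned away by NRPE, log lines like the ones above can be summarised per source address. A rough sketch, assuming the lines arrive on stdin; `count_rejected` is a hypothetical helper name, and the location of the log is deliberately left out because it varies by host:

```shell
# count_rejected: read syslog lines on stdin, print "<count> <address>"
# for every host that NRPE refused to talk to.
count_rejected() {
    sed -n 's/.*Host \([0-9a-fA-F.:]*\) is not allowed to talk to us!.*/\1/p' \
        | sort | uniq -c | awk '{print $1, $2}'
}
```

Fed the three lines quoted above, this prints a single line for 208.80.155.119 with a count of 3.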

Mentioned in SAL (#wikimedia-operations) [2018-11-27T15:35:17Z] <mutante> einsteinium - stopped icinga, stopped nsca, stopped rsyncd, killall -u icinga, killall -u nagios ... T209738

Dzahn added a comment. Tue, Nov 27, 3:37 PM

@GTirloni Thanks! I was pretty sure I had already told it to stop the service, but it wouldn't listen. I had to insist and killed everything ^

Mentioned in SAL (#wikimedia-operations) [2018-11-27T16:17:14Z] <mutante> einsteinium - apt-get remove --purge icinga nsca; apt-get autoremove ; apt-get remove --purge icinga-doc icinga-common icinga-cgi-bin icinga-cgi; apt-get remove --purge monitoring-plugin* ; rm /etc/rsync.d/frag-icinga* T209738
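After a purge like the one logged above, a quick check for leftover monitoring packages may be useful. A minimal sketch, assuming a Debian host with `dpkg-query` available; `leftover_packages` is a hypothetical helper, and the package name prefixes are taken from the apt-get commands in the log entry:

```shell
# leftover_packages: read "<status><TAB><package>" lines on stdin and print
# the icinga/nsca/monitoring-plugin packages that are still fully installed.
leftover_packages() {
    awk -F'\t' '$1 == "install ok installed" && $2 ~ /^(icinga|nsca|monitoring-plugin)/ { print $2 }'
}

# Intended use on the host itself (not executed here):
#   dpkg-query -W -f='${Status}\t${Package}\n' | leftover_packages
```

Empty output would suggest that the purge removed everything matching those prefixes.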