
Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023
Open, Medium, Public

Description

We've finally shut down the old Analytics Kafka cluster. We've been trying to do this for 1.5 years!

Kafka has been shut down and removed. role::spare::system is applied and the related puppet code has been removed. The old Graphite-based Grafana dashboards have been removed.

I believe the next steps are those starting here https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Steps_for_DC-OPS_(with_network_switch_access).

Thank you!

kafka1012:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps cannot be interrupted, as an interruption would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Decommissioning (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) - asw2-a-eqiad:ge-2/0/17
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

kafka1013:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps cannot be interrupted, as an interruption would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Decommissioning (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) - asw2-a-eqiad:ge-2/0/18
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

kafka1014:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps cannot be interrupted, as an interruption would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Decommissioning (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) - asw2-c-eqiad:
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

kafka1020:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps cannot be interrupted, as an interruption would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Decommissioning (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

kafka1022:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps cannot be interrupted, as an interruption would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Decommissioning (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

kafka1023:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps cannot be interrupted, as an interruption would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Decommissioning (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
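The debmonitor step in the checklists above issues one DELETE request per host, with ${HOST_FQDN} filled in. As a minimal sketch, the following builds that command string for each of the six hosts. It only prints the commands and runs nothing. The helper name debmonitor_delete_command is hypothetical, and the eqiad.wmnet suffix is taken from the cookbook log entries on this task rather than from the checklist itself.

```python
# Hostnames come from the task title; the domain suffix is taken from the
# cookbook log entries on this task (e.g. kafka1012.eqiad.wmnet).
HOSTS = ["kafka1012", "kafka1013", "kafka1014", "kafka1020", "kafka1022", "kafka1023"]
DOMAIN = "eqiad.wmnet"
# Base URL and certificate paths are copied from the checklist step.
DEBMONITOR_BASE = "https://debmonitor.discovery.wmnet/hosts"


def debmonitor_delete_command(host: str) -> str:
    """Return the checklist's curl command for one host (hypothetical helper)."""
    fqdn = f"{host}.{DOMAIN}"
    return (
        f"sudo curl -X DELETE {DEBMONITOR_BASE}/{fqdn} "
        "--cert /etc/debmonitor/ssl/cert.pem "
        "--key /etc/debmonitor/ssl/server.key"
    )


if __name__ == "__main__":
    # Print only; nothing is sent to debmonitor.
    for host in HOSTS:
        print(debmonitor_delete_command(host))
```

Per the checklist, this step is handled by wmf-decommission-host, so the manual command should only matter if that tooling did not run.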

Details

Related Gerrit Patches:
[operations/dns@master] decom old kafka machines
[operations/puppet@production] decom old kafka brokers
[operations/dns@master] decom kafka10(1[234]|2[023]).eqiad.wmnet
[operations/puppet@production] profile::graphite::alerts: remove unused code
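One of the DNS patch titles above names the hosts with the pattern kafka10(1[234]|2[023]).eqiad.wmnet. The sketch below checks that the pattern covers exactly the six brokers in the task title. It assumes the pattern is read as a regular expression. The escaped dots and the full-string match are additions made here, not something the patch title specifies.

```python
import re

# Pattern from the patch title, with the dots escaped so they match
# literally (an assumption about how the title is meant to be read).
PATTERN = re.compile(r"kafka10(1[234]|2[023])\.eqiad\.wmnet")

# The six brokers listed in the task title, as FQDNs.
DECOMMISSIONED = [
    "kafka1012.eqiad.wmnet",
    "kafka1013.eqiad.wmnet",
    "kafka1014.eqiad.wmnet",
    "kafka1020.eqiad.wmnet",
    "kafka1022.eqiad.wmnet",
    "kafka1023.eqiad.wmnet",
]


def is_covered(fqdn: str) -> bool:
    """True if the whole FQDN matches the patch-title pattern."""
    return PATTERN.fullmatch(fqdn) is not None


if __name__ == "__main__":
    for n in range(1010, 1026):
        fqdn = f"kafka{n}.eqiad.wmnet"
        print(fqdn, is_covered(fqdn))
```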

Event Timeline

jbond triaged this task as Medium priority. Jun 25 2019, 3:38 PM
jbond added projects: DC-Ops, ops-eqiad.

Mentioned in SAL (#wikimedia-operations) [2019-06-26T05:59:58Z] <elukey> systemctl mask + reset-failed kafka on kafka10[12-23] - T226517
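The SAL entry above names its targets as kafka10[12-23]. Assuming the bracket denotes an inclusive numeric range appended to the prefix (an assumption about the shorthand, which the entry does not define), the sketch below expands it and confirms that the six brokers on this task fall inside it. The function expand_range is a hypothetical helper, not part of any tool mentioned on this task.

```python
import re

# The six brokers listed in the task title.
DECOMMISSIONED = ["kafka1012", "kafka1013", "kafka1014", "kafka1020", "kafka1022", "kafka1023"]


def expand_range(expr: str) -> list:
    """Expand 'prefix[a-b]' into prefix+a .. prefix+b, inclusive (hypothetical helper)."""
    match = re.fullmatch(r"(\w+?)\[(\d+)-(\d+)\]", expr)
    if match is None:
        raise ValueError(f"unsupported host expression: {expr!r}")
    prefix, start, end = match.group(1), match.group(2), match.group(3)
    width = len(start)  # keep any zero padding used in the expression
    return [f"{prefix}{str(n).zfill(width)}" for n in range(int(start), int(end) + 1)]


if __name__ == "__main__":
    expanded = expand_range("kafka10[12-23]")
    print(expanded)
    print(set(DECOMMISSIONED) <= set(expanded))
```

Under that reading the expression covers a wider span of host names than the six brokers being decommissioned here.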

elukey renamed this task from Reclaim/Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 to Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023. Jun 26 2019, 6:17 AM
elukey assigned this task to RobH.
elukey updated the task description.
elukey added a subscriber: RobH.

Change 519199 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::graphite::alerts: remove unused code

https://gerrit.wikimedia.org/r/519199

Change 519199 merged by Elukey:
[operations/puppet@production] profile::graphite::alerts: remove unused code

https://gerrit.wikimedia.org/r/519199

Cmjohnson moved this task from Backlog to UnRacking Tasks on the ops-eqiad board. Jun 27 2019, 4:25 PM
Cmjohnson moved this task from UnRacking Tasks to Decommission on the ops-eqiad board.
fdans moved this task from Incoming to Radar on the Analytics board. Jul 1 2019, 3:51 PM
RobH updated the task description. Jul 25 2019, 5:51 PM
RobH updated the task description. Jul 25 2019, 5:58 PM

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: kafka1012.eqiad.wmnet

  • kafka1012.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: kafka1013.eqiad.wmnet

  • kafka1013.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: kafka1014.eqiad.wmnet

  • kafka1014.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
RobH updated the task description. Jul 25 2019, 6:00 PM

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: kafka1020.eqiad.wmnet

  • kafka1020.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: kafka1022.eqiad.wmnet

  • kafka1022.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: kafka1023.eqiad.wmnet

  • kafka1023.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
RobH updated the task description. Jul 25 2019, 6:05 PM

Change 525613 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom kafka10(1[234]|2[023]).eqiad.wmnet

https://gerrit.wikimedia.org/r/525613

Change 525615 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom old kafka brokers

https://gerrit.wikimedia.org/r/525615

Change 525613 merged by RobH:
[operations/dns@master] decom kafka10(1[234]|2[023]).eqiad.wmnet

https://gerrit.wikimedia.org/r/525613

Change 525615 merged by RobH:
[operations/puppet@production] decom old kafka brokers

https://gerrit.wikimedia.org/r/525615

RobH updated the task description.
Cmjohnson reassigned this task from RobH to Jclark-ctr. Sep 19 2019, 8:44 PM
Cmjohnson added a subscriber: Cmjohnson.

John, please wipe the servers, remove them from the rack, and update netbox and the tracking sheet. Assign back to me once you finish so I can kill the switch ports.

Papaul added a subscriber: Papaul. Oct 8 2019, 3:40 AM

No switch port reference for kafka1014 and kafka1022 on asw2-c-eqiad or asw-c-eqiad.

Papaul updated the task description. Oct 8 2019, 3:41 AM
Jclark-ctr updated the task description. Oct 11 2019, 10:42 PM
Jclark-ctr updated the task description. Nov 28 2019, 12:09 AM

Change 559587 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom old kafka machines

https://gerrit.wikimedia.org/r/559587

Change 559587 merged by RobH:
[operations/dns@master] decom old kafka machines

https://gerrit.wikimedia.org/r/559587

RobH updated the task description.