kafkamon: upgrade to bullseye
Closed, Resolved, Public

Description

Tracking task for upgrading kafkamon hosts to bullseye

High level checklist:

  • Create bullseye VMs
    • kafkamon1003
    • kafkamon2003
  • Create kafka::monitoring_bullseye role
  • Validate config on bullseye
  • Cut prometheus configs over to kafka::monitoring_bullseye class
  • Retire buster kafkamon VMs

Event Timeline

herron created this task.

Change 912341 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafkamon: add bullseye role and node assignments

https://gerrit.wikimedia.org/r/912341

Change 912341 merged by Herron:

[operations/puppet@production] kafkamon: add bullseye role and node assignments

https://gerrit.wikimedia.org/r/912341

Change 912930 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafkamon: add monitoring bullseye yaml

https://gerrit.wikimedia.org/r/912930

Change 912930 merged by Herron:

[operations/puppet@production] kafkamon: add monitoring bullseye yaml

https://gerrit.wikimedia.org/r/912930

Change 912932 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafkamon: add monitoring bullseye clusters

https://gerrit.wikimedia.org/r/912932

Change 912932 merged by Herron:

[operations/puppet@production] kafkamon: add monitoring bullseye clusters

https://gerrit.wikimedia.org/r/912932

Change 912979 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafkamon: transition to firewall definition

https://gerrit.wikimedia.org/r/912979

Change 912979 merged by Herron:

[operations/puppet@production] kafkamon: transition to firewall definition

https://gerrit.wikimedia.org/r/912979

Change 914787 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafkamon: add kafkamon[12]003 to fw allow list

https://gerrit.wikimedia.org/r/914787

Change 914787 merged by Herron:

[operations/puppet@production] kafkamon: add kafkamon[12]003 to fw allow list

https://gerrit.wikimedia.org/r/914787

Change 914876 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafkamon: cut over to bullseye exporters

https://gerrit.wikimedia.org/r/914876

Mentioned in SAL (#wikimedia-operations) [2023-05-04T13:48:26Z] <herron> switching to bullseye kafka monitoring hosts T335424

Change 914876 merged by Herron:

[operations/puppet@production] kafkamon: cut over to bullseye exporters

https://gerrit.wikimedia.org/r/914876

Change 915694 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafkamon: cleanup buster classes

https://gerrit.wikimedia.org/r/915694

Change 915694 merged by Herron:

[operations/puppet@production] kafkamon: cleanup buster classes

https://gerrit.wikimedia.org/r/915694

herron claimed this task.
herron triaged this task as Medium priority.
herron updated the task description.

done!

fgiunchedi added a subscriber: fgiunchedi.

I noticed that puppet has been failing on kafkamon*002 (for 20 days!) and indeed the old VMs are still up:

(1) kafkamon2002.codfw.wmnet
----- OUTPUT of 'uptime' -----
 12:24:03 up 159 days, 20:04,  1 user,  load average: 0.69, 0.28, 0.23
(1) kafkamon1002.eqiad.wmnet
----- OUTPUT of 'uptime' -----
 12:24:03 up 159 days, 20:12,  0 users,  load average: 1.24, 0.78, 0.73
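
For anyone wanting to automate this kind of check, here is a minimal sketch that pulls hostnames and uptime days out of output shaped like the above. The grouping format is assumed from this one paste, and `uptime_days` / `still_up` are hypothetical helper names, not part of cumin or any existing tooling. Hosts up for less than a day print no "N days" field, so they are skipped.

```python
import re

# Sample copied from the output above; the layout (host header line,
# then an 'uptime' line) is an assumption based on this single example.
SAMPLE = """\
(1) kafkamon2002.codfw.wmnet
----- OUTPUT of 'uptime' -----
 12:24:03 up 159 days, 20:04,  1 user,  load average: 0.69, 0.28, 0.23
(1) kafkamon1002.eqiad.wmnet
----- OUTPUT of 'uptime' -----
 12:24:03 up 159 days, 20:12,  0 users,  load average: 1.24, 0.78, 0.73
"""

# "(1) hostname" header line, tolerating trailing whitespace.
HOST_RE = re.compile(r"^\(\d+\)\s+(\S+)\s*$")
# "up 159 days," fragment of an uptime line.
UPTIME_RE = re.compile(r"\bup\s+(\d+)\s+days?,")


def uptime_days(text):
    """Return {host: days_up} for each host header followed by an uptime line."""
    result = {}
    host = None
    for line in text.splitlines():
        header = HOST_RE.match(line)
        if header:
            host = header.group(1)
            continue
        uptime = UPTIME_RE.search(line)
        if uptime and host is not None:
            result[host] = int(uptime.group(1))
            host = None
    return result


def still_up(text, min_days=1):
    """Hypothetical helper: hosts that report at least min_days of uptime."""
    return sorted(h for h, d in uptime_days(text).items() if d >= min_days)


if __name__ == "__main__":
    for name in still_up(SAMPLE):
        print(name)
```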

Good catch, thanks. I had to look through my history on cumin1001 because I remember decomming these. Turns out I ran the cookbook with the -d (dry run) flag enabled 🤦‍♂️ Will re-run these decoms now.

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: kafkamon1002.eqiad.wmnet

  • kafkamon1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: kafkamon2002.codfw.wmnet

  • kafkamon2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook, run it manually

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: kafkamon2002.codfw.wmnet

  • kafkamon2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook, run it manually

ERROR: some step on some host failed, check the bolded items above

I think we're good here. The VM was decommed, but the netbox DNS cookbook run failed because multiple people were running it at the same time. The DNS record was removed, just not by this cookbook run.
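
Since a per-host PASS can sit next to a COMMON_STEPS FAIL in the same summary, a small parser can make the failed section easier to spot. This is only a sketch: the bullet layout is assumed from the summaries pasted in this task, the sample is abbreviated from them, and `parse_summary` / `failed_sections` are hypothetical names rather than anything provided by the cookbooks.

```python
import re

# Abbreviated from the decommission summary above; the two-level bullet
# layout is an assumption based on what was pasted into this task.
SAMPLE = """\
  • kafkamon2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • VM removed
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook, run it manually
"""

# Section header: "• <name> (PASS|FAIL)".
SECTION_RE = re.compile(r"^\s*•\s+(\S+)\s+\((PASS|FAIL)\)\s*$")
# Any other bullet is treated as a step of the most recent section.
STEP_RE = re.compile(r"^\s*•\s+(.*\S)\s*$")


def parse_summary(text):
    """Group step bullets under the section header that precedes them."""
    sections = []
    for line in text.splitlines():
        header = SECTION_RE.match(line)
        if header:
            sections.append(
                {"name": header.group(1), "status": header.group(2), "steps": []}
            )
            continue
        step = STEP_RE.match(line)
        if step and sections:
            sections[-1]["steps"].append(step.group(1))
    return sections


def failed_sections(text):
    """Return only the sections whose status is FAIL."""
    return [s for s in parse_summary(text) if s["status"] == "FAIL"]


if __name__ == "__main__":
    for section in failed_sections(SAMPLE):
        print(section["name"])
        for step in section["steps"]:
            print("  " + step)
```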