
Move kafkamon hosts to Debian Buster
Closed, ResolvedPublic

Description

The kafkamon hosts need to be upgraded to Buster. This is the current status:

elukey@kafkamon1001:~$ dpkg -l | grep burrow
ii  burrow                               1.1.0-1                           amd64        Kafka Burrow
ii  prometheus-burrow-exporter           0.0.5-1                           amd64        Prometheus exporter for the Kafka Burrow daemon.

Conveniently, Burrow is packaged for Buster (https://packages.debian.org/buster/burrow) at version 1.2.1. The prometheus exporter doesn't seem to be in Debian yet, but we are not far behind the latest upstream (https://github.com/jirwin/burrow_exporter/releases), so I'd say we'd just need to rebuild it for Buster (or copy the package; IIRC we built it for sid as a single Go binary blob).
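For the record, the version gap between the installed package and the Buster one can be sanity-checked locally; `sort -V` understands Debian-style version strings, so this is a quick sketch (versions copied from the dpkg output above):

```shell
# Installed burrow vs the version packaged in Buster
installed="1.1.0-1"
buster="1.2.1"
# sort -V orders version strings, so the older one sorts first
oldest=$(printf '%s\n%s\n' "$installed" "$buster" | sort -V | head -n1)
[ "$oldest" = "$installed" ] && echo "Buster package is newer"
```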

What I would do then is the following:

  1. Create kafkamon1002 and kafkamon2002 with Buster.
  2. Temporarily fork role::kafka::monitoring to role::kafka::monitoring_buster (or a similar approach) to avoid scraping duplicate metrics from the new hosts while they are being prepared.
  3. Deploy Burrow from the upstream Debian package, configure the hosts, and verify that everything works.
  4. Once we are ready, switch the prometheus masters to the new hosts.
  5. Clean up the forked role.
  6. Drop kafkamon1001/2001.
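Step 2 could look something like the sketch below; the profile and description names are assumptions (the real role lives in operations/puppet), the point being that the fork reuses the same profiles under a distinct role name so prometheus can target old and new hosts separately:

```puppet
# Hypothetical forked role, mirroring role::kafka::monitoring.
# Profile names here are illustrative, not taken from the real repo.
class role::kafka::monitoring_buster {
    include profile::base::production
    include profile::kafka::monitoring  # same profiles as the original role

    system::role { 'kafka::monitoring_buster':
        description => 'Kafka monitoring (Burrow) on Buster, pre-switchover',
    }
}
```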

Since this is some work that is shared between Analytics and SRE, should we try to do this work together to speed it up? Thoughts?

Event Timeline

Thanks for creating the task @elukey, working together on this SGTM

I added a step to the description to help avoid introducing duplicate metrics while both prod and buster upgrade hosts are running at the same time.

I'll volunteer for step 1, and we can go from there!

Thanks @herron! I had a chat with Cole on IRC to establish a clear ownership for these hosts, I am wondering if Observability could be a better owner than Analytics nowadays?

RLazarus triaged this task as Medium priority.

I copied prometheus-burrow-exporter to buster-wikimedia and tested it in a WMCS instance. Puppet runs successfully, and burrow from Buster fails as expected (I haven't tried with real kafka brokers yet). @elukey your plan looks good to me!

Burrow + exporter seem to be working as expected when connected to real kafka/zk; we are good to go with production VMs.
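A quick way to spot-check the exporter once it's scraping a real cluster is to filter its metrics output for lag. The metric names below are taken from the jirwin/burrow_exporter upstream and the sample values are made up, so treat this as a sketch rather than real output:

```shell
# Sample lines in the Prometheus exposition format the exporter emits;
# in production these would come from: curl -s http://localhost:<port>/metrics
metrics='kafka_burrow_partition_lag{cluster="main",topic="t1",partition="0"} 3
kafka_burrow_total_lag{cluster="main",group="consumer-g1"} 12'

# A consumer group is healthy when its total lag stays near zero;
# print the value field ($2) of every total_lag sample
printf '%s\n' "$metrics" | awk '/total_lag/ {print $2}'
```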

Change 618359 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafkamon: add role::kafka::monitoring_buster, assign kafkamon[12]002

https://gerrit.wikimedia.org/r/618359

Change 618359 merged by Herron:
[operations/puppet@production] kafkamon: add role::kafka::monitoring_buster, assign kafkamon[12]002

https://gerrit.wikimedia.org/r/618359

Change 619310 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] zookeeper: allow new kafkamon vms to contact zookeeper main clusters

https://gerrit.wikimedia.org/r/619310

Change 619310 merged by Herron:
[operations/puppet@production] zookeeper: allow new kafkamon vms to contact zookeeper main clusters

https://gerrit.wikimedia.org/r/619310

@herron Hi! What is the status of the task? Anything that I can help with?

Hey @elukey, prep work is done for the new hosts. Will be performing cut-over in the near future, will keep you on the cc.

Change 622836 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: begin scraping buster kafkamon hosts

https://gerrit.wikimedia.org/r/622836

Change 622836 merged by Herron:
[operations/puppet@production] prometheus: begin scraping buster kafkamon hosts

https://gerrit.wikimedia.org/r/622836

Change 623902 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: switch over to buster kafkamon hosts

https://gerrit.wikimedia.org/r/623902

Change 623902 merged by Herron:
[operations/puppet@production] prometheus: switch over to buster kafkamon hosts

https://gerrit.wikimedia.org/r/623902

The buster kafkamon hosts are now live. Will let them settle for a bit before moving on to cleanup/teardown of the old hosts.

@herron can this task be closed out, and possibly a new task created for cleaning up the old hosts if that work still needs to be done?

Change 713307 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] retire role::kafka::monitoring and kafkamon[12]001

https://gerrit.wikimedia.org/r/713307

Change 713307 merged by Herron:

[operations/puppet@production] retire role::kafka::monitoring and kafkamon[12]001

https://gerrit.wikimedia.org/r/713307

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: kafkamon1001.eqiad.wmnet

  • kafkamon1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: kafkamon2001.codfw.wmnet

  • kafkamon2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox

Old hosts have been retired and the duplicate role cleaned up, resolving!

@herron - I have a quick question.

In the task description it says:

Temporary fork role::kafka::monitoring to role::kafka::monitoring_buster

Are we intending to switch the role back from role::kafka::monitoring_buster to role::kafka::monitoring - or are we happy to leave it as role::kafka::monitoring_buster?

There's a cumin alias that references only the old role name here: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/templates/cumin/aliases.yaml.erb$172, and it's now causing periodic alerts in Icinga.

I'm wondering whether it would be best to update the alias to the new role, or change the name of the role back, along with the prometheus scrape targets.
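For the alias-update option, the fix would be a one-liner along these lines; the selector syntax is from memory (`O:` matching hosts by puppet role in our cumin grammar), so treat it as a hypothetical fragment rather than a ready diff:

```yaml
# modules/profile/templates/cumin/aliases.yaml.erb (hypothetical change):
# point the alias at the renamed role so the Icinga check matches hosts again
kafkamon: P{O:kafka::monitoring_buster}
```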

Change 714398 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] cumin: update kafka::monitoring alias

https://gerrit.wikimedia.org/r/714398

I opted to remove role::kafka::monitoring in favor of role::kafka::monitoring_buster so the config wouldn't be disrupted when retiring the old hosts. Will upload a patch to update the cumin alias.

Change 714398 merged by Herron:

[operations/puppet@production] cumin: update kafka::monitoring alias

https://gerrit.wikimedia.org/r/714398

Change 714593 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] cleanup kafkamon role description

https://gerrit.wikimedia.org/r/714593

Change 714593 merged by Herron:

[operations/puppet@production] cleanup kafkamon role description

https://gerrit.wikimedia.org/r/714593