Upgrade kafka-main nodes to buster
Closed, Resolved · Public

Description

The kafka main nodes are split between buster and stretch:

  • kafka-main[12]00[1-3] are on stretch
  • kafka-main[12]00[45] are on buster

The [45] nodes were added as part of T225005 and got Buster straight away; the other ones are still to do. In theory we could do the following:

  1. Create partman re-use recipes for kafka main, to keep the /srv partition while reimaging (rough sketch below)
  2. Reimage one node at a time to Buster
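
For reference, the install_server mapping is essentially a shell case entry pointing a host pattern at a partman recipe; a rough hypothetical sketch (exact file path, host pattern and recipe filename are assumptions, the real change is in the patch below):

# modules/install_server/files/autoinstall/netboot.cfg (hypothetical excerpt)
# keep the /srv partition across reinstalls for the kafka-main hosts
kafka-main[12]00[1-5]) echo partman/custom/reuse-kafka-main.cfg ;; \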

From past experience with Kafka Jumbo it seems fine to keep the list of brokers that clients use in their config intact, since they should be able to react to unreachable brokers without issues. I'd avoid Bullseye for the moment since we already have the puppet code working for Buster, and it should be quick and easy to upgrade the remaining nodes.
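
As a concrete safety check between reimages, something like the following with the stock Kafka CLI (the wrapper name and connection string on our brokers may differ) confirms the cluster has fully re-replicated before we take down the next node:

# no output means no under-replicated partitions: safe to proceed with the next broker
kafka-topics.sh --describe --under-replicated-partitions --zookeeper "$ZOOKEEPER_CONNECT"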

I can take care of it if you like the idea :)

Event Timeline

Change 742926 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] install_server: add reuse recipe for kafka-main hosts

https://gerrit.wikimedia.org/r/742926

Another interesting thing to consider is:

===== NODE GROUP =====
(4) kafka-main[2001-2003].codfw.wmnet,kafka-main1001.eqiad.wmnet
----- OUTPUT of 'id kafka' -----
uid=497(kafka) gid=498(kafka) groups=498(kafka)
===== NODE GROUP =====
(26) kafka-jumbo[1001-1009].eqiad.wmnet,kafka-logging[2001-2003].codfw.wmnet,kafka-logging[1001-1003].eqiad.wmnet,kafka-main[2004-2005].codfw.wmnet,kafka-main[1002-1005].eqiad.wmnet,kafka-test[1006-1010].eqiad.wmnet
----- OUTPUT of 'id kafka' -----
uid=499(kafka) gid=499(kafka) groups=499(kafka)

So even if we preserve /srv, the reimage will likely end up with kafka files owned by the wrong user (the old kafka uid from before the OS reinstall). A chown -R easily solves the issue, but we could think about standardizing the kafka user in puppet and applying a common id across our clusters.
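
For example (sketch): after the reimage the preserved files show a stale numeric uid, and a recursive chown puts things right:

# files on the preserved partition are owned by the old numeric uid after reinstall
ls -ln /srv/kafka/data | head
# re-own everything to the freshly created kafka user
sudo chown -R kafka:kafka /srv/kafka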

Change 742926 merged by Elukey:

[operations/puppet@production] install_server: add reuse recipe for kafka-main hosts

https://gerrit.wikimedia.org/r/742926

Change 742964 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] install_server: rename reuse-kafka-main recipe

https://gerrit.wikimedia.org/r/742964

Change 742964 merged by Elukey:

[operations/puppet@production] install_server: rename reuse-kafka-main recipe

https://gerrit.wikimedia.org/r/742964

As far as I can see, the kafka-main[12]00[45] nodes running buster don't have any buster-specific bits, so there seems to be no need to do anything other than reimage the nodes.
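
One way to double-check is to diff the installed packages between a stretch broker and a buster one, e.g. via cumin (host pair picked as an example):

# compare kafka/java packages on a stretch broker vs a buster one
sudo cumin 'kafka-main[2003-2004].codfw.wmnet' "dpkg -l | grep -Ei 'kafka|openjdk'"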

Change 742969 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] install_server: set test reuse recipe for kafka-main2003

https://gerrit.wikimedia.org/r/742969

The layout of files under /srv is pretty simple: the kafka dir is the only relevant one, and all files are owned by kafka:kafka. A quick chown -R kafka:kafka should be sufficient after the reimage.
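
A quick sanity check afterwards (sketch): anything under /srv still not owned by kafka:kafka shows up, so empty output means we are done:

sudo find /srv \( -not -user kafka -o -not -group kafka \) -print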

An alternative could be to:

  1. Select a fixed uid/gid in puppet's admin module (reserved there as a comment)
  2. Create a profile::kafka::user that is enabled by a hiera flag, to deploy a kafka user with the fixed uid/gid.
  3. Disable puppet on the target node to reimage, and enable the hiera flag for it.
  4. Stop/mask kafka, modify the kafka user/group (via usermod etc.), then reimage

After the above, the host should come up without any need of chown (a rough per-host sketch is below). Once all brokers have transitioned to the new scheme, profile::kafka::user will no longer be needed.
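
A rough sketch of the per-host sequence (uid/gid 916 as picked later in this task; the disable-puppet helper name is an assumption):

# stop puppet so it doesn't recreate state mid-change
disable-puppet "T296641 - move kafka to fixed uid/gid"
# make sure kafka can't restart while the ids are being changed
systemctl stop kafka && systemctl mask kafka
groupmod -g 916 kafka
usermod -u 916 -g 916 kafka
# then flip the hiera flag and kick off the reimage cookbook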

Change 742969 merged by Elukey:

[operations/puppet@production] install_server: set test reuse recipe for kafka-main2003

https://gerrit.wikimedia.org/r/742969

Change 743130 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] admin: reserve kafka uid/gid

https://gerrit.wikimedia.org/r/743130

Change 743150 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add profile::kafka::user to Kafka Brokers and Mirror Makers instances

https://gerrit.wikimedia.org/r/743150

Change 743163 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move kafka test to fixed gid/uid for user kafka

https://gerrit.wikimedia.org/r/743163

@elukey would it facilitate the work if we disabled eventgate-main in eqiad during the work?

I would keep eventgate pooled while doing the maintenance; on paper there shouldn't be any impact, since a good kafka client should behave fine in these use cases. If you are ok with it, I'd keep ready-to-execute depool commands handy (if we start from main-codfw, to depool eventgate main codfw, etc.) and observe eventgate's behavior during the first reimage. If it fails for any reason, then something is really wrong and needs to be fixed (sort of chaos engineering; better to see it failing while we are all watching rather than during a weekend). Lemme know your thoughts :)
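
For the ready-to-execute part, the depool/repool would be a conftool discovery toggle along these lines (object names assumed; sketch only):

# depool eventgate-main in codfw at the discovery layer, and the revert
confctl --object-type discovery select 'dnsdisc=eventgate-main,name=codfw' set/pooled=false
confctl --object-type discovery select 'dnsdisc=eventgate-main,name=codfw' set/pooled=true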

Change 743130 merged by Elukey:

[operations/puppet@production] admin: reserve kafka uid/gid

https://gerrit.wikimedia.org/r/743130

elukey@kafka-main1001:~$ sudo find / \( -path /proc -o -path /mnt -o -path /sys -o -path /dev -o -path /media -o -path /srv/kafka/data \) -prune -false -o -user kafka
/etc/kafka/ssl
/etc/kafka/ssl/kafka_main-eqiad_broker.keystore.jks
/etc/kafka/ssl/truststore.jks
/etc/kafka/mirror
/etc/kafka/mirror/ssl
/etc/kafka/mirror/ssl/kafka_mirror_maker.keystore.jks
/etc/kafka/mirror/ssl/truststore.jks
/var/log/kafka
/var/log/kafka/controller.log.2
/var/log/kafka/controller.log.4
/var/log/kafka/controller.log.1
/var/log/kafka/server.log.3
/var/log/kafka/kafka-mirror-main-codfw_to_main-eqiad@0.log.1
/var/log/kafka/kafkaServer-gc.log.0
/var/log/kafka/controller.log.3
/var/log/kafka/server.log
/var/log/kafka/server.log.1
/var/log/kafka/kafkaServer-gc.log.3
/var/log/kafka/kafka-mirror-main-codfw_to_main-eqiad@0.log
/var/log/kafka/server.log.2
/var/log/kafka/kafka-request.log
/var/log/kafka/state-change.log
/var/log/kafka/kafka-authorizer.log
/var/log/kafka/kafkaServer-gc.log.2.current
/var/log/kafka/kafkaServer-gc.log.4
/var/log/kafka/kafkaServer-gc.log.1
/var/log/kafka/controller.log
/var/log/kafka/log-cleaner.log
/srv/kafka/data
/tmp/snappy-1.1.7-a8408a72-adcb-4d31-9219-e7b5addb32db-libsnappyjava.so
/tmp/hsperfdata_kafka
/tmp/hsperfdata_kafka/89989
/tmp/hsperfdata_kafka/88568
/tmp/snappy-1.1.7-1847af45-909a-4847-9230-5e99b73641d8-libsnappyjava.so

We can reuse something like T231067#6851675 to change the uid/gid and chown files automatically, node by node:

#!/bin/bash
# Move a user/group to a fixed uid/gid and re-own their files.
# Adapted from T231067#6851675.

set -x

change_uid() {
    # $1 new uid
    # $2 username
    if id "$2" &>/dev/null
    then
        OLD_UID=$(id -u "$2")
        usermod -u "$1" "$2"
        # re-own files left behind with the old numeric uid, skipping pseudo-filesystems
        find / \( -path /proc -o -path /mnt -o -path /sys -o -path /dev -o -path /media \) -prune -false -o -user "$OLD_UID" -print0 | xargs -0 -r chown "$1"
    fi
}

change_gid() {
    # $1 new gid
    # $2 username
    if getent group "$2" &>/dev/null
    then
        OLD_GID=$(getent group "$2" | cut -d ":" -f 3)
        groupmod -g "$1" "$2"
        find / \( -path /proc -o -path /mnt -o -path /sys -o -path /dev -o -path /media \) -prune -false -o -group "$OLD_GID" -print0 | xargs -0 -r chgrp "$1"
    fi
}

## kafka

change_uid 916 kafka
change_gid 916 kafka
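
(To be run as root with kafka stopped on the target node; note that the find deliberately prunes only the pseudo-filesystems and not /srv, so the data files get re-owned as well.)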

Change 743150 merged by Elukey:

[operations/puppet@production] profile::kafka::broker: allow to specify kafka uid/gid

https://gerrit.wikimedia.org/r/743150

Change 743163 merged by Elukey:

[operations/puppet@production] Move kafka test to fixed gid/uid for user kafka

https://gerrit.wikimedia.org/r/743163

The kafka-main codfw cluster is running with fixed gid/uid now. Next steps:

sudo cookbook sre.hosts.reimage --os buster -t T296641 kafka-main2003
sudo cookbook sre.hosts.reimage --os buster -t T296641 kafka-main2002
sudo cookbook sre.hosts.reimage --os buster -t T296641 kafka-main2001

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main2003.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main2003.codfw.wmnet with OS buster executed with errors:

  • kafka-main2003 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
elukey changed the task status from Open to Stalled · Dec 10 2021, 6:49 AM

This task is not actionable until the firmware of kafka-main2003 is updated; see T297422.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main2003.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main2003.codfw.wmnet with OS buster completed:

  • kafka-main2003 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112161031_elukey_30224_kafka-main2003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

The BIOS+NIC upgrade for kafka-main2003 worked; it is now running buster. The partman reuse recipe and the fixed uid/gid worked as well.

Next steps:

  1. Upgrade BIOS+NIC on kafka-main200[12] (see subtask)
  2. Reimage the nodes to Buster
  3. Rollout the fixed uid/gid change to Kafka Main eqiad
  4. Upgrade BIOS+NIC on kafka-main100[1-3]
  5. Reimage the nodes to Buster
elukey changed the task status from Stalled to Open · Dec 16 2021, 5:24 PM

Thanks to Papaul, kafka-main200[12] are now ready to be reimaged :)

Change 747999 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] install_server: set the reuse recipe for all kafka-main hosts

https://gerrit.wikimedia.org/r/747999

Change 747999 merged by Elukey:

[operations/puppet@production] install_server: set the reuse recipe for all kafka-main hosts

https://gerrit.wikimedia.org/r/747999

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main2002.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main2002.codfw.wmnet with OS buster completed:

  • kafka-main2002 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112170923_elukey_29923_kafka-main2002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main2001.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main2001.codfw.wmnet with OS buster completed:

  • kafka-main2001 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112171421_elukey_4434_kafka-main2001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Kafka main codfw on buster!

Next steps:

  • Rollout the fixed uid/gid change to Kafka main eqiad
  • Upgrade BIOS+NIC on kafka-main100[1-3]
  • Reimage the nodes to Buster

Change 752613 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::main: apply kafka fixed uid/gid to eqiad

https://gerrit.wikimedia.org/r/752613

Mentioned in SAL (#wikimedia-operations) [2022-01-10T10:38:51Z] <elukey> stop/start kafka daemons on kafka-main1* nodes to move the kafka user to fixed uid/gid - T296641

Change 752613 merged by Elukey:

[operations/puppet@production] role::kafka::main: apply kafka fixed uid/gid to eqiad

https://gerrit.wikimedia.org/r/752613

Next steps:

  • Upgrade BIOS+NIC on kafka-main100[1-3] - T298867
  • Reimage the nodes to Buster

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main1003.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main1003.eqiad.wmnet with OS buster completed:

  • kafka-main1003 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201130808_elukey_10279_kafka-main1003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main1002.eqiad.wmnet with OS buster completed:

  • kafka-main1002 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201130926_elukey_22659_kafka-main1002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main1001.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main1001.eqiad.wmnet with OS buster completed:

  • kafka-main1001 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201131029_elukey_21705_kafka-main1001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
elukey claimed this task.

All nodes on buster!