Upgrade kafka-jumbo100[7-9] to Debian Bullseye
Closed, Resolved (Public)

Event Timeline

I was looking at how to ensure that kafka-jumbo100[7-9] would retain their data during a reimage, and found the following netboot config, thanks to @BTullis:

case $(debconf-get netcfg/get_hostname) in \
     ...
    kafka-jumbo100[1-9]) echo reuse-parts.cfg partman/custom/reuse-kafka-jumbo.cfg ;; \
    kafka-jumbo101[0-5]) echo partman/custom/kafka-jumbo.cfg ;; \
    ...

Interestingly, kafka-jumbo100[1-6] and kafka-jumbo100[7-9] use different names for their device-mapper devices:

brouberol@cumin1001:~$ sudo cumin 'kafka-jumbo100[1-9].eqiad.wmnet' "df -h | grep vg | awk '{ print \$1 }'"
9 hosts will be targeted:
kafka-jumbo[1001-1009].eqiad.wmnet
OK to proceed on 9 hosts? Enter the number of affected hosts to confirm or "q" to quit: 9
===== NODE GROUP =====
(3) kafka-jumbo[1007-1009].eqiad.wmnet
----- OUTPUT of 'df -h | grep vg ...k '{ print $1 }'' -----
/dev/mapper/vg0-root
/dev/mapper/vg1-srv
===== NODE GROUP =====
(6) kafka-jumbo[1001-1006].eqiad.wmnet
----- OUTPUT of 'df -h | grep vg ...k '{ print $1 }'' -----
/dev/mapper/vg--flex-root
/dev/mapper/vg--data-srv
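
The double hyphens in the second group come from how LVM builds device-mapper names: the volume group and logical volume names are joined with a single hyphen, and any hyphen already inside either name is doubled. A minimal sketch of that rule follows. The volume group names (vg-flex, vg-data, vg0, vg1) are inferred from the device names above rather than taken from the hosts, and dm_name is a hypothetical helper.

```shell
# Build the device-mapper path LVM uses for a logical volume.
# LVM joins the VG and LV names with one hyphen and doubles any hyphen
# that is already part of either name.
# NOTE: dm_name is a hypothetical helper; the VG names passed below are
# inferred from the /dev/mapper names in the cumin output.
dm_name() {
    vg=$(printf '%s' "$1" | sed 's/-/--/g')
    lv=$(printf '%s' "$2" | sed 's/-/--/g')
    printf '/dev/mapper/%s-%s\n' "$vg" "$lv"
}

dm_name vg-flex root   # layout seen on kafka-jumbo100[1-6]
dm_name vg-data srv
dm_name vg0 root       # layout seen on kafka-jumbo100[7-9]
dm_name vg1 srv
```

So the two host groups differ in their volume group names, not just in how the device paths are rendered.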

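Looking at partman/custom/reuse-kafka-jumbo.cfg, I'm not sure it would work for kafka-jumbo100[7-9], since it references /dev/mapper/vg--flex-root and /dev/mapper/vg--data-srv, which those hosts do not have: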

d-i	partman/reuse_partitions_recipe	string \
	/dev/sda|1 ext4 format /boot|2 lvmpv ignore none, \
	/dev/sdb|1 lvmpv ignore none, \
	/dev/mapper/vg--flex-root|1 ext4 format /, \
	/dev/mapper/vg--data-srv|1 ext4 keep /srv

d-i partman-basicfilesystems/no_swap boolean false
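
The mismatch can be checked mechanically. The sketch below extracts the device paths from the recipe and reports the /dev/mapper entries that are absent from a host's device list. It assumes, based only on the recipe above, that a device is any /dev/... path directly followed by a '|'. The helper names are hypothetical and the device list is copied from the cumin output for kafka-jumbo100[7-9].

```shell
# Print every block device named in a reuse-parts recipe read from stdin.
# ASSUMPTION: a device is any /dev/... path directly followed by '|'
# (inferred from the recipe in this task, not from the reuse-parts docs).
recipe_devices() {
    grep -o '/dev/[^|[:space:]]*|' | sed 's/|$//'
}

# Print the recipe's /dev/mapper devices that are missing from the
# newline-separated device list given as the second argument.
missing_mapper_devices() {
    printf '%s\n' "$1" | recipe_devices | grep '^/dev/mapper/' |
    while read -r dev; do
        printf '%s\n' "$2" | grep -qxF -e "$dev" || printf '%s\n' "$dev"
    done
}

# Recipe as currently written (line continuations removed).
recipe='/dev/sda|1 ext4 format /boot|2 lvmpv ignore none, /dev/sdb|1 lvmpv ignore none, /dev/mapper/vg--flex-root|1 ext4 format /, /dev/mapper/vg--data-srv|1 ext4 keep /srv'

# Devices reported by kafka-jumbo100[7-9] in the cumin output.
present='/dev/mapper/vg0-root
/dev/mapper/vg1-srv'

missing_mapper_devices "$recipe" "$present"
```

With these inputs both mapper devices in the recipe are reported as missing, which is the reason for the doubt above.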

Given that kafka-jumbo100[1-6] no longer hold any data and are due for decommissioning, I think we should change the netboot config to the following:

case $(debconf-get netcfg/get_hostname) in \
     ...
    kafka-jumbo100[7-9]) echo reuse-parts.cfg partman/custom/reuse-kafka-jumbo.cfg ;; \
    kafka-jumbo101[0-5]) echo partman/custom/kafka-jumbo.cfg ;; \
    ...

and

# partman/custom/reuse-kafka-jumbo.cfg
d-i	partman/reuse_partitions_recipe	string \
	/dev/sda|1 ext4 format /boot|2 lvmpv ignore none, \
	/dev/sdb|1 lvmpv ignore none, \
	/dev/mapper/vg0-root|1 ext4 format /, \
	/dev/mapper/vg1-srv|1 ext4 keep /srv

d-i partman-basicfilesystems/no_swap boolean false
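
As a quick illustration of the narrower hostname pattern, here is a sketch that uses only the two patterns visible in this task. The other cases of the real netboot config are elided here, so they are represented by a placeholder string, and partman_recipe_for is a hypothetical name.

```shell
# Return the partman recipe files selected for a hostname, using only the
# two patterns shown in the proposal.
# ASSUMPTION: every other case of the real file is elided in this task, so
# it is represented here by a placeholder.
partman_recipe_for() {
    case "$1" in
        kafka-jumbo100[7-9]) echo "reuse-parts.cfg partman/custom/reuse-kafka-jumbo.cfg" ;;
        kafka-jumbo101[0-5]) echo "partman/custom/kafka-jumbo.cfg" ;;
        *) echo "(not covered by the visible patterns)" ;;
    esac
}

for host in kafka-jumbo1006 kafka-jumbo1007 kafka-jumbo1009 kafka-jumbo1010; do
    printf '%s -> %s\n' "$host" "$(partman_recipe_for "$host")"
done
```

With the [7-9] pattern, kafka-jumbo1006 no longer picks up the reuse recipe, while kafka-jumbo1007 and kafka-jumbo1009 still do.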

@Stevemunene @BTullis Does that sound right to you?

brouberol changed the task status from Open to In Progress. Oct 16 2023, 1:59 PM
brouberol moved this task from Misc to In Progress on the Data-Platform-SRE board.

Change 967930 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Adapt the reuse-kafka-jumbo partman recipe for the new kafka-jumbo disk layout

https://gerrit.wikimedia.org/r/967930

Change 967930 merged by Brouberol:

[operations/puppet@production] Adapt the reuse-kafka-jumbo partman recipe for the new kafka-jumbo disk layout

https://gerrit.wikimedia.org/r/967930

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1007.eqiad.wmnet with OS bullseye

Change 968637 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Change the reuse-parts recipe for kafka-jumbo slightly.

https://gerrit.wikimedia.org/r/968637

Change 968637 merged by Btullis:

[operations/puppet@production] Change the reuse-parts recipe for kafka-jumbo slightly.

https://gerrit.wikimedia.org/r/968637

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1007.eqiad.wmnet with OS bullseye completed:

  • kafka-jumbo1007 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310251355_btullis_556225_kafka-jumbo1007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-analytics) [2023-10-26T08:06:15Z] <brouberol> sudo cookbook sre.hosts.reimage --os bullseye -t T348495 kafka-jumbo1008

Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host kafka-jumbo1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host kafka-jumbo1008.eqiad.wmnet with OS bullseye completed:

  • kafka-jumbo1008 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310260824_brouberol_818182_kafka-jumbo1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-analytics) [2023-10-26T08:48:58Z] <brouberol> sudo cookbook sre.hosts.reimage --os bullseye -t T348495 kafka-jumbo1009

Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host kafka-jumbo1009.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host kafka-jumbo1009.eqiad.wmnet with OS bullseye completed:

  • kafka-jumbo1009 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310260906_brouberol_826668_kafka-jumbo1009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

brouberol@cumin1001:~$ sudo cumin A:kafka-jumbo 'grep -i version /etc/os-release'
9 hosts will be targeted:
kafka-jumbo[1007-1015].eqiad.wmnet
OK to proceed on 9 hosts? Enter the number of affected hosts to confirm or "q" to quit: 9
===== NODE GROUP =====
(9) kafka-jumbo[1007-1015].eqiad.wmnet
----- OUTPUT of 'grep -i version /etc/os-release' -----
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
VERSION_CODENAME=bullseye
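
The same check can be scripted rather than read by eye. A minimal sketch, assuming the os-release content is supplied on stdin; os_codename is a hypothetical helper and the sample text is copied from the output above.

```shell
# Read os-release content on stdin and print the value of VERSION_CODENAME,
# with any surrounding double quotes removed.
# NOTE: os_codename is a hypothetical helper, not part of cumin or spicerack.
os_codename() {
    sed -n 's/^VERSION_CODENAME=//p' | tr -d '"'
}

# Sample copied from the cumin output above.
sample='VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye'

if [ "$(printf '%s\n' "$sample" | os_codename)" = "bullseye" ]; then
    echo "host is on bullseye"
else
    echo "host is NOT on bullseye"
fi
```

On a real host the function could be fed with `os_codename < /etc/os-release`.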