Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | brouberol | T336041 Bring kafka-jumbo10[09-15] into service
Resolved | | brouberol | T346425 Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[07-15] brokers
Resolved | | brouberol | T336044 Decommission kafka-jumbo100[1-6]
Open | | None | T291916 Tracking task for Bullseye migrations in production
Resolved | | BTullis | T288804 Upgrade the Data Engineering infrastructure to Debian Bullseye
Resolved | | brouberol | T348495 Upgrade kafka-jumbo100[7-9] to Debian Bullseye
Event Timeline
I was looking at how to ensure that kafka-jumbo100[7-9] would retain their data during a reimage, and found the following netboot config, thanks to @BTullis:
```
case $(debconf-get netcfg/get_hostname) in \
    ...
    kafka-jumbo100[1-9]) echo reuse-parts.cfg partman/custom/reuse-kafka-jumbo.cfg ;; \
    kafka-jumbo101[0-5]) echo partman/custom/kafka-jumbo.cfg ;; \
    ...
```
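As a quick sanity check on those hostname globs (a throwaway sketch, not anything that lives in puppet), the same `case` patterns can be exercised locally:

```
# Throwaway check that the globs route each host to the intended recipe.
for h in kafka-jumbo1001 kafka-jumbo1007 kafka-jumbo1010 kafka-jumbo1015; do
  case "$h" in
    kafka-jumbo100[1-9]) echo "$h -> reuse-parts.cfg + reuse-kafka-jumbo.cfg" ;;
    kafka-jumbo101[0-5]) echo "$h -> kafka-jumbo.cfg" ;;
  esac
done
```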
What's interesting here is that kafka-jumbo100[1-6] and kafka-jumbo100[7-9] use different device-mapper device names:
```
brouberol@cumin1001:~$ sudo cumin 'kafka-jumbo100[1-9].eqiad.wmnet' "df -h | grep vg | awk '{ print \$1 }'"
9 hosts will be targeted:
kafka-jumbo[1001-1009].eqiad.wmnet
OK to proceed on 9 hosts? Enter the number of affected hosts to confirm or "q" to quit: 9
===== NODE GROUP =====
(3) kafka-jumbo[1007-1009].eqiad.wmnet
----- OUTPUT of 'df -h | grep vg ...k '{ print $1 }'' -----
/dev/mapper/vg0-root
/dev/mapper/vg1-srv
===== NODE GROUP =====
(6) kafka-jumbo[1001-1006].eqiad.wmnet
----- OUTPUT of 'df -h | grep vg ...k '{ print $1 }'' -----
/dev/mapper/vg--flex-root
/dev/mapper/vg--data-srv
```
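To double-check those volume-group names independently of what happens to be mounted, something like the following should work (a sketch; `vgs` is the standard LVM reporting tool, and the cumin invocation mirrors the one above):

```
# List volume groups and their logical-volume counts on each broker,
# without relying on df/mount state.
sudo cumin 'kafka-jumbo100[1-9].eqiad.wmnet' 'vgs --noheadings -o vg_name,lv_count'
```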
Looking at partman/custom/reuse-kafka-jumbo.cfg, I'm not sure it would work for kafka-jumbo100[7-9], since it hard-codes the /dev/mapper/vg--flex-root and /dev/mapper/vg--data-srv paths, which only exist on kafka-jumbo100[1-6]:
```
d-i partman/reuse_partitions_recipe string \
    /dev/sda|1 ext4 format /boot|2 lvmpv ignore none, \
    /dev/sdb|1 lvmpv ignore none, \
    /dev/mapper/vg--flex-root|1 ext4 format /, \
    /dev/mapper/vg--data-srv|1 ext4 keep /srv
d-i partman-basicfilesystems/no_swap boolean false
```
Given that kafka-jumbo100[1-6] now hold no data and are due for decommissioning, I think we should switch to the following:
```
case $(debconf-get netcfg/get_hostname) in \
    ...
    kafka-jumbo100[7-9]) echo reuse-parts.cfg partman/custom/reuse-kafka-jumbo.cfg ;; \
    kafka-jumbo101[0-5]) echo partman/custom/kafka-jumbo.cfg ;; \
    ...
```
and
```
# partman/custom/reuse-kafka-jumbo.cfg
d-i partman/reuse_partitions_recipe string \
    /dev/sda|1 ext4 format /boot|2 lvmpv ignore none, \
    /dev/sdb|1 lvmpv ignore none, \
    /dev/mapper/vg0-root|1 ext4 format /, \
    /dev/mapper/vg1-srv|1 ext4 keep /srv
d-i partman-basicfilesystems/no_swap boolean false
```
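For anyone less familiar with reuse-parts, here is my reading of the recipe entries, inferred from the existing recipes rather than from partman documentation, so treat it as a sketch:

```
# <device>|<partition#> <filesystem> <action> <mountpoint>
#
# format      -> recreate the filesystem; contents are lost (/boot and /)
# keep        -> remount the existing filesystem untouched (/srv, the broker data)
# ignore none -> leave the LVM physical volumes alone, so the VGs survive
#
# The only change vs. the old recipe is the mapper paths:
# vg--flex-root/vg--data-srv become vg0-root/vg1-srv to match the
# layout on kafka-jumbo100[7-9].
```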
@Stevemunene @BTullis Does that sound right to you?
Change 967930 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] Adapt the reuse-kafka-jumbo partman recipe for the new kafka-jumbo disk layout
Change 967930 merged by Brouberol:
[operations/puppet@production] Adapt the reuse-kafka-jumbo partman recipe for the new kafka-jumbo disk layout
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1007.eqiad.wmnet with OS bullseye
Change 968637 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Change the reuse-parts recipe for kafka-jumbo slightly.
Change 968637 merged by Btullis:
[operations/puppet@production] Change the reuse-parts recipe for kafka-jumbo slightly.
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1007.eqiad.wmnet with OS bullseye completed:
- kafka-jumbo1007 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310251355_btullis_556225_kafka-jumbo1007.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Mentioned in SAL (#wikimedia-analytics) [2023-10-26T08:06:15Z] <brouberol> sudo cookbook sre.hosts.reimage --os bullseye -t T348495 kafka-jumbo1008
Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host kafka-jumbo1008.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host kafka-jumbo1008.eqiad.wmnet with OS bullseye completed:
- kafka-jumbo1008 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310260824_brouberol_818182_kafka-jumbo1008.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Mentioned in SAL (#wikimedia-analytics) [2023-10-26T08:48:58Z] <brouberol> sudo cookbook sre.hosts.reimage --os bullseye -t T348495 kafka-jumbo1009
Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host kafka-jumbo1009.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host kafka-jumbo1009.eqiad.wmnet with OS bullseye completed:
- kafka-jumbo1009 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310260906_brouberol_826668_kafka-jumbo1009.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
```
brouberol@cumin1001:~$ sudo cumin A:kafka-jumbo 'grep -i version /etc/os-release'
9 hosts will be targeted:
kafka-jumbo[1007-1015].eqiad.wmnet
OK to proceed on 9 hosts? Enter the number of affected hosts to confirm or "q" to quit: 9
===== NODE GROUP =====
(9) kafka-jumbo[1007-1015].eqiad.wmnet
----- OUTPUT of 'grep -i version /etc/os-release' -----
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
```
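For a belt-and-braces check that the reimages actually kept /srv, something like this would do (a sketch; the exact layout under /srv on these brokers is from memory, so adjust the ls target as needed):

```
# Confirm /srv is still the vg1 logical volume and its contents survived the reimage.
sudo cumin 'kafka-jumbo100[7-9].eqiad.wmnet' 'df -h /srv && ls /srv | head -n 5'
```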