Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | brouberol | T336041 Bring kafka-jumbo10[09-15] into service
Resolved | | brouberol | T346425 Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[07-15] brokers
Resolved | | brouberol | T336044 Decommission kafka-jumbo100[1-6]
Open | | None | T291916 Tracking task for Bullseye migrations in production
Resolved | | BTullis | T288804 Upgrade the Data Engineering infrastructure to Debian Bullseye
Resolved | | brouberol | T348495 Upgrade kafka-jumbo100[7-9] to Debian Bullseye
Event Timeline
I was looking at how to ensure that kafka-jumbo100[7-9] would retain their data during a reimage, and found the following netboot config, thanks to @BTullis:
```
case $(debconf-get netcfg/get_hostname) in \
    ...
    kafka-jumbo100[1-9]) echo reuse-parts.cfg partman/custom/reuse-kafka-jumbo.cfg ;; \
    kafka-jumbo101[0-5]) echo partman/custom/kafka-jumbo.cfg ;; \
    ...
```
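As a quick sanity check on those hostname globs (a throwaway sketch, not anything that lives in puppet), the same `case` patterns can be exercised locally:

```
# Throwaway check that the globs route each host to the intended recipe.
for h in kafka-jumbo1001 kafka-jumbo1007 kafka-jumbo1010 kafka-jumbo1015; do
  case "$h" in
    kafka-jumbo100[1-9]) echo "$h -> reuse-parts.cfg + reuse-kafka-jumbo.cfg" ;;
    kafka-jumbo101[0-5]) echo "$h -> kafka-jumbo.cfg" ;;
  esac
done
```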
What's interesting here is that kafka-jumbo100[1-6] and kafka-jumbo100[7-9] use different device-mapper device names:
```
brouberol@cumin1001:~$ sudo cumin 'kafka-jumbo100[1-9].eqiad.wmnet' "df -h | grep vg | awk '{ print \$1 }'"
9 hosts will be targeted:
kafka-jumbo[1001-1009].eqiad.wmnet
OK to proceed on 9 hosts? Enter the number of affected hosts to confirm or "q" to quit: 9
===== NODE GROUP =====
(3) kafka-jumbo[1007-1009].eqiad.wmnet
----- OUTPUT of 'df -h | grep vg ...k '{ print $1 }'' -----
/dev/mapper/vg0-root
/dev/mapper/vg1-srv
===== NODE GROUP =====
(6) kafka-jumbo[1001-1006].eqiad.wmnet
----- OUTPUT of 'df -h | grep vg ...k '{ print $1 }'' -----
/dev/mapper/vg--flex-root
/dev/mapper/vg--data-srv
```
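To double-check those volume-group names independently of what happens to be mounted, something like the following should work (a sketch; `vgs` is the standard LVM reporting tool, and the cumin invocation mirrors the one above):

```
# List volume groups and their logical-volume counts on each broker,
# without relying on df/mount state.
sudo cumin 'kafka-jumbo100[1-9].eqiad.wmnet' 'vgs --noheadings -o vg_name,lv_count'
```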
Looking at partman/custom/reuse-kafka-jumbo.cfg, I'm not sure it would work for kafka-jumbo100[7-9], since it hard-codes the /dev/mapper/vg--flex-root and /dev/mapper/vg--data-srv paths, which only exist on kafka-jumbo100[1-6]:
```
d-i partman/reuse_partitions_recipe string \
    /dev/sda|1 ext4 format /boot|2 lvmpv ignore none, \
    /dev/sdb|1 lvmpv ignore none, \
    /dev/mapper/vg--flex-root|1 ext4 format /, \
    /dev/mapper/vg--data-srv|1 ext4 keep /srv
d-i partman-basicfilesystems/no_swap boolean false
```
Given that kafka-jumbo100[1-6] now hold no data and are due for decommissioning, I think we should switch to the following:
```
case $(debconf-get netcfg/get_hostname) in \
    ...
    kafka-jumbo100[7-9]) echo reuse-parts.cfg partman/custom/reuse-kafka-jumbo.cfg ;; \
    kafka-jumbo101[0-5]) echo partman/custom/kafka-jumbo.cfg ;; \
    ...
```
and
```
# partman/custom/reuse-kafka-jumbo.cfg
d-i partman/reuse_partitions_recipe string \
    /dev/sda|1 ext4 format /boot|2 lvmpv ignore none, \
    /dev/sdb|1 lvmpv ignore none, \
    /dev/mapper/vg0-root|1 ext4 format /, \
    /dev/mapper/vg1-srv|1 ext4 keep /srv
d-i partman-basicfilesystems/no_swap boolean false
```
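For anyone less familiar with reuse-parts, here is my reading of the recipe entries, inferred from the existing recipes rather than from partman documentation, so treat it as a sketch:

```
# <device>|<partition#> <filesystem> <action> <mountpoint>
#
# format      -> recreate the filesystem; contents are lost (/boot and /)
# keep        -> remount the existing filesystem untouched (/srv, the broker data)
# ignore none -> leave the LVM physical volumes alone, so the VGs survive
#
# The only change vs. the old recipe is the mapper paths:
# vg--flex-root/vg--data-srv become vg0-root/vg1-srv to match the
# layout on kafka-jumbo100[7-9].
```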
@Stevemunene @BTullis Does that sound right to you?
Change 967930 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] Adapt the reuse-kafka-jumbo partman recipe for the new kafka-jumbo disk layout
Change 967930 merged by Brouberol:
[operations/puppet@production] Adapt the reuse-kafka-jumbo partman recipe for the new kafka-jumbo disk layout
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1007.eqiad.wmnet with OS bullseye
Change 968637 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Change the reuse-parts recipe for kafka-jumbo slightly.
Change 968637 merged by Btullis:
[operations/puppet@production] Change the reuse-parts recipe for kafka-jumbo slightly.
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1007.eqiad.wmnet with OS bullseye completed:
- kafka-jumbo1007 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310251355_btullis_556225_kafka-jumbo1007.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Mentioned in SAL (#wikimedia-analytics) [2023-10-26T08:06:15Z] <brouberol> sudo cookbook sre.hosts.reimage --os bullseye -t T348495 kafka-jumbo1008
Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host kafka-jumbo1008.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host kafka-jumbo1008.eqiad.wmnet with OS bullseye completed:
- kafka-jumbo1008 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310260824_brouberol_818182_kafka-jumbo1008.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Mentioned in SAL (#wikimedia-analytics) [2023-10-26T08:48:58Z] <brouberol> sudo cookbook sre.hosts.reimage --os bullseye -t T348495 kafka-jumbo1009
Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host kafka-jumbo1009.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host kafka-jumbo1009.eqiad.wmnet with OS bullseye completed:
- kafka-jumbo1009 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310260906_brouberol_826668_kafka-jumbo1009.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
```
brouberol@cumin1001:~$ sudo cumin A:kafka-jumbo 'grep -i version /etc/os-release'
9 hosts will be targeted:
kafka-jumbo[1007-1015].eqiad.wmnet
OK to proceed on 9 hosts? Enter the number of affected hosts to confirm or "q" to quit: 9
===== NODE GROUP =====
(9) kafka-jumbo[1007-1015].eqiad.wmnet
----- OUTPUT of 'grep -i version /etc/os-release' -----
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
```
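For a belt-and-braces check that the reimages actually kept /srv, something like this would do (a sketch; the exact layout under /srv on these brokers is from memory, so adjust the ls target as needed):

```
# Confirm /srv is still the vg1 logical volume and its contents survived the reimage.
sudo cumin 'kafka-jumbo100[7-9].eqiad.wmnet' 'df -h /srv && ls /srv | head -n 5'
```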