Upgrade Observability Kafka-logging hosts to trixie
Closed, Resolved (Public)

Description

In addition to the upgrades of Kafka itself in T416669: Upgrade Kafka to version 3.x, we need to upgrade the base OS of the hosts to Debian trixie.

Note: once T423723: Upgrade kafka-logging to version 3.7 is done, we'll be ready to start the trixie upgrades.

Logging:

  • eqiad
    • kafka-logging1001
    • kafka-logging1002
    • kafka-logging1003
    • kafka-logging1004
    • kafka-logging1005
  • codfw
    • kafka-logging2001
    • kafka-logging2002
    • kafka-logging2003
    • kafka-logging2004
    • kafka-logging2005

Monitoring hosts are tracked in T418858: Migrate kafkamon hosts to trixie

Event Timeline

The ordering between the Debian upgrade and the Kafka version upgrade needs to be considered: if there are no packages for the old Kafka in trixie, then we need to upgrade to 3.5 first.

Looks like this would be the first Kafka cluster on trixie. For Kafka 3.5 we would likely bring confluent-kafka 7.5.12-1 (3.5) to trixie, along with an appropriate JDK version. I'm not sure offhand which JDK version is recommended for 3.5.

apt1002:~$ sudo -E reprepro ls confluent-kafka
confluent-kafka |  7.4.0-1 |   buster-wikimedia | amd64
confluent-kafka |  7.4.0-1 | bullseye-wikimedia | amd64
confluent-kafka | 7.5.12-1 | bullseye-wikimedia | amd64
confluent-kafka | 7.5.12-1 | bookworm-wikimedia | amd64

Or, to upgrade Debian while staying on the current Kafka 2.11, we could possibly copy that package to trixie along with a JDK.

apt1002:~$ sudo -E reprepro ls confluent-kafka-2.11
confluent-kafka-2.11 | 1.1.0-1 |   buster-wikimedia | amd64, i386
confluent-kafka-2.11 | 1.1.0-1 | bullseye-wikimedia | amd64, i386
confluent-kafka-2.11 | 1.1.0-1 | bookworm-wikimedia | amd64, i386

Kafka 2.11 is running under openjdk-8 (which, afaik, is the case for all the live Kafka clusters today). I'm not sure whether copying openjdk-8 to trixie would be an issue. What do you think, @MoritzMuehlenhoff?

apt1002:~$ sudo -E reprepro ls openjdk-8-jdk
openjdk-8-jdk | 8u342-b07-1~deb10u1 |   buster-wikimedia | amd64
openjdk-8-jdk |  8u412-ga-1~deb10u1 |   buster-wikimedia | amd64
openjdk-8-jdk |  8u472-ga-1~deb11u1 | bullseye-wikimedia | amd64
openjdk-8-jdk |  8u472-ga-1~deb12u1 | bookworm-wikimedia | amd64
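
The question raised above (is a given package already published for trixie?) can be checked mechanically from `reprepro ls` output. The following is a minimal sketch in Python, assuming the four-column `package | version | distribution | architectures` layout shown in the listings above, and assuming the target distribution would be named `trixie-wikimedia` by analogy with the existing ones. `parse_reprepro_ls` and `missing_from` are hypothetical helper names, not existing tooling.

```python
from collections import defaultdict

# Sample copied from the confluent-kafka listing above.
SAMPLE = """\
confluent-kafka |  7.4.0-1 |   buster-wikimedia | amd64
confluent-kafka |  7.4.0-1 | bullseye-wikimedia | amd64
confluent-kafka | 7.5.12-1 | bullseye-wikimedia | amd64
confluent-kafka | 7.5.12-1 | bookworm-wikimedia | amd64
"""


def parse_reprepro_ls(output: str) -> dict[str, set[str]]:
    """Map each distribution to the set of versions listed for it."""
    versions: dict[str, set[str]] = defaultdict(set)
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 4:
            continue  # skip shell prompts, blank lines, anything unexpected
        _package, version, distribution, _architectures = fields
        versions[distribution].add(version)
    return dict(versions)


def missing_from(output: str, distribution: str) -> bool:
    """Return True if no version is listed for the given distribution."""
    return distribution not in parse_reprepro_ls(output)


if __name__ == "__main__":
    # "trixie-wikimedia" is an assumed name, see the note above.
    if missing_from(SAMPLE, "trixie-wikimedia"):
        print("no confluent-kafka package listed for trixie-wikimedia")
```

With the sample above, the check reports that no package is listed for the assumed trixie distribution, which matches the concern in the comment: a package has to be copied or built for trixie before a host can be reimaged.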
hnowlan triaged this task as High priority. Mar 25 2026, 4:54 PM
hnowlan moved this task from Inbox to Prioritized on the Observability-Logging board.
herron renamed this task from Upgrade Observability Kafka hosts to trixie to Upgrade Observability Kafka-logging hosts to trixie. Apr 17 2026, 4:13 PM

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1003 for host kafka-logging2005.codfw.wmnet with OS trixie

Change #1280431 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging2005: update IP addresses

https://gerrit.wikimedia.org/r/1280431

Change #1280431 merged by Herron:

[operations/puppet@production] kafka-logging2005: update IP addresses

https://gerrit.wikimedia.org/r/1280431

Change #1280467 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging2005: use jdk 21 in trixie

https://gerrit.wikimedia.org/r/1280467

Change #1280467 merged by Herron:

[operations/puppet@production] kafka-logging2005: use jdk 21 in trixie

https://gerrit.wikimedia.org/r/1280467

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1003 for host kafka-logging2005.codfw.wmnet with OS trixie completed:

  • kafka-logging2005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604301433_herron_3049150_kafka-logging2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1003 for host kafka-logging2004.codfw.wmnet with OS trixie

Change #1281529 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging2004: use trixie jvm settings

https://gerrit.wikimedia.org/r/1281529

Change #1281529 merged by Herron:

[operations/puppet@production] kafka-logging2004: use trixie jvm settings

https://gerrit.wikimedia.org/r/1281529

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1003 for host kafka-logging2004.codfw.wmnet with OS trixie completed:

  • kafka-logging2004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605011534_herron_3281907_kafka-logging2004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1003 for host kafka-logging2003.codfw.wmnet with OS trixie

Change #1281562 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging2003: update IP and prep for trixie

https://gerrit.wikimedia.org/r/1281562

Change #1281562 merged by Herron:

[operations/puppet@production] kafka-logging2003: update IP and prep for trixie

https://gerrit.wikimedia.org/r/1281562

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1003 for host kafka-logging2003.codfw.wmnet with OS trixie completed:

  • kafka-logging2003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605011804_herron_3302269_kafka-logging2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1003 for host kafka-logging2002.codfw.wmnet with OS trixie

Change #1281596 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging2002: update IP and prep for trixie

https://gerrit.wikimedia.org/r/1281596

Change #1281596 merged by Herron:

[operations/puppet@production] kafka-logging2002: update IP and prep for trixie

https://gerrit.wikimedia.org/r/1281596

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1003 for host kafka-logging2002.codfw.wmnet with OS trixie completed:

  • kafka-logging2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605011954_herron_3310785_kafka-logging2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1003 for host kafka-logging2001.codfw.wmnet with OS trixie

Change #1282369 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging2001: update IP and prep for trixie

https://gerrit.wikimedia.org/r/1282369

Change #1282369 merged by Herron:

[operations/puppet@production] kafka-logging2001: update IP and prep for trixie

https://gerrit.wikimedia.org/r/1282369

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1003 for host kafka-logging2001.codfw.wmnet with OS trixie completed:

  • kafka-logging2001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605041439_herron_3973968_kafka-logging2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1003 for host kafka-logging1005.eqiad.wmnet with OS trixie

Change #1282412 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging1005: prep for trixie

https://gerrit.wikimedia.org/r/1282412

Change #1282412 merged by Herron:

[operations/puppet@production] kafka-logging1005: prep for trixie

https://gerrit.wikimedia.org/r/1282412

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1003 for host kafka-logging1005.eqiad.wmnet with OS trixie completed:

  • kafka-logging1005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605041948_herron_4083499_kafka-logging1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1003 for host kafka-logging1004.eqiad.wmnet with OS trixie

Change #1282997 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging1004: prep for trixie

https://gerrit.wikimedia.org/r/1282997

Change #1282997 merged by Herron:

[operations/puppet@production] kafka-logging1004: prep for trixie

https://gerrit.wikimedia.org/r/1282997

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1003 for host kafka-logging1004.eqiad.wmnet with OS trixie completed:

  • kafka-logging1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605051428_herron_254964_kafka-logging1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1003 for host kafka-logging1003.eqiad.wmnet with OS trixie

Change #1283055 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging1003: update IP and prep for trixie

https://gerrit.wikimedia.org/r/1283055

Change #1283055 merged by Herron:

[operations/puppet@production] kafka-logging1003: update IP and prep for trixie

https://gerrit.wikimedia.org/r/1283055

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1003 for host kafka-logging1003.eqiad.wmnet with OS trixie completed:

  • kafka-logging1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605051744_herron_369380_kafka-logging1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1003 for host kafka-logging1002.eqiad.wmnet with OS trixie

Change #1283081 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging1002: update IP and prep for trixie

https://gerrit.wikimedia.org/r/1283081

Change #1283081 merged by Herron:

[operations/puppet@production] kafka-logging1002: update IP and prep for trixie

https://gerrit.wikimedia.org/r/1283081

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1003 for host kafka-logging1002.eqiad.wmnet with OS trixie completed:

  • kafka-logging1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605052004_herron_387447_kafka-logging1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1003 for host kafka-logging1001.eqiad.wmnet with OS trixie

Change #1283139 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging1001: prep for trixie

https://gerrit.wikimedia.org/r/1283139

Change #1283139 merged by Herron:

[operations/puppet@production] kafka-logging1001: prep for trixie

https://gerrit.wikimedia.org/r/1283139

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1003 for host kafka-logging1001.eqiad.wmnet with OS trixie completed:

  • kafka-logging1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605060045_herron_423178_kafka-logging1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
herron claimed this task.
herron updated the task description.

All kafka-logging hosts have been upgraded to trixie
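
The reimage log paths recorded in the timeline can be used to reconstruct the rollout order. A minimal sketch in Python, assuming the filename layout is `<YYYYMMDDHHMM>_<user>_<numeric id>_<host>.out` (inferred from the paths in this task, not from Spicerack documentation). `parse_log_path` is a hypothetical helper name, and only three of the ten paths are included for brevity.

```python
import re
from datetime import datetime

# Assumed layout: <YYYYMMDDHHMM>_<user>_<numeric id>_<host>.out
LOG_NAME = re.compile(
    r"(?P<stamp>\d{12})_(?P<user>[^_]+)_(?P<run_id>\d+)_(?P<host>[\w.-]+)\.out$"
)

# Paths copied from the timeline above.
LOG_PATHS = [
    "/var/log/spicerack/sre/hosts/reimage/202605060045_herron_423178_kafka-logging1001.out",
    "/var/log/spicerack/sre/hosts/reimage/202604301433_herron_3049150_kafka-logging2005.out",
    "/var/log/spicerack/sre/hosts/reimage/202605041948_herron_4083499_kafka-logging1005.out",
]


def parse_log_path(path: str) -> tuple[str, datetime]:
    """Extract (host, timestamp) from a reimage log path."""
    match = LOG_NAME.search(path)
    if match is None:
        raise ValueError(f"unrecognised log path: {path}")
    stamp = datetime.strptime(match["stamp"], "%Y%m%d%H%M")
    return match["host"], stamp


if __name__ == "__main__":
    # Sort by the timestamp embedded in the filename to recover the order.
    for host, stamp in sorted(map(parse_log_path, LOG_PATHS), key=lambda item: item[1]):
        print(f"{stamp:%Y-%m-%d %H:%M}  {host}")
```

Sorting by the embedded timestamp places kafka-logging2005 first and kafka-logging1001 last, consistent with the order of the cookbook runs in the timeline (codfw first, then eqiad).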