
Upgrade kafka-main to Kafka 3.7
Closed, Resolved · Public

Description

Umbrella task for the upgrade of the kafka-main clusters to Kafka 3.7.
Additionally: performing a VLAN migration for select broker hosts in eqiad, and a Debian Trixie upgrade for all hosts in both clusters, in T427088: [Post kafka-main 3.7 upgrade work] Reimage brokers to trixie/JDK21 & vlan migrations on select brokers

Items (summarized):

kafka-main has 2 clusters, main-codfw and main-eqiad. We'll upgrade codfw first, then move on to eqiad. We'll then upgrade the inter-broker protocol for all brokers sequentially. Following the upgrade, we'll also upgrade all hosts to Debian Trixie in T427088: [Post kafka-main 3.7 upgrade work] Reimage brokers to trixie/JDK21 & vlan migrations on select brokers

  • kafka-main codfw:
    • Pin the inter-broker protocol version on the brokers by setting hieradata/role/common/kafka/main.yaml:profile::kafka::broker::inter_broker_protocol_version: 1.1.0
    • Perform a rolling upgrade of the brokers, which will restart with the pinned protocol version and the new Kafka version, using host-by-host patches and a restart of the Kafka broker service, e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1273863
      • kafka-main2006
      • kafka-main2007
      • kafka-main2008
      • kafka-main2009
      • kafka-main2010
    • Change the inter broker protocol version to match the new kafka version
      • Set hieradata/role/common/kafka/main.yaml:profile::kafka::broker::inter_broker_protocol_version: 3.7
    • Perform a final rolling restart of the brokers
  • kafka-main eqiad:
    • Pin the inter-broker protocol version on the brokers by setting hieradata/role/common/kafka/main.yaml:profile::kafka::broker::inter_broker_protocol_version: 1.1.0
    • Perform a rolling upgrade of the brokers, which will restart with the pinned protocol version and the new Kafka version, using host-by-host patches and a restart of the Kafka broker service, e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1273863
      • kafka-main1006
      • kafka-main1007
      • kafka-main1008
      • kafka-main1009
      • kafka-main1010
    • Change the inter broker protocol version to match the new kafka version
      • Set hieradata/role/common/kafka/main.yaml:profile::kafka::broker::inter_broker_protocol_version: 3.7
    • Perform a final rolling restart of the brokers
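
Between the two restarts described above, it can be useful to confirm which inter-broker protocol version a broker is actually configured with. The following is a minimal sketch, not part of the documented procedure. It assumes the rendered broker configuration uses the standard Kafka property name inter.broker.protocol.version and that Puppet writes the hiera value verbatim. The helper names ibp_version and ibp_is are hypothetical.

```shell
# Hypothetical helper: print the inter.broker.protocol.version value
# found in a Kafka properties file (prints nothing if the key is absent).
ibp_version() {
  sed -n 's/^inter\.broker\.protocol\.version[[:space:]]*=[[:space:]]*//p' "$1"
}

# Hypothetical helper: succeed only if the file is set to the expected version.
#   $1 = path to the properties file, $2 = expected version string
ibp_is() {
  [ "$(ibp_version "$1")" = "$2" ]
}
```

On a broker, `ibp_is /etc/kafka/server.properties 1.1.0` would be expected to succeed while the pin is in place, and `ibp_is /etc/kafka/server.properties 3.7` once the final change has been applied.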

Additional migrations/upgrades & post-upgrade cleanup: see T427088: [Post kafka-main 3.7 upgrade work] Reimage brokers to trixie/JDK21 & vlan migrations on select brokers

Cluster states:

kafka-main codfw:

| Kafka Broker | Confluent distribution 77 | Inter-broker protocol |
| --- | --- | --- |
| kafka-main2006.codfw.wmnet | Upgraded ✅ | 3.7 ✅ |
| kafka-main2007.codfw.wmnet | Upgraded ✅ | 3.7 ✅ |
| kafka-main2008.codfw.wmnet | Upgraded ✅ | 3.7 ✅ |
| kafka-main2009.codfw.wmnet | Upgraded ✅ | 3.7 ✅ |
| kafka-main2010.codfw.wmnet | Upgraded ✅ | 3.7 ✅ |

kafka-main eqiad:

| Kafka Broker | Confluent distribution 77 | Inter-broker protocol |
| --- | --- | --- |
| kafka-main1006.eqiad.wmnet | Upgraded ✅ | 3.7 ✅ |
| kafka-main1007.eqiad.wmnet | Upgraded ✅ | 3.7 ✅ |
| kafka-main1008.eqiad.wmnet | Upgraded ✅ | 3.7 ✅ |
| kafka-main1009.eqiad.wmnet | Upgraded ✅ | 3.7 ✅ |
| kafka-main1010.eqiad.wmnet | Upgraded ✅ | 3.7 ✅ |

[0] - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Upgrade_to_Kafka_3.7


Event Timeline

JMeybohm renamed this task from "Upgrade kafka-main to Kafka 3.5" to "Upgrade kafka-main to Kafka 3.x". Mar 12 2026, 5:15 PM
JMeybohm updated the task description.

Change #1278832 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] role::kafka::main: move to Confluent Kafka 3.7

https://gerrit.wikimedia.org/r/1278832

Change #1278832 merged by Jasmine:

[operations/puppet@production] kafka-main: set main-codfw cluster brokers to Confluent distro 77 (3.7)

https://gerrit.wikimedia.org/r/1278832

Change #1282999 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] kafka-main: add eqiad-main cluster brokers to Confluent distro 77 (3.7)

https://gerrit.wikimedia.org/r/1282999

jasmine_ renamed this task from "Upgrade kafka-main to Kafka 3.x" to "Upgrade kafka-main to Kafka 3.7". May 5 2026, 5:54 PM
jasmine_ updated the task description.

Change #1282999 merged by Jasmine:

[operations/puppet@production] kafka-main: add eqiad-main cluster brokers to Confluent distro 77 (3.7)

https://gerrit.wikimedia.org/r/1282999

Change #1283988 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] kafka-main: set codfw brokers inter-broker protocol to 3.7

https://gerrit.wikimedia.org/r/1283988

As of this morning, both kafka-main clusters (main-codfw and main-eqiad) have been upgraded to Kafka 3.7.
We plan to proceed with the inter-broker protocol version upgrade for both clusters on Thursday at 13:30 UTC.

Some notes on leader election during the codfw upgrade:
(Edit: Now summarized in T425528: Rework ACLs on Kafka 3.x clusters)

Although the cookbook succeeded in upgrading all brokers in main-codfw, it failed on the final command to balance replicas:

$ kafka leader-election --election-type PREFERRED --all-topic-partitions
Not authorized to perform leader election
Not authorized to perform leader election
org.apache.kafka.server.common.AdminCommandFailedException: Not authorized to perform leader election
        at org.apache.kafka.tools.LeaderElectionCommand.electLeaders(LeaderElectionCommand.java:134)
        at org.apache.kafka.tools.LeaderElectionCommand.run(LeaderElectionCommand.java:117)
        at org.apache.kafka.tools.LeaderElectionCommand.mainNoExit(LeaderElectionCommand.java:71)
        at org.apache.kafka.tools.LeaderElectionCommand.main(LeaderElectionCommand.java:66)

We were able to run leader-election and balance replicas manually after adding additional permissions for the anonymous user. Since anonymous users can't add ACLs, we needed to provide credentials in the form of mTLS certificates to the kafka commands, which granted us superuser privileges (by pretending we're the local machine).

# Copy certificate and secret data from kafka server.properties to a temporary client.properties
$ egrep '^(ssl|security)' /etc/kafka/server.properties > client.properties
# Then use tcp/9093 (which uses TLS) instead of tcp/9092 to connect to kafka 
# and add the client.properties for authentication via --command-config
$ export KAFKA_ARGS="--bootstrap-server kafka-main2006.codfw.wmnet:9093,kafka-main2007.codfw.wmnet:9093,kafka-main2008.codfw.wmnet:9093,kafka-main2009.codfw.wmnet:9093,kafka-main2010.codfw.wmnet:9093 --command-config client.properties"
# Add Alter cluster permission to the anonymous user
$ kafka-acls $KAFKA_ARGS --add --allow-principal User:ANONYMOUS --operation Alter --cluster
# In order for users to be able to 'kafka topics --describe', we also had to add:
$ kafka-acls $KAFKA_ARGS --add --allow-principal User:ANONYMOUS --operation DescribeConfigs --topic '*'
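
The two preparation steps above (extracting the TLS settings and assembling the long --bootstrap-server list) could be wrapped in small shell helpers. This is a sketch under stated assumptions, not the procedure that was actually run. The function names are hypothetical, and the host numbering 06 to 10 is taken from the broker lists in this task. Restricting the file mode is an added precaution, because the copied properties contain secret data.

```shell
# Hypothetical helper: copy the ssl* and security* settings from a broker's
# server.properties into a client properties file with owner-only permissions.
#   $1 = source server.properties, $2 = destination client.properties
make_client_properties() {
  # The subshell keeps the umask change local to this function call.
  ( umask 077; grep -E '^(ssl|security)' "$1" > "$2" && chmod 600 "$2" )
}

# Hypothetical helper: build the --bootstrap-server value for one cluster.
#   $1 = site (codfw or eqiad)
#   $2 = leading host digit (2 for codfw, 1 for eqiad)
#   $3 = port (defaults to 9093, the TLS port)
bootstrap_servers() {
  bs_out=""
  for bs_n in 06 07 08 09 10; do
    # Append each host, separated by commas (no leading comma on the first).
    bs_out="${bs_out:+$bs_out,}kafka-main${2}0${bs_n}.${1}.wmnet:${3:-9093}"
  done
  printf '%s\n' "$bs_out"
}
```

With these, KAFKA_ARGS could be assembled as `--bootstrap-server $(bootstrap_servers codfw 2) --command-config client.properties` after running `make_client_properties /etc/kafka/server.properties client.properties`. The temporary client.properties should be removed once the ACL changes are done.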

Thanks to @JMeybohm & @elukey, these ACLs were also added to main-eqiad prior to the upgrade, preemptively resolving the issue.

The remaining work follows in T425528: Rework ACLs on Kafka 3.x clusters.

Change #1283988 merged by Jasmine:

[operations/puppet@production] kafka-main: set codfw brokers inter-broker protocol to 3.7

https://gerrit.wikimedia.org/r/1283988

Change #1284646 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] kafka-main: set eqiad (all) brokers inter-broker protocol to 3.7

https://gerrit.wikimedia.org/r/1284646

Change #1284646 merged by Jasmine:

[operations/puppet@production] kafka-main: set eqiad (all) brokers inter-broker protocol to 3.7

https://gerrit.wikimedia.org/r/1284646

jasmine_ updated the task description.

Change #1285474 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] kafka-main1006: apply host-level override, jdk 21 in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1285474

Change #1285475 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] kafka-main1007: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1285475

Change #1285476 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] kafka-main1008: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1285476

Change #1285477 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] kafka-main1009: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1285477

Change #1285478 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] kafka-main1010: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1285478

Change #1288917 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] kafka-main2006: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1288917

Change #1288918 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] kafka-main2007: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1288918

Change #1288919 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] kafka-main2008: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1288919

Change #1288920 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] kafka-main2009: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1288920

Change #1288921 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] kafka-main2010: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1288921

Resolving as all brokers are now on kafka 3.7 🎉 Thanks @JMeybohm & @elukey

For additional vlan migration on select brokers, Trixie upgrade work + post-upgrade cleanup: see T427088: [Post kafka-main 3.7 upgrade work] Reimage brokers to trixie/JDK21 & vlan migrations on select brokers

Change #1288917 merged by Jasmine:

[operations/puppet@production] kafka-main2006: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1288917

Change #1288918 merged by JMeybohm:

[operations/puppet@production] kafka-main2007: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1288918

Change #1288919 merged by Jasmine:

[operations/puppet@production] kafka-main2008: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1288919

Change #1285474 merged by Jasmine:

[operations/puppet@production] kafka-main1006: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1285474

Change #1285475 merged by Jasmine:

[operations/puppet@production] kafka-main1007: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1285475

Change #1285478 merged by Jasmine:

[operations/puppet@production] kafka-main1010: apply host-level override in advance of trixie upgrade [0]

https://gerrit.wikimedia.org/r/1285478