Create a cookbook to execute Kafka rolling upgrades
Closed, ResolvedPublic

Description

Starting point: https://docs.google.com/document/d/1eHqkgKZitERH3M4NkJPA3qW3qWqF4haIkyMbZLJtATw/edit?tab=t.0

We should create a cookbook to safely upgrade single Kafka brokers and/or an entire cluster to Kafka 3.5. Ideally this work will be reused to ease migrations to future versions, without waiting years to do it as we did with Kafka 1.1 :D

Event Timeline

elukey triaged this task as Medium priority. Feb 12 2026, 9:49 AM

From the tests in T416670, the procedure should be:

  • Disable puppet on all target brokers.
  • Merge a puppet change that sets the new target distribution, like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239142
  • For every broker:
    • systemctl stop kafka (needed to avoid race conditions between the old and new kafka processes trying to hold a zk session).
    • force a puppet run
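The ordering of the steps above can be sketched as a small dry-run planner. This is only an illustration, not the code of the merged cookbook: `plan_distro_change` and the two angle-bracket step labels are hypothetical names, and only `systemctl stop kafka` comes from the procedure itself.

```python
from typing import List, Tuple

# The only literal command taken from the procedure above.
STOP_KAFKA = "systemctl stop kafka"
# Hypothetical placeholders: these are labels for a step, not real commands.
DISABLE_PUPPET = "<disable puppet>"
FORCE_PUPPET_RUN = "<force a puppet run>"


def plan_distro_change(brokers: List[str]) -> List[Tuple[str, str]]:
    """Return the ordered (host, step) pairs for a distribution change.

    Puppet is disabled on every target broker first. The puppet change is
    expected to be merged by the operator after that, and then each broker
    is handled one at a time.
    """
    plan = [(host, DISABLE_PUPPET) for host in brokers]
    for host in brokers:
        # Stop the old process before puppet installs and starts the new one,
        # so that two kafka processes never compete for a zk session.
        plan.append((host, STOP_KAFKA))
        plan.append((host, FORCE_PUPPET_RUN))
    return plan


if __name__ == "__main__":
    # Hypothetical host names, used only for the demonstration.
    for host, step in plan_distro_change(["broker-a", "broker-b"]):
        print(f"{host}: {step}")
```

Since the same plan applies to rollout and rollback, a planner like this takes no version argument: the target version lives entirely in the puppet patch.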

Note: rollout and rollback follow the same procedure; the only thing that changes is the puppet patch. It is an open question whether we want a full rollback solution or should just let the user file a new puppet patch and restart the cookbook.

At this point the cluster is running the new kafka code, but inter.broker.protocol.version is still set to the old value. Because of a change in the schema used for the clients' offsets, rolling back is not possible after changing this value. So the user should likely wait a couple of days before proceeding further.

Then, it is sufficient to merge the inter.broker.protocol.version upgrade and roll-restart the brokers one at a time.
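A minimal sketch of the one-at-a-time restart follows. The `roll_restart` function and its `restart` and `is_healthy` callbacks are hypothetical names, and the health gate between brokers is an assumption added for illustration: the procedure above only asks for brokers to be restarted one at a time.

```python
from typing import Callable, Iterable, List


def roll_restart(
    brokers: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Restart brokers sequentially and return the hosts that were restarted.

    `restart` and `is_healthy` are caller-supplied callbacks (hypothetical);
    the loop stops at the first broker that does not report healthy.
    """
    restarted: List[str] = []
    for host in brokers:
        restart(host)
        # Assumed safety gate: do not touch the next broker until this one
        # is back, so at most one broker is down at any time.
        if not is_healthy(host):
            raise RuntimeError(
                f"{host} is not healthy after restart; "
                f"already restarted: {restarted}"
            )
        restarted.append(host)
    return restarted


if __name__ == "__main__":
    # Hypothetical host names and no-op callbacks, for demonstration only.
    log: List[str] = []
    roll_restart(["broker-a", "broker-b"], log.append, lambda host: True)
    print(log)
```

Passing the restart and health logic in as callbacks keeps the loop independent of how commands are actually executed on the hosts.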

So ideally two cookbooks, very simple ones.

Change #1247942 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] WIP: add sre.kafka.change-confluent-distro-version

https://gerrit.wikimedia.org/r/1247942

Change #1248430 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::test::broker: upgrade Kafka to 3.5

https://gerrit.wikimedia.org/r/1248430

Change #1248430 merged by Elukey:

[operations/puppet@production] role::kafka::test::broker: upgrade Kafka to 3.5

https://gerrit.wikimedia.org/r/1248430

Change #1247942 merged by Elukey:

[operations/cookbooks@master] Add the sre.kafka.change-confluent-distro-version cookbook

https://gerrit.wikimedia.org/r/1247942

Change #1249940 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::test::broker: move to Confluent Kafka 3.7

https://gerrit.wikimedia.org/r/1249940

Change #1249940 merged by Elukey:

[operations/puppet@production] role::kafka::test::broker: move to Confluent Kafka 3.7

https://gerrit.wikimedia.org/r/1249940

Change #1262008 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::test: update the inter broker protocol

https://gerrit.wikimedia.org/r/1262008

Change #1262008 merged by Elukey:

[operations/puppet@production] role::kafka::test: update the inter broker protocol

https://gerrit.wikimedia.org/r/1262008

Change #1262031 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move kafka-test1006 to Trixie

https://gerrit.wikimedia.org/r/1262031

Change #1262031 merged by Elukey:

[operations/puppet@production] Move kafka-test1006 to Trixie

https://gerrit.wikimedia.org/r/1262031

elukey claimed this task.