Investigate group.initial.rebalance.delay.ms Kafka setting
Closed, Resolved, Public

Description

As indicated in the parent task, whenever ChangeProp is restarted, or some workers die and get respawned, a significant number of rebalances happen while the workers start. This can apparently mess up broker state and lead to a situation where no consumer within the consumer group gets an assigned partition.

To prevent that, a new group.initial.rebalance.delay.ms property, defaulting to 3 seconds, was added to the Kafka broker configuration starting with version 0.11 (KIP).

I think that increasing this value to something like 10 seconds could help with initial rebalancing and save quite some load.
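To illustrate why a longer delay should help, here is a rough sketch of the idea, not of the actual coordinator logic. It assumes a simplified model: the first worker to join an empty group opens a window of group.initial.rebalance.delay.ms, every worker joining inside that window is covered by one rebalance, and every later join triggers a rebalance of its own. A real broker may also extend the window while members keep joining, which is ignored here. The function name and the worker start times are hypothetical, not measurements from ChangeProp.

```python
def count_rebalances(join_times_ms, initial_delay_ms):
    """Estimate the number of rebalances while a group starts up.

    Simplified model (an assumption, not the broker's real algorithm):
    joins inside the initial delay window share a single rebalance,
    and each join after the window causes one more rebalance.
    """
    if not join_times_ms:
        return 0
    times = sorted(join_times_ms)
    # The window opens when the first member joins the empty group.
    window_end = times[0] + initial_delay_ms
    late_joins = [t for t in times if t > window_end]
    return 1 + len(late_joins)


# Hypothetical startup: 8 workers coming up one second apart.
worker_joins = [i * 1000 for i in range(8)]

for delay in (0, 3000, 10000):
    print(delay, count_rebalances(worker_joins, delay))
```

Under these assumptions, the 3 second default still leaves the slower workers outside the window, while a 10 second delay covers the whole startup with a single rebalance.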

Unfortunately, the main Kafka cluster is still on 0.9, so this one is blocked until we upgrade it.

Event Timeline

Pchelolo created this task.

@Ottomata @elukey now that we were successful in upgrading Kafka, I think we can try increasing this to 10 seconds. Do you think the number is reasonable?

Yeah, I think that sounds fine.

+1

Change 432615 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/puppet@production] Kafka: increase group.initial.rebalance.delay.ms to 10s.

https://gerrit.wikimedia.org/r/432615

Change 432615 merged by Elukey:
[operations/puppet@production] Kafka: increase group.initial.rebalance.delay.ms to 10s.

https://gerrit.wikimedia.org/r/432615

This was deployed to production and the number of rebalance log messages during consumer startups declined, so I'm resolving the ticket.