
Investigate group.initial.rebalance.delay.ms Kafka setting
Closed, Resolved · Public

Description

As indicated in the parent task, whenever ChangeProp is restarted or some workers die and get respawned, a significant number of rebalances happen while the workers start. This can apparently corrupt broker state and lead to a situation where no consumer within the consumer group is assigned a partition.

To prevent this, a new group.initial.rebalance.delay.ms property, defaulting to 3 seconds, was added to the Kafka broker configuration starting with version 0.11 (KIP).

I think that increasing this value to something like 10 seconds could help with initial rebalancing and save quite some load.

Unfortunately the main kafka cluster is still on 0.9, so this one is blocked until we upgrade it.
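
For reference, a minimal sketch of what the proposed setting would look like as a line in a broker's server.properties, assuming brokers on Kafka 0.11 or later. The value is in milliseconds, so 10 seconds is 10000. How the value is actually templated and rolled out in production is not shown in this task, so treat the file location as an assumption.

```properties
# Broker-side setting (Kafka >= 0.11). The group coordinator waits this long
# for more consumers to join a new, empty group before the first rebalance.
# The default is 3000 (3 seconds); the value proposed in this task is 10 seconds.
group.initial.rebalance.delay.ms=10000
```

The delay applies only to the initial rebalance of an empty group, so the trade-off is that a freshly started consumer group waits up to this long before it receives its first partition assignment.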

Event Timeline

Pchelolo triaged this task as Normal priority. Mar 13 2018, 8:10 PM
Pchelolo created this task.
Restricted Application added a subscriber: Aklapper. Mar 13 2018, 8:10 PM
Ottomata moved this task from Incoming to Radar on the Analytics board. Mar 15 2018, 4:35 PM
elukey moved this task from Backlog to Keep an eye on it on the User-Elukey board. Mar 16 2018, 2:46 PM

@Ottomata @elukey now that we were successful in upgrading Kafka, I think we can try increasing this to 10 seconds. Do you think the number is reasonable?

Yeah, I think that sounds fine.

Yeah, I think that sounds fine.

+1

Change 432615 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/puppet@production] Kafka: increase group.initial.rebalance.delay.ms to 10s.

https://gerrit.wikimedia.org/r/432615

Pchelolo moved this task from blocked to doing on the Services board. May 11 2018, 7:22 PM
Pchelolo edited projects, added Services (doing); removed Services (blocked).

Change 432615 merged by Elukey:
[operations/puppet@production] Kafka: increase group.initial.rebalance.delay.ms to 10s.

https://gerrit.wikimedia.org/r/432615

Pchelolo closed this task as Resolved. May 15 2018, 11:36 PM
Pchelolo edited projects, added Services (done); removed Services (doing), Patch-For-Review.

This was deployed to production, and the number of rebalance log messages during consumer startups declined, so I'm resolving the ticket.