Investigate group.initial.rebalance.delay.ms Kafka setting
Closed, ResolvedPublic

Assigned To

Authored By

	• Pchelolo
	Mar 13 2018, 8:10 PM

Description

As indicated in the parent task, whenever ChangeProp is restarted, or some workers die and get respawned, a significant number of rebalances happen while the workers start. This can apparently mess up broker state and lead to a situation where no consumer within the consumer group gets an assigned partition.

To prevent that, a new group.initial.rebalance.delay.ms property, defaulting to 3 seconds, was added to the Kafka configuration starting with version 0.11 (KIP).

I think that increasing this value to something like 10 seconds could help with initial rebalancing and save quite some load.

Unfortunately, the main Kafka cluster is still on 0.9, so this one is blocked until we upgrade it.
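
To illustrate why a longer initial delay might reduce the number of rebalances, here is a rough Python sketch. It is a simplified model, not Kafka's coordinator code: the function name count_rebalances, the max_wait_ms cap and the worker join times are all hypothetical. It assumes that the first join to an empty group opens a delay window, that each window in which another worker joins is extended by one more delay period (up to the cap), and that every join arriving after the window closes costs one extra rebalance. On the broker side, the proposed value would correspond to group.initial.rebalance.delay.ms=10000.

```python
def count_rebalances(join_times_ms, initial_delay_ms, max_wait_ms=300_000):
    """Estimate rebalances for a group that starts empty (simplified model).

    join_times_ms    -- times (ms) at which workers join; hypothetical data
    initial_delay_ms -- modelled value of group.initial.rebalance.delay.ms
    max_wait_ms      -- assumed upper bound on the total initial delay
    """
    joins = sorted(join_times_ms)
    if not joins:
        return 0

    start = joins[0]
    deadline = start + initial_delay_ms   # end of the current delay window
    hard_limit = start + max_wait_ms      # the delay is never extended past this
    i = 1                                 # index of the next join to process

    while True:
        joined = False
        # Absorb every join that arrives before the current window closes.
        while i < len(joins) and joins[i] < deadline:
            i += 1
            joined = True
        if joined and deadline < hard_limit:
            # Someone joined during the window: extend it by one more period.
            deadline = min(deadline + initial_delay_ms, hard_limit)
        else:
            break

    # One rebalance for the batched initial joins, plus one per late joiner.
    return 1 + (len(joins) - i)


# Hypothetical worker start times during a service restart, in milliseconds.
worker_joins = [0, 1000, 2000, 6000, 7000, 15000]

no_delay = count_rebalances(worker_joins, 0)
default_delay = count_rebalances(worker_joins, 3000)
longer_delay = count_rebalances(worker_joins, 10000)
print(no_delay, default_delay, longer_delay)
```

With these made-up join times the model counts fewer rebalances as the delay grows, at the cost of the group waiting longer before its first partition assignment.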

Details

	Subject	Repo	Branch	Lines +/-
	Kafka: increase group.initial.rebalance.delay.ms to 10s.	operations/puppet	production	+19 -1


Related Objects

Status	Assigned	Task
Resolved	• Pchelolo	T179684 Kafka sometimes misses to rebalance topics properly
Resolved	Ottomata	T167039 Upgrade Kafka on main cluster with security features
Resolved	• Pchelolo	T189618 Investigate group.initial.rebalance.delay.ms Kafka setting

Event Timeline

• Pchelolo triaged this task as Medium priority. Mar 13 2018, 8:10 PM

• Pchelolo created this task.

Restricted Application added a subscriber: Aklapper. Mar 13 2018, 8:10 PM

• mobrovac removed a subtask: T189621: Enable controlled debug logging for change-prop. Mar 13 2018, 10:23 PM

Ottomata moved this task from Incoming to Radar on the Analytics board. Mar 15 2018, 4:35 PM

elukey moved this task from Backlog to Keep an eye on it on the User-Elukey board. Mar 16 2018, 2:46 PM

Ottomata added a parent task: T167039: Upgrade Kafka on main cluster with security features. Apr 16 2018, 8:02 PM

@Ottomata @elukey now that we were successful in upgrading Kafka, I think we can try increasing this to 10 seconds. Do you think the number is reasonable?

Yeah, I think that sounds fine.

In T189618#4194933, @Ottomata wrote:

Yeah, I think that sounds fine.

Change 432615 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/puppet@production] Kafka: increase group.initial.rebalance.delay.ms to 10s.

https://gerrit.wikimedia.org/r/432615

gerritbot added a project: Patch-For-Review. May 11 2018, 5:58 PM

• Pchelolo moved this task from blocked to doing on the Services board. May 11 2018, 7:22 PM

• Pchelolo edited projects, added Services (doing); removed Services (blocked).

Change 432615 merged by Elukey:
[operations/puppet@production] Kafka: increase group.initial.rebalance.delay.ms to 10s.

https://gerrit.wikimedia.org/r/432615

This was deployed to production, and the number of rebalance log messages during consumer startups declined, so I'm resolving the ticket.

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM
