
Single Kafka partition replica periodically lags
Closed, ResolvedPublic21 Estimated Story Points

Description

A few times each day, we see warning alerts for Kafka broker replica MaxLag on a single partition. The partition replica quickly catches back up, so the alert does not cause any real problems. However, it is annoying, and it shouldn't be happening.

Figure out why this is happening (maybe garbage collection?) and stop it.
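
For reference, a quick way to check whether any replica is currently behind is to list under-replicated partitions from a broker host. This is a minimal sketch assuming a standard Kafka install; the ZooKeeper connect string is a placeholder:

# Partitions whose in-sync replica set is currently smaller than the full replica set
kafka-topics.sh --describe --zookeeper zookeeper.example:2181/kafka --under-replicated-partitions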

Event Timeline

Ottomata claimed this task.
Ottomata raised the priority of this task from to Medium.
Ottomata updated the task description. (Show Details)
Ottomata added a subscriber: Ottomata.

Hm, looks like this is our problem:

https://issues.apache.org/jira/browse/KAFKA-2143

I'm not entirely sure what to do about it other than wait. Perhaps we can adjust monitoring to be more lenient about these alerts.
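
For context, the alert is driven by the broker-side MaxLag metric, which can also be polled directly with Kafka's bundled JmxTool. A sketch only; the JMX port is an assumption and depends on how the brokers expose JMX:

# Poll the replica fetcher MaxLag MBean every 5 seconds
kafka-run-class.sh kafka.tools.JmxTool \
  --object-name 'kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica' \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --reporting-interval 5000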

Change 259079 had a related patch set uploaded (by Ottomata):
Modifying kafka replica alerting to be more lenient

https://gerrit.wikimedia.org/r/259079

Change 259079 merged by Ottomata:
Modifying kafka replica alerting to be more lenient

https://gerrit.wikimedia.org/r/259079

Change 259083 had a related patch set uploaded (by Ottomata):
Update kafka submodule with alerting change

https://gerrit.wikimedia.org/r/259083

Change 259083 merged by Ottomata:
Update kafka submodule with alerting change

https://gerrit.wikimedia.org/r/259083

I've made some puppet changes that will hopefully keep alerts from triggering for this. Let's wait a few days to see if we get any new alerts.

I've also been pointed to https://issues.apache.org/jira/browse/KAFKA-2477 as a potential cause of this problem. Unfortunately, the Kafka logs are so flooded with closed-socket messages from MediaWiki that the entries that might confirm this have already been rotated away. I'm making a patch to keep more log files.
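
In the meantime, the KAFKA-2477 signature can be searched for in whatever logs are still on disk. A rough sketch, assuming the default log location used later in this task:

# The race shows up as an offset-reset WARN/ERROR pair right after a burst of segment deletions
grep -E "reset its fetch offset|out of range" /var/log/kafka/kafka.log*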

Milimetric set Security to None.

From https://issues.apache.org/jira/browse/KAFKA-2477 it seems that the original issue is a series of log segment deletions for a partition, terminated by a warning and an error caused by a race condition. I checked in Grafana, and there was a lag spike at the same time.

elukey@kafka1014:/var/log$ hostname
kafka1014

grep -v "Closing" /var/log/kafka/kafka.log

[2016-01-14 18:35:23,333] 5715135727 [ReplicaFetcherThread-3-22] INFO kafka.log.Log - Rolled new log segment for 'webrequest_mobile-7' in 15 ms.
[2016-01-14 18:36:13,114] 5715185508 [ReplicaFetcherThread-2-18] INFO kafka.log.Log - Scheduling log segment 562970413 for log mediawiki_CirrusSearchRequestSet-4 for deletion.
[2016-01-14 18:36:13,129] 5715185523 [ReplicaFetcherThread-2-18] INFO kafka.log.Log - Scheduling log segment 565052425 for log mediawiki_CirrusSearchRequestSet-4 for deletion.
[2016-01-14 18:36:13,129] 5715185523 [ReplicaFetcherThread-2-18] INFO kafka.log.Log - Scheduling log segment 567127917 for log mediawiki_CirrusSearchRequestSet-4 for deletion.
[2016-01-14 18:36:13,129] 5715185523 [ReplicaFetcherThread-2-18] INFO kafka.log.Log - Scheduling log segment 569180392 for log mediawiki_CirrusSearchRequestSet-4 for deletion.
[2016-01-14 18:36:13,141] 5715185535 [ReplicaFetcherThread-2-18] INFO kafka.log.Log - Scheduling log segment 571200398 for log mediawiki_CirrusSearchRequestSet-4 for deletion.
[2016-01-14 18:36:13,141] 5715185535 [ReplicaFetcherThread-2-18] INFO kafka.log.Log - Scheduling log segment 573252046 for log mediawiki_CirrusSearchRequestSet-4 for deletion.
[2016-01-14 18:36:13,141] 5715185535
....
....
....
[2016-01-14 18:36:13,148] 5715185542 [ReplicaFetcherThread-2-18] INFO kafka.log.Log - Scheduling log segment 654815569 for log mediawiki_CirrusSearchRequestSet-4 for deletion.
[2016-01-14 18:36:13,619] 5715186013 [ReplicaFetcherThread-2-18] WARN kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-2-18], Replica 14 for partition [mediawiki_CirrusSearchRequestSet,4] reset its fetch offset from 587590358 to current leader 18's start offset 587590358
[2016-01-14 18:36:13,619] 5715186013 [ReplicaFetcherThread-2-18] ERROR kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-2-18], Current offset 655923614 for partition [mediawiki_CirrusSearchRequestSet,4] out of range; reset offset to 587590358

I can also see spikes in "under replicated partitions" at the same time as the lags (kinda expected). It seems that this issue is indeed caused by https://issues.apache.org/jira/browse/KAFKA-2477, which has been fixed in Kafka 0.9.0.

I am still ignorant about Kafka's internals though, so I'd need more time for a definitive answer.

Marking this ticket as blocked by the Kafka 0.9 migration for the moment (since the bug has been fixed in that release), but I'll complete my investigation with more data in the next few days (one data point is not super conclusive).
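
While gathering those data points, it may also help to confirm which replica drops out of the ISR for the affected partition when a lag spike happens. A sketch; the ZooKeeper connect string is a placeholder:

# Show leader, replica assignment, and current ISR for the affected topic's partitions
kafka-topics.sh --describe --zookeeper zookeeper.example:2181/kafka --topic mediawiki_CirrusSearchRequestSet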

Change 283675 had a related patch set uploaded (by Ottomata):
Disable MaxLag icinga check in kafka::server::monitoring

https://gerrit.wikimedia.org/r/283675

Change 283675 merged by Ottomata:
Disable MaxLag icinga check in kafka::server::monitoring

https://gerrit.wikimedia.org/r/283675

Change 283677 had a related patch set uploaded (by Ottomata):
Update kafka submodule disabling MaxLag icinga alert

https://gerrit.wikimedia.org/r/283677

Change 283677 merged by Ottomata:
Update kafka submodule disabling MaxLag icinga alert

https://gerrit.wikimedia.org/r/283677

FYI I disabled the MaxLag icinga alert. We should revisit this alert after we upgrade to 0.9.

elukey added a subscriber: elukey.
Milimetric set the point value for this task to 21.
Milimetric edited projects, added Analytics-Kanban; removed Analytics.
Milimetric moved this task from Next Up to Done on the Analytics-Kanban board.

This bug was resolved by the Kafka 0.9 migration.