Every few days, analytics1021 gets kicked out of its partition leader
role.
The last few occurrences are:
2014-08-06: bug 69244
2014-08-13: bug 69665
2014-08-16: bug 69666
Version: unspecified
Severity: normal
Thanks for bugging this Christian.
IMHO, this has been a fairly regular problem with this host and attempts to reimage/fix have not been successful. Andrew/Gage -- can you update?
otto wrote:
The core of this issue is a timeout of the Zookeeper connection, which neither Gage nor I have been able to solve.
Quick summary: Kafka brokers need to maintain a live connection with Zookeeper in order to remain in the ISR. Brokers set a timeout. If a broker can't talk to Zookeeper within this timeout, it will close the connection it has and attempt to open a new one, most likely to a different Zookeeper host than it had before. Zookeeper notices when the broker closes this connection, and then tells all of the other brokers that this broker has left the ISR. Its leadership for any partitions is then demoted. It usually takes the broker less than a second to reconnect to another Zookeeper host and rejoin the cluster. Thus far, when this happens, Gage or I have logged in and manually started a preferred replica election. This brings the offending broker's leadership status back to normal.
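The timeout-and-demotion flow above can be sketched as a tiny simulation. This is illustrative only: the constant and function names are assumptions, not Kafka's actual implementation.

```python
# Hypothetical sketch of the session-timeout flow described above.
# The timeout value and names are illustrative assumptions.

ZK_SESSION_TIMEOUT_MS = 6000  # e.g. zookeeper.session.timeout.ms

def broker_session_state(last_heartbeat_ms, now_ms):
    """What happens to a broker whose last successful ZooKeeper
    heartbeat was at last_heartbeat_ms."""
    if now_ms - last_heartbeat_ms <= ZK_SESSION_TIMEOUT_MS:
        return "in ISR, leadership retained"
    # Timeout exceeded: the broker drops its connection and reconnects,
    # ZooKeeper expires the session, and the controller reassigns
    # leadership for the broker's partitions to other replicas.
    return "dropped from ISR, leadership reassigned"

print(broker_session_state(0, 5000))  # heartbeat within timeout
print(broker_session_state(0, 7000))  # session timed out
```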
This would only be an annoyance, if it weren't for the small varnishkafka hiccups it causes. Ideally, when a broker loses partition leadership, producers would be notified of the metadata change quickly enough that their buffers don't fill up. We have noticed that for some higher volume partitions (upload and/or bits), some varnishkafkas drop messages during the short time that leadership metadata is being propagated.
Action items:
See also:
https://rt.wikimedia.org/Ticket/Display.html?id=6877
otto wrote:
Today I increased kafka.queue.buffering.max.ms from 1 second to 5 seconds. Leader election can take up to 3 seconds. Hopefully this will solve the second bullet point above. We'll have to wait until the next time a leader election happens and see if we miss any data from it.
It happened again on:
2014-08-24 ~16:00 (with recovery for the next reading in ganglia. Since ganglia shows a decrease of volume for that time, it might align with the kafkatee changes on analytics1003. So it might actually be an artifact.)
2014-08-30 ~03:30 (with recovery ~06:00)
2014-09-02 ~10:50 (not yet recovered)
(I haven't yet checked if those instances also caused packetloss)
(In reply to christian from comment #4)
2014-08-24 ~16:00 (with recovery for the next reading in
Wrong day. It should read
2014-08-25 ~16:00 (with recovery for the next reading in
Sorry.
(In reply to Andrew Otto from comment #3)
Today I increased kafka.queue.buffering.max.ms from 1 second to 5 seconds.
Leader election can take up to 3 seconds. Hopefully this will solve the
second bullet point above. We'll have to wait until the next time a leader
election happens and see if we miss any data from it.
The re-elections from
2014-08-30 ~03:30
2014-09-02 ~10:50
(see comment #5) happened past the buffer increase, but both still exhibit
missing sequence numbers. So bumping the buffer size did not do the trick :-(
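The "missing sequence numbers" check above can be done mechanically: varnishkafka stamps messages with per-producer sequence numbers, so gaps in the sequence indicate loss. A minimal sketch (the input format is an assumption):

```python
# Sketch: detect dropped messages by looking for jumps in a sorted
# list of per-producer sequence numbers.

def find_gaps(seqs):
    """Return (first_missing, next_seen) pairs wherever the
    sequence jumps."""
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur != prev + 1:
            gaps.append((prev + 1, cur))
    return gaps

print(find_gaps([1, 2, 3, 7, 8]))  # [(4, 7)] -> messages 4-6 missing
print(find_gaps([1, 2, 3]))        # [] -> no loss
```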
otto wrote:
I have done some more sleuthing on the zookeeper timeouts today, and I want to record my thoughts.
and (on zookeeper leader)
[ProcessThread:-1:PrepRequestProcessor@419] - Got user-level KeeperException when processing sessionid:0x46fd72a6d4243f type:create cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/kafka/eqiad/controller Error:KeeperErrorCode = NodeExists for /kafka/eqiad/controller
error messages. That is, when the timed-out broker finally is able to reconnect (with a new session id), it attempts to re-register its previous controller status with zookeeper, only to find that another broker has taken over as controller.
I just went back to my email thread to the Kafka Users group and noticed that someone had replied[1] to my sleuthing about broker GCs back in July! I never saw this email response! Gah! I will look into this suggestion tomorrow.
Adding some more records:
To rule out temporary basic network problems, I started a script to
ping all zookeepers from analytics1021 in an endless loop.
That produced results already:
There was no output for ~10 seconds between 15:26:04 and 15:26:14
But RTT times were low before and afterwards. Since we hardly have
data to compare against, I collected the analytics1021 ganglia
graphs that showed an exceptional increase or decrease during that
period, and added the relevant parts of the kafka log and kafka gc log.
(See attachment analytics1021-2014-09-04T15-26-03.tar.gz)
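Silent gaps like the one above can be spotted mechanically by scanning the timestamped ping output. A sketch, assuming a "HH:MM:SS ..." line format (the hostname in the sample is made up):

```python
from datetime import datetime

# Sketch: scan a timestamped ping log for silent gaps longer than a
# threshold. The "HH:MM:SS ..." log format is an assumption.

def silent_gaps(lines, threshold_s=5):
    gaps = []
    times = [datetime.strptime(l.split()[0], "%H:%M:%S") for l in lines]
    for a, b in zip(times, times[1:]):
        dt = (b - a).total_seconds()
        if dt > threshold_s:
            gaps.append((a.time(), b.time(), dt))
    return gaps

log = ["15:26:03 64 bytes from zk-host ...",
       "15:26:04 64 bytes from zk-host ...",
       "15:26:14 64 bytes from zk-host ..."]
print(silent_gaps(log))  # one ~10 s gap, matching the observation above
```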
It seems most disks spiked.
kafka log looks harmless.
kafka gc log looks harmless too.
Since both the pings and kafka gc stopped during that time, it might
not be network related; the machine itself may be preempting
processes and blocking them for some time.
So I started another monitor that (in an endless loop) logs the
timestamp, and then waits a second.
That already produced relevant results too:
There was no output for ~13 seconds between 20:47:18 and 20:47:31.
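The second monitor described above (log a timestamp, sleep a second, repeat) can be sketched like this; any iteration where the wall clock advanced far more than the sleep interval means the process itself was blocked:

```python
import time

# Sketch of the timestamp-logging monitor above: sleep one second per
# iteration and flag iterations where far more wall-clock time passed,
# i.e. the process was blocked by the machine.

def detect_stalls(iterations=5, threshold_s=3.0):
    stalls = []
    prev = time.monotonic()
    for _ in range(iterations):
        time.sleep(1)
        now = time.monotonic()
        if now - prev > threshold_s:
            stalls.append(now - prev)
        prev = now
    return stalls
```

On a healthy machine this returns an empty list; a ~13 s freeze like the one observed would show up as a single large entry.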
Again most disks spiked.
kafka log looks harmless.
kafka gc log looks harmless too.
Since I lack access to the system's logs, do they show anything
interesting around
15:26:04 -- 15:26:14, or 20:47:18 -- 20:47:31
?
Trying to understand why it only happens on analytics1021 and not the
other brokers, the only difference that I could find is that
analytics1021 is the only broker that does not have $ganglia_aggregator
set to true. I am not sure if that is relevant, but seeing the quoted
email from comment #7 ... could the $ganglia_aggregator setting change
the machine's characteristics enough to warrant analytics1021
exhibiting this strange behaviour?
otto wrote:
Yeah, strange indeed that this only happens on analytics1021. I *think* we have seen this elsewhere before, but not often. And, I think not since the cluster reinstall in July.
root@palladium:~# salt -E 'analytics10(12|18|21|22).eqiad.wmnet' cmd.run 'grep -i "timed out" /var/log/kafka/kafka.log | wc -l'
analytics1012.eqiad.wmnet:
    0
analytics1018.eqiad.wmnet:
    0
analytics1021.eqiad.wmnet:
    80
analytics1022.eqiad.wmnet:
    4
otto wrote:
This happened again on 2014-09-06 around 19:43:22. Here are some relevant sar captures:
https://gist.github.com/ottomata/ae372649afc914a6f606
Something is happening with paging stats, but it is unclear if it is a symptom or a cause.
Starting around 19:43:11, pages are being freed up really fast. Then, at 19:43:22, activity drops out, as if activity is blocked by something. At 19:43:33, a ton more pages are freed all at once. Then it seems to go back to normal.
Seeing as the actual block starts at around :22, and there is weird behavior before that, maybe paging is part of the cause.
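The pattern above (free-page spike, dropout, spike) can be pulled out of the sar captures programmatically. A sketch, assuming sar's usual `-B` column layout where pgfree/s is the sixth field (the sample rows are made up, not from the gist):

```python
# Sketch: extract the pgfree/s column from `sar -B` output to spot
# the spike / dropout / spike pattern described above. The column
# position is an assumption based on sar's usual -B layout.

def pgfree_series(sar_lines):
    series = []
    for line in sar_lines:
        fields = line.split()
        # Skip header rows, whose data columns end in "/s".
        if len(fields) >= 6 and not fields[1].endswith("/s"):
            series.append((fields[0], float(fields[5])))
    return series

sample = ["19:43:11  12.0  34.0  100.0  0.0  98000.0   0.0",
          "19:43:22   0.0   0.0    0.0  0.0      0.0   0.0",
          "19:43:33  10.0  20.0   90.0  0.0  250000.0  0.0"]
print(pgfree_series(sample))  # spike, dropout, larger spike
```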
Change 163744 had a related patch set uploaded by QChris:
Force ACKs from all in-sync kafka replicas
(In reply to Toby Negrin from comment #15)
would you say that the fix did not solve this problem?
If you are referring to the change from comment 13, it totally did the trick.
For the machines it addresses, the issue is gone.
Those machines no longer lose data.
While the fix helps with the issue of losing data, it does not yet prevent
analytics1021 from dropping out.
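For reference, the ACK changes below map to librdkafka's `request.required.acks` topic property. A sketch of the setting, assuming varnishkafka's usual `kafka.topic.` prefix for topic-level librdkafka properties (the exact key may differ in the puppet changes):

```
# Acknowledgement level for produced messages:
#   -1 = wait for all in-sync replicas ("Force ACKs from all" above)
#    2 = wait for two brokers (the "Require 2 ACKs" changes below)
kafka.topic.request.required.acks = 2
```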
Change 167550 had a related patch set uploaded by QChris:
Require 2 ACKs from kafka brokers for mobile caches
Change 167551 had a related patch set uploaded by QChris:
Require 2 ACKs from kafka brokers for text caches
Change 167552 had a related patch set uploaded by QChris:
Require 2 ACKs from kafka brokers for bits caches
Change 167553 had a related patch set uploaded by QChris:
Require 2 ACKs from kafka brokers per default
Change 167550 merged by Ottomata:
Require 2 ACKs from kafka brokers for mobile caches
Happened again around 2015-02-07T01:06. Ran leader election and analytics1021 is back in the game.