Kafka API negotiation errors on kafka main brokers
Closed, Resolved · Public

Description

@mobrovac and I were checking some eventbus alarms due to Kafka timeouts, and I found this in the logs:

[2018-04-26 23:40:02,978] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.ArrayIndexOutOfBoundsException: 18
        at org.apache.kafka.common.protocol.ApiKeys.forId(ApiKeys.java:68)
        at org.apache.kafka.common.requests.AbstractRequest.getRequest(AbstractRequest.java:39)
        at kafka.network.RequestChannel$Request.<init>(RequestChannel.scala:79)
        at kafka.network.Processor$$anonfun$run$11.apply(SocketServer.scala:426)
        at kafka.network.Processor$$anonfun$run$11.apply(SocketServer.scala:421)
        at scala.collection.Iterator$class.foreach(Iterator.scala:742)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at kafka.network.Processor.run(SocketServer.scala:421)
        at java.lang.Thread.run(Thread.java:748)
[..]
[2018-04-27 00:40:42,514] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.ArrayIndexOutOfBoundsException
[..]

The errors started at the same time in eqiad and codfw, at ~23:30 UTC on the 26th. We have seen this error before, when it led to timeouts and other trouble for the Analytics cluster:

https://phabricator.wikimedia.org/T172681

The main issue was a Kafka client speaking a recent version of the Kafka API protocol (> 0.9) and trying to negotiate the Kafka version with a Kafka 0.9 cluster. Since 0.9 brokers do not support this feature, the negotiation ends in timeouts. librdkafka put a workaround in place:

https://github.com/edenhill/librdkafka/wiki/Broker-version-compatibility
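The index 18 in the exception above matches the API key of the ApiVersions request, which was introduced with Kafka 0.10.0. The following is a simplified model of the failure, not the broker's actual code: it assumes a 0.9 broker keeps a lookup table with one entry per API key it knows (0 to 16, as I recall them from the Kafka protocol guide) and indexes into it without a bounds check. The function name `api_key_for_id` is made up for illustration.

```python
# Simplified model of the lookup that fails in ApiKeys.forId on a 0.9 broker.
# Assumption: the table lists the request types known to Kafka 0.9 (ids 0..16).
KNOWN_API_KEYS_0_9 = [
    "Produce", "Fetch", "ListOffsets", "Metadata", "LeaderAndIsr",
    "StopReplica", "UpdateMetadata", "ControlledShutdown", "OffsetCommit",
    "OffsetFetch", "GroupCoordinator", "JoinGroup", "Heartbeat",
    "LeaveGroup", "SyncGroup", "DescribeGroups", "ListGroups",
]

# API key used by newer clients to ask the broker which versions it supports.
API_VERSIONS_KEY = 18


def api_key_for_id(api_key_id, known=KNOWN_API_KEYS_0_9):
    """Return the request type for an API key id (hypothetical helper).

    Like the unchecked array access in the stack trace, this raises
    (IndexError here) when the id is beyond the table.
    """
    return known[api_key_id]


if __name__ == "__main__":
    print(api_key_for_id(3))  # a request type the old broker understands
    try:
        api_key_for_id(API_VERSIONS_KEY)
    except IndexError as exc:
        print("unknown API key %d: %s" % (API_VERSIONS_KEY, exc))
```

In this model the old broker cannot even parse the request header, which is consistent with the uncaught exception in the network processor rather than a clean error response to the client.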

From what I can see in the IRC logs, the webperf nodes came to life (first puppet runs) yesterday at around the same time, after merging:

https://gerrit.wikimedia.org/r/#/c/429242/
https://phabricator.wikimedia.org/T186774

As far as I can see, webperf uses python-kafka 1.4.1, and in this case the KafkaConsumer is created without explicitly setting the Kafka API version (0.9).
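A minimal sketch of what pinning the version could look like, assuming the kafka-python `KafkaConsumer` and its `api_version` tuple parameter. This is not the statsv patch itself: the helper names, the topic name and the broker address are placeholders for illustration.

```python
# Sketch: pin the Kafka API version so the client skips version negotiation.
# Assumption: kafka-python is installed where make_consumer() is called.


def consumer_kwargs(brokers, api_version=(0, 9)):
    """Build KafkaConsumer keyword arguments with an explicit API version.

    Hypothetical helper; with api_version set, kafka-python does not probe
    the broker to discover which protocol version it speaks.
    """
    return {
        "bootstrap_servers": list(brokers),
        "api_version": tuple(api_version),
    }


def make_consumer(topic, brokers, api_version=(0, 9)):
    """Create a consumer pinned to the given API version (hypothetical helper)."""
    # Imported lazily so the kwargs helper can be used without the library.
    from kafka import KafkaConsumer

    return KafkaConsumer(topic, **consumer_kwargs(brokers, api_version))


if __name__ == "__main__":
    # Placeholder broker address and topic name, not the production values.
    print(consumer_kwargs(["localhost:9092"]))
```

Making the version a parameter rather than a constant keeps the client usable after the brokers are upgraded; hardcoding it is the simpler option while the cluster stays on 0.9.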

@Imarlier, @Krinkle could you guys review what I wrote above and let me know if it makes sense?

Event Timeline

elukey triaged this task as High priority. Apr 27 2018, 12:43 PM
elukey created this task.
Restricted Application added a project: Analytics. Apr 27 2018, 12:43 PM
Restricted Application added a subscriber: Aklapper.

@elukey yes, makes sense - I'll fix in a little bit. Sorry for the noise!

Change 429412 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/statsv@master] Add the possibility to specify the Kafka API version to KafkaConsumer

https://gerrit.wikimedia.org/r/429412

Ah nice! I sent a code review as an attempt to fix this, but I can abandon it if you have something ready, no problem!

Change 429432 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[analytics/statsv@master] statsv: Hardcode kafka api version

https://gerrit.wikimedia.org/r/429432

Change 429412 abandoned by Elukey:
Add the possibility to specify the Kafka API version to KafkaConsumer

https://gerrit.wikimedia.org/r/429412

Change 429432 merged by Imarlier:
[analytics/statsv@master] statsv: Hardcode kafka api version

https://gerrit.wikimedia.org/r/429432

elukey closed this task as Resolved. Apr 27 2018, 4:44 PM

Changes deployed by @Imarlier, everything looks good now! Thanks!

Vvjjkkii renamed this task from Kafka API negotiation errors on kafka main brokers to l3daaaaaaa. Jul 1 2018, 1:14 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Imarlier as the assignee of this task.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
AfroThundr3007730 renamed this task from l3daaaaaaa to Kafka API negotiation errors on kafka main brokers. Jul 1 2018, 6:11 AM
AfroThundr3007730 closed this task as Resolved.
AfroThundr3007730 assigned this task to Imarlier.
AfroThundr3007730 updated the task description.
AfroThundr3007730 added subscribers: GerritBot, Aklapper.