@mobrovac and I were checking some eventbus alarms due to Kafka timeouts, and I found this in the logs:
[2018-04-26 23:40:02,978] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.ArrayIndexOutOfBoundsException: 18
        at org.apache.kafka.common.protocol.ApiKeys.forId(ApiKeys.java:68)
        at org.apache.kafka.common.requests.AbstractRequest.getRequest(AbstractRequest.java:39)
        at kafka.network.RequestChannel$Request.<init>(RequestChannel.scala:79)
        at kafka.network.Processor$$anonfun$run$11.apply(SocketServer.scala:426)
        at kafka.network.Processor$$anonfun$run$11.apply(SocketServer.scala:421)
        at scala.collection.Iterator$class.foreach(Iterator.scala:742)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at kafka.network.Processor.run(SocketServer.scala:421)
        at java.lang.Thread.run(Thread.java:748)
[..]
[2018-04-27 00:40:42,514] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.ArrayIndexOutOfBoundsException
[..]
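For readers unfamiliar with the trace, here is a rough Python model of why a broker could throw on index 18. It assumes that a 0.9 broker only knows request API keys 0 to 16, and that 18 is the key of the ApiVersions request added in later releases. `for_id` and the table are illustrative stand-ins, not the broker's actual code.

```python
# Assumption: request types known to a 0.9 broker, indexed by API key (0..16).
KNOWN_KEYS_0_9 = [
    "Produce", "Fetch", "ListOffsets", "Metadata", "LeaderAndIsr",
    "StopReplica", "UpdateMetadata", "ControlledShutdown", "OffsetCommit",
    "OffsetFetch", "GroupCoordinator", "JoinGroup", "Heartbeat",
    "LeaveGroup", "SyncGroup", "DescribeGroups", "ListGroups",
]

# Assumption: API key sent by newer clients to negotiate protocol versions.
API_VERSIONS_KEY = 18


def for_id(api_key):
    """Toy stand-in for ApiKeys.forId: a plain table lookup by API key."""
    if api_key < 0:
        # Python would wrap negative indices around; a Java array would not.
        raise IndexError(api_key)
    return KNOWN_KEYS_0_9[api_key]  # raises IndexError for unknown keys


if __name__ == "__main__":
    print(for_id(3))
    try:
        for_id(API_VERSIONS_KEY)
    except IndexError:
        print("key", API_VERSIONS_KEY, "is not known to this table")
```

In this model the lookup fails before any response is built, which would fit a client that waits for a reply until it times out.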
The errors start at the same time in eqiad and codfw, ~23:30 UTC on the 26th. We have seen this error in the past, when it led to timeouts and other trouble for the Analytics cluster:
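The main issue was a Kafka client speaking a more recent version of the Kafka API protocol (> 0.9) that tried to negotiate the Kafka version with a Kafka 0.9 cluster. Since 0.9 brokers do not support this feature, the client's requests end up timing out. The 18 in the exception above is consistent with this: it is the API key of the ApiVersions request used for that negotiation, which 0.9 brokers do not know. librdkafka put a workaround in place: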
From what I can see in the IRC logs, the webperf nodes came to life (first puppet runs) yesterday at around the same time, after merging:
As far as I can see, webperf uses python-kafka 1.4.1, and in this case the KafkaConsumer class is created without explicitly setting the Kafka API version (0.9).
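A minimal sketch of what pinning the version could look like on the client side, assuming the consumer is built with kafka-python's `KafkaConsumer` and its `api_version` argument. The helper names, topic, broker address and group id below are hypothetical and not taken from the webperf code.

```python
def consumer_kwargs(bootstrap_servers, group_id, api_version=(0, 9)):
    """Build KafkaConsumer keyword arguments with the API version pinned.

    With api_version set explicitly, the client should skip broker version
    probing and speak the given protocol version from the start.
    """
    return {
        "bootstrap_servers": bootstrap_servers,
        "group_id": group_id,
        "api_version": api_version,
    }


def make_consumer(topic, bootstrap_servers, group_id):
    """Create a consumer pinned to 0.9 (hypothetical helper)."""
    # Imported here so the kwargs helper stays usable without the library.
    from kafka import KafkaConsumer

    return KafkaConsumer(topic, **consumer_kwargs(bootstrap_servers, group_id))


if __name__ == "__main__":
    # Placeholder values; substitute the real brokers and consumer group.
    print(consumer_kwargs(["localhost:9092"], "example-group"))
```

Keeping the arguments in one helper makes it easy to check that every consumer in a service is pinned in the same way.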