Investigate recent Kafka Burrow alarms for EventLogging
Open · Normal · Public

Description

This has occurred twice so far, during the following timeframes:

  • 17/03 ~ midnight UTC
  • 20/03 ~ 03:17:41 UTC

For the first one (reported in an email to analytics-internal):

Found the following error on eventlog1001:

2017-03-17 23:58:36,111 [2728] (MainThread) <BrokerConnection host=kafka1022.eqiad.wmnet/2620:0:861:106:10:64:36:122 port=9092> timed out after 40000 ms. Closing connection.

[..other errors..]

On kafka1022, a lot of the following error messages started appearing before the above timeout:

[2017-03-17 23:00:13,317] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.ArrayIndexOutOfBoundsException: 18
    at org.apache.kafka.common.protocol.ApiKeys.forId(ApiKeys.java:68)

[..]

This also caused MirrorMaker "process down" failures. (API key 18 is the ApiVersions request introduced in Kafka 0.10, which a 0.9 broker cannot parse.) It could be related to https://issues.apache.org/jira/browse/KAFKA-3593, or maybe something due to the latest librdkafka upgrade?

From https://github.com/edenhill/librdkafka/wiki/Broker-version-compatibility it seems that we should set broker.version.fallback=0.9.0.1 everywhere (the default is broker.version.fallback=0.9.0.0), but the root cause might be something else.
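
For reference, a minimal sketch of where that setting would go, assuming the affected producers use confluent-kafka-python (librdkafka underneath); the broker and topic names below are placeholders, not the real EventLogging ones:

# Sketch only: placeholder broker/topic, assuming confluent-kafka-python is in use.
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'kafka1022.eqiad.wmnet:9092',  # placeholder broker
    # Have librdkafka assume a 0.9.0.1 broker instead of the 0.9.0.0 default
    # mentioned above when it cannot probe the broker's API versions.
    'broker.version.fallback': '0.9.0.1',
    'api.version.request': False,  # 0.9 brokers do not support ApiVersionsRequest
})

producer.produce('eventlogging-example-topic', b'{"event": "test"}')  # placeholder topic
producer.flush()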

elukey created this task. · Mar 20 2017, 7:43 AM

> it seems that we should set broker.version.fallback=0.9.0.1 everywhere

Hm, maybe, except that the Burrow alarms we got are all for consumer processes, which use kafka-python, not the librdkafka-based confluent-kafka-python. kafka-python consumers do have an api_version option, which we set to 0.9 (they don't specify anything more specific).
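
For context, a rough sketch of how a kafka-python consumer pins that API version; the topic, group, and broker names here are made up, and the 40 s request timeout is only included to mirror the "timed out after 40000 ms" line from the eventlog1001 log above:

# Sketch only: hypothetical topic/group/broker names, kafka-python client.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'eventlogging-example-topic',                     # placeholder topic
    bootstrap_servers='kafka1022.eqiad.wmnet:9092',   # placeholder broker
    group_id='eventlogging-example-group',            # placeholder consumer group
    api_version=(0, 9),        # pin the protocol version instead of probing the broker
    request_timeout_ms=40000,  # matches the 40000 ms timeout seen in the log above
)

for message in consumer:
    print(message.topic, message.offset, message.value)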

This reminds me though...we upgraded librdkafka everywhere except for eventlog1001! It is still running 0.9.1. I don't think that is related to this problem at all though.

In an email, you also reported that eventlogging actually died when you restarted a broker too, right? Yikes! Since these are all consumer problems, it is unlikely that we are dropping any messages, but this type of thing might cause a few duplicates.

I'm not sure why this would be happening all of a sudden. We have been doing some eventlogging deploys recently (around UA parser changes), but nothing that should have affected kafka stuff.
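
To illustrate the duplicates point above: with at-least-once consumption, anything processed after the last committed offset is replayed when a consumer dies and restarts. A minimal sketch, assuming manual offset commits and made-up topic, group, and broker names; process() is a hypothetical handler:

# Sketch only: placeholder names; shows why a consumer crash can cause duplicates.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'eventlogging-example-topic',                     # placeholder topic
    bootstrap_servers='kafka1022.eqiad.wmnet:9092',   # placeholder broker
    group_id='eventlogging-example-group',            # placeholder consumer group
    enable_auto_commit=False,
)

for message in consumer:
    process(message.value)  # hypothetical processing step (e.g. write to storage)
    consumer.commit()       # if the process dies between process() and commit(),
                            # the message is re-delivered after restart: a duplicate,
                            # but nothing is lost.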

No no, everything auto-recovered by itself; I just reported the errors in the task/email. The only thing that died was MirrorMaker, but after the puppet run it came back up fine.

> This reminds me though...we upgraded librdkafka everywhere except for eventlog1001! It is still running 0.9.1. I don't think that is related to this problem at all though.

Can we update librdkafka on eventlog1001 too? It would be one less thing to check :)

Mentioned in SAL (#wikimedia-operations) [2017-03-20T14:27:04Z] <elukey> Temporary hack for T160886 - moved /srv/mw-log/archive/api.log-20170224.gz to /srv/mw-log/archive/api_log_backup_elukey/ to avoid rsync timeouts to stat1002 (the file is big and close to being deleted for retention)

The previous comment is of course not related, sorry.

Nuria moved this task from Incoming to Q4 (April 2017) on the Analytics board. · Mar 20 2017, 4:00 PM