
Investigate recent Kafka Burrow alarms for EventLogging
Closed, Resolved · Public

Description

This has occurred twice so far, during the following timeframes:

  • 2017-03-17, around midnight UTC
  • 2017-03-20, around 03:17:41 UTC

For the first one (also reported in an email to analytics-internal):

We found the following error on eventlog1001:

2017-03-17 23:58:36,111 [2728] (MainThread) <BrokerConnection host=kafka1022.eqiad.wmnet/2620:0:861:106:10:64:36:122 port=9092> timed out after 40000 ms. Closing connection.

[..other errors..]

On kafka1022, a lot of the following error messages started appearing before the above timeout:

[2017-03-17 23:00:13,317] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.ArrayIndexOutOfBoundsException: 18

at org.apache.kafka.common.protocol.ApiKeys.forId(ApiKeys.java:68)

[..]

This also caused MirrorMaker "process down" failures. It could be related to https://issues.apache.org/jira/browse/KAFKA-3593, or maybe something due to the latest librdkafka? The ApiKeys.forId ArrayIndexOutOfBoundsException above typically means the broker received a request with an API key id (18) it does not recognize, i.e. a client speaking a newer protocol version than the broker supports.

From https://github.com/edenhill/librdkafka/wiki/Broker-version-compatibility it seems that we should set broker.version.fallback=0.9.0.1 everywhere (the default is broker.version.fallback=0.9.0.0), but the root cause might be something else.
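
For reference, the override would look roughly like this for a librdkafka-based client (confluent-kafka-python). This is only a sketch: the broker and topic names below are placeholders, not the actual EventLogging configuration.

# Sketch of pinning broker.version.fallback in a confluent-kafka-python producer.
# Placeholder broker/topic names; not the real EventLogging config.
from confluent_kafka import Producer

conf = {
    'bootstrap.servers': 'kafka1022.eqiad.wmnet:9092',
    # librdkafka probes broker API versions; if probing fails it falls back
    # to this version, so pinning it to the actual broker release avoids the
    # client speaking a newer protocol than the broker understands.
    'broker.version.fallback': '0.9.0.1',
}

producer = Producer(conf)
producer.produce('eventlogging-valid-mixed', b'{"example": "payload"}')
producer.flush()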

Event Timeline

it seems that we should set broker.version.fallback=0.9.0.1 everywhere

Hm, maybe, except that the Burrow alarms we got are all for consumer processes, which use kafka-python, not the librdkafka-based confluent-kafka-python. kafka-python consumers do have an api_version option, which we set to 0.9 (they don't specify anything more specific).
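
For context, the kafka-python side looks roughly like the sketch below; the topic, group and broker names are illustrative, not the real EventLogging consumer config.

# Sketch of a kafka-python consumer with the api_version pin mentioned above.
# Topic, group and broker names are illustrative placeholders.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'eventlogging-client-side',                 # placeholder topic
    group_id='eventlogging_consumer_example',   # placeholder consumer group
    bootstrap_servers=['kafka1022.eqiad.wmnet:9092'],
    # Pin the protocol version instead of auto-probing the broker.
    api_version=(0, 9),
)

for message in consumer:
    print(message.topic, message.partition, message.offset, len(message.value))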

This reminds me though...we upgraded librdkafka everywhere except for eventlog1001! It is still running 0.9.1. I don't think that is related to this problem at all though.

In an email, you also reported that eventlogging actually died when you restarted a broker too, right? Yikes! Since these are all consumer problems, it is unlikely that we are dropping any messages, but this type of thing might cause a few duplicates.

I'm not sure why this would be happening all of a sudden. We have been doing some eventlogging deploys recently (around the UA parser changes), but nothing that should have affected the Kafka side.

No no, everything auto-recovered by itself; I just reported the errors in the task/email. The only thing that died was MirrorMaker, but after the Puppet run it came back up fine.

This reminds me though...we upgraded librdkafka everywhere except for eventlog1001! It is still running 0.9.1. I don't think that is related to this problem at all though.

Can we update librdkafka on eventlog1001 too? It would be one less thing to check :)
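
As a side note, a quick way to check which librdkafka a host's Python client is actually linked against (assuming confluent-kafka-python is installed there) is something like:

# Print the librdkafka version the confluent-kafka-python binding is linked against.
import confluent_kafka

# libversion() returns (version string, hex int) for the underlying librdkafka;
# version() does the same for the Python binding itself.
print('librdkafka:', confluent_kafka.libversion()[0])
print('confluent-kafka-python:', confluent_kafka.version()[0])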

Mentioned in SAL (#wikimedia-operations) [2017-03-20T14:27:04Z] <elukey> Temporary hack for T160886 - moved /srv/mw-log/archive/api.log-20170224.gz to /srv/mw-log/archive/api_log_backup_elukey/ to avoid rsync timeouts to stat1002 (the file is big and close to being deleted for retention)

The previous comment is of course not related, sorry.

elukey claimed this task.

It didn't re-occur, and after a chat with Andrew we didn't find any clear root cause. I'm inclined to close this as a one-time issue; I will re-open it if it happens again.