
varnishkafka frequently disconnects from kafka servers
Closed, Resolved · Public

Description

We have noticed that varnishkafka often disconnects from the Kafka brokers. The issue is unrelated to the Varnish 4 upgrade; it has been observed both on machines still running Varnish 3 and on servers running Varnish 4.

The peculiarity of the disconnections is that they seem to happen at fixed intervals of 5 or 10 minutes.

Aug 29 09:00:20 cp1052 varnishkafka[25020]: %3|1472461220.177|FAIL|varnishkafka#producer-1| kafka1012.eqiad.wmnet:9092/12: Receive failed: Disconnected
Aug 29 09:00:20 cp1052 varnishkafka[25020]: KAFKAERR: Kafka error (-195): kafka1012.eqiad.wmnet:9092/12: Receive failed: Disconnected
Aug 29 09:10:19 cp1052 varnishkafka[25020]: %3|1472461819.996|FAIL|varnishkafka#producer-1| kafka1018.eqiad.wmnet:9092/18: Receive failed: Disconnected
Aug 29 09:10:20 cp1052 varnishkafka[25020]: KAFKAERR: Kafka error (-195): kafka1018.eqiad.wmnet:9092/18: Receive failed: Disconnected
Aug 29 09:15:20 cp1052 varnishkafka[25020]: %3|1472462120.086|FAIL|varnishkafka#producer-1| kafka1012.eqiad.wmnet:9092/12: Receive failed: Disconnected
Aug 29 09:15:20 cp1052 varnishkafka[25020]: KAFKAERR: Kafka error (-195): kafka1012.eqiad.wmnet:9092/12: Receive failed: Disconnected

The problem started on Aug 08 at 18:30:27 (precisely), which is the same time as the following SAL entry:

18:30 ottomata: restarting kafka broker on kafka1013 to test eventlogging leader rebalances

Event Timeline

ema triaged this task as Medium priority. Aug 29 2016, 9:27 AM

Two Kafka brokers seem to have been handling less traffic since the same date:

[Screenshot attached: Screen Shot 2016-08-29 at 11.45.27 AM.png]

Maybe it is only a matter of rebalancing the kafka topic partition leaders?

EDIT: this traffic drop happened on the 18th, so much later than the start of the problem. I am going to rebalance the brokers anyway; it will help.

Mentioned in SAL [2016-08-29T09:58:33Z] <elukey> Executed 'kafka preferred-replica-election' on kafka1012 to rebalance Kafka broker leaders (T144158)
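
For reference, whether each partition's leader sits on its preferred replica (the first broker in the replica list) can be checked from any client host. Below is a minimal sketch using the confluent-kafka Python client, which wraps the same librdkafka that varnishkafka uses; the broker address and topic name are placeholders, and this is illustrative only, not what was actually run here:

#!/usr/bin/env python
# Hypothetical check: does each partition's current leader match its
# preferred replica (the first broker in the replica list)?
from confluent_kafka.admin import AdminClient

BROKERS = 'kafka1012.eqiad.wmnet:9092'   # placeholder bootstrap broker
TOPIC = 'webrequest_text'                # placeholder topic name

admin = AdminClient({'bootstrap.servers': BROKERS})
md = admin.list_topics(topic=TOPIC, timeout=10)

per_leader = {}
for pid, p in sorted(md.topics[TOPIC].partitions.items()):
    preferred = p.replicas[0] if p.replicas else None
    mark = '' if p.leader == preferred else '  <-- not on preferred replica'
    print('partition %d: leader %s, replicas %s%s'
          % (pid, p.leader, p.replicas, mark))
    per_leader[p.leader] = per_leader.get(p.leader, 0) + 1

print('partitions led per broker:', per_leader)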

There is also another interesting thing to notice:

  • from Aug 8th at ~18:30 UTC these messages were logged for various brokers:
...
Aug 08 18:30:28 cp3008 varnishkafka[26460]: %3|1470681028.980|FAIL|varnishkafka#producer-1| kafka1013.eqiad.wmnet:9092/13: 
   Connect to ipv6#[2620:0:861:104:10:64:5:13]:9092 failed: Connection refused
Aug 08 18:30:28 cp3008 varnishkafka[26460]: KAFKAERR: Kafka error (-195): kafka1013.eqiad.wmnet:9092/13: 
    Connect to ipv6#[2620:0:861:104:10:64:5:13]:9092 failed: Connection refused
...
  • from Aug 18th at 12:17 UTC the messages changed to:
...
Aug 18 12:17:35 cp3008 varnishkafka[29405]: KAFKADR: Kafka message delivery error: Local: Message timed out
Aug 18 13:05:50 cp3008 varnishkafka[29405]: %3|1471525550.966|FAIL|varnishkafka#producer-1| kafka1018.eqiad.wmnet:9092/18: Receive failed: Disconnected
Aug 18 13:05:50 cp3008 varnishkafka[29405]: KAFKAERR: Kafka error (-195): kafka1018.eqiad.wmnet:9092/18: Receive failed: Disconnected
...

From the SAL for the 18th I can see:

12:13 paravoid: cr1-eqiad: reenabling all asw row A/B/C interfaces
12:05 paravoid: cr1-eqiad: disabling all asw row B/C interfaces
12:04 paravoid: cr1-eqiad: disabling all asw row A interfaces
12:01 paravoid: cr1-eqiad: disabling Transit/Fundraising interfaces

Disconnects might happen when Kafka topic partition leaders change due to rebalancing, whereas a connection refused is a more severe error.
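
The error code -195 in the logs above is librdkafka's local transport error (RD_KAFKA_RESP_ERR__TRANSPORT); librdkafka reconnects on its own after these, so they are noisy but usually transient, unlike a persistent connection refused. As a rough illustration of telling the two classes apart in a librdkafka-based producer, here is a sketch using the confluent-kafka Python binding (an assumed setup for illustration, not varnishkafka itself):

#!/usr/bin/env python
# Hypothetical error callback for a librdkafka-based producer: treat transient
# transport errors (-195, what varnishkafka logs as KAFKAERR above) differently
# from failures that indicate the brokers are actually unreachable.
from confluent_kafka import Producer, KafkaError

def on_error(err):
    if err.code() == KafkaError._TRANSPORT:
        # Broker closed the connection or it dropped; librdkafka reconnects.
        print('transient transport error: %s' % err.str())
    elif err.code() == KafkaError._ALL_BROKERS_DOWN:
        print('FATAL: all brokers down: %s' % err.str())
    else:
        print('kafka error %s: %s' % (err.code(), err.str()))

producer = Producer({
    'bootstrap.servers': 'kafka1012.eqiad.wmnet:9092',  # placeholder
    'error_cb': on_error,
})
producer.produce('webrequest_text', b'test message')    # placeholder topic
producer.flush(10)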

Most of the errors seemed to be related to kafka1012 and kafka1018, the brokers that were not acting as partition leaders. Rebalancing the brokers helped and the disconnects stopped.

Not sure what the issue with the connection refused errors before the 18th was, but it was most likely network related and I haven't seen any other occurrences since.