
Kafka logs drowning in kafka.server.KafkaApis errors processing fetch requests since 2014-10-28 ~20:00
Closed, Declined (Public)

Description

Since 2014-10-28 ~20:00, the kafka logs on each of the four brokers have been
drowning [1] in exceptions around processing fetch requests, like the
following one:

[2014-10-31 13:20:01,425] 7574685637 [kafka-request-handler-1] ERROR kafka.server.KafkaApis - [KafkaApi-22] Error when processing fetch request for partition [webrequest_text,8] offset 25962495797 from consumer with correlation id 2
kafka.common.OffsetOutOfRangeException: Request for offset 25962495797 but we only have log segments in the range 24383057344 to 25960520433.

at kafka.log.Log.read(Log.scala:380)
at kafka.server.KafkaApis.kafka$server$KafkaApis$$readMessageSet(KafkaApis.scala:530)
at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$readMessageSets$1.apply(KafkaApis.scala:476)
at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$readMessageSets$1.apply(KafkaApis.scala:471)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
at scala.collection.immutable.Map$Map3.foreach(Map.scala:163)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:194)
at scala.collection.immutable.Map$Map3.map(Map.scala:143)
at kafka.server.KafkaApis.kafka$server$KafkaApis$$readMessageSets(KafkaApis.scala:471)
at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:437)
at kafka.server.KafkaApis.handle(KafkaApis.scala:186)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
at java.lang.Thread.run(Thread.java:744)

The exceptions are strongly biased towards the first second of each minute. [2]
This happens every minute, with some 60 extra exceptions every fifth minute. [3]

[1]

qchris@analytics1012:~$ grep 'ERROR kafka.server.KafkaApis' /var/log/kafka/kafka.log | cut -f 1 -d ' ' | uniq -c

   6 [2014-08-21
  18 [2014-08-22
   9 [2014-08-28
  12 [2014-09-09
   3 [2014-09-16
  47 [2014-09-17
  28 [2014-09-18
   6 [2014-10-08
  30 [2014-10-16
  12 [2014-10-23
1563 [2014-10-28
8640 [2014-10-29
8610 [2014-10-30
 834 [2014-10-31

qchris@analytics1018:~$ grep 'ERROR kafka.server.KafkaApis' /var/log/kafka/kafka.log | cut -f 1 -d ' ' | uniq -c

   6 [2014-08-12
   6 [2014-08-21
  18 [2014-08-22
   9 [2014-08-28
  12 [2014-09-09
   3 [2014-09-16
  45 [2014-09-17
  31 [2014-09-18
   6 [2014-10-08
  30 [2014-10-16
  12 [2014-10-23
1527 [2014-10-28
8640 [2014-10-29
8610 [2014-10-30
 834 [2014-10-31

qchris@analytics1021:~$ cat /var/log/kafka/kafka.log.1 /var/log/kafka/kafka.log | grep 'ERROR kafka.server.KafkaApis' | cut -f 1 -d ' ' | uniq -c

   4 [2014-06-24
   2 [2014-07-02
   1 [2014-08-04
   4 [2014-08-12
   4 [2014-08-21
  10 [2014-08-22
   6 [2014-08-28
  10 [2014-09-09
   2 [2014-09-16
  32 [2014-09-17
  21 [2014-09-18
   6 [2014-10-08
  30 [2014-10-16
  12 [2014-10-23
1380 [2014-10-28
8640 [2014-10-29
8610 [2014-10-30
 834 [2014-10-31

qchris@analytics1022:~$ grep 'ERROR kafka.server.KafkaApis' /var/log/kafka/kafka.log | cut -f 1 -d ' ' | uniq -c

   9 [2014-09-09
   3 [2014-09-16
  34 [2014-09-17
  20 [2014-09-18
   6 [2014-10-08
  38 [2014-10-16
  10 [2014-10-23
 508 [2014-10-28
2890 [2014-10-29
2877 [2014-10-30
 280 [2014-10-31

[2]
qchris@analytics1021:~$ cat /var/log/kafka/kafka.log.1 /var/log/kafka/kafka.log | grep 'ERROR kafka.server.KafkaApis' | cut -c 1-60 | grep 2014-10 | cut -f 3 -d : | cut -f 1 -d , | sort | uniq -c

   12 00
17466 01
 1916 02
   82 03
    6 04
    6 06
    6 07
    6 23
    6 42
    6 46
    6 52
    6 56

[3]
qchris@analytics1021:~$ cat /var/log/kafka/kafka.log.1 /var/log/kafka/kafka.log | grep 'ERROR kafka.server.KafkaApis' | cut -c 1-60 | grep 2014-10 | cut -f 2 -d : | sort | uniq -c

380 00
314 01
314 02
314 03
314 04
368 05
314 06
312 07
312 08
312 09
366 10
306 11
306 12
318 13
318 14
378 15
318 16
318 17
318 18
318 19
372 20
312 21
318 22
318 23
318 24
378 25
312 26
312 27
312 28
312 29
372 30
312 31
312 32
312 33
312 34
372 35
312 36
312 37
312 38
312 39
366 40
312 41
312 42
312 43
312 44
372 45
318 46
312 47
312 48
312 49
366 50
312 51
318 52
312 53
312 54
366 55
312 56
312 57
320 58
332 59

Version: unspecified
Severity: normal

Details

Reference
bz72812

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:53 AM
bzimport set Reference to bz72812.
bzimport added a subscriber: Unknown Object (MLST).

The time 2014-10-28 ~20:00 nicely matches the leader re-election of
bug 72550 comment 1.

otto wrote:

Ah, this is caused by Ellery's use of kafkacat on stat1002. It looks like kafkacat does not properly create the offset directory if it doesn't already exist. Instead, it creates a single file and stores some number in there. That number is the offset being requested in these error messages. I've advised Ellery to create the offset directory before running kafkacat. I will also report this as a bug to Magnus on kafkacat's github home.
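
For reference, a minimal sketch of the workaround (the broker host, topic, partition, and offset path are illustrative assumptions; -C, -o stored, and -X pass-through of librdkafka properties are standard kafkacat options):

# Create the offset store directory up front, so kafkacat does not
# mistake a stray file at that path for its offset store.
mkdir -p /home/ellery/kafka-offsets

# Consume from the stored offset; topic.offset.store.path is the
# librdkafka property backing the file-based offset store.
kafkacat -C -b analytics1012.eqiad.wmnet:9092 \
    -t webrequest_text -p 8 -o stored \
    -X topic.offset.store.path=/home/ellery/kafka-offsets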

Yay!
The logs are much cleaner again.
Thanks!

(In reply to Andrew Otto from comment #2)

I will also report this as a bug to Magnus on kafkacat's github home.

It seems the ticket is at:

https://github.com/edenhill/kafkacat/issues/9

(In reply to Andrew Otto from comment #2)

Ah, this is caused by Ellery's use of kafkacat on stat1002.

Yikes!

kafkacat not properly creating offset directories is ... well, a
bug. kafkacat is fresh, so I figure these kinds of issues are to be
expected. No worries.

And the logs are clean right now.

However, I think this bug shows three other important things that
should be discussed before closing it:

  • kafkacat can render our kafka (not kafkacat) logs unusable. That came somewhat as a surprise to me. Do we expect people to use kafkacat more, so that we need to guard against noise in the logs, or will kafkacat's use die out (except for debugging), with people using ETL-ed data or kafkatee instead?
  • kafkacat did not recover. Even if kafkacat does not properly survive the leader re-election, I had expected it to recover at some point. But in the end, it did not.

    This item can be explained away by noting that kafkacat was in fact run from cron, and hence kept restarting again and again with the faulty offsets (see the sketch after this list for checking a stored offset against the broker's retained range).
  • How come it happened right after the leader election? The timing matches too well to be a simple coincidence. Did the use of kafkacat really start right at the time we re-elected? Do we know how kafkacat (or kafkatee) survives re-elections?
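
For the recovery question, a quick way to see why a stale stored offset keeps failing is to compare it against the broker's retained offset range, e.g. with the GetOffsetShell tool that ships with Kafka (the broker host below is an illustrative assumption):

# Earliest retained offset per partition (--time -2):
kafka-run-class.sh kafka.tools.GetOffsetShell \
    --broker-list analytics1012.eqiad.wmnet:9092 \
    --topic webrequest_text --time -2

# Latest available offset per partition (--time -1):
kafka-run-class.sh kafka.tools.GetOffsetShell \
    --broker-list analytics1012.eqiad.wmnet:9092 \
    --topic webrequest_text --time -1

# Any stored offset outside that [earliest, latest] range is rejected
# with the OffsetOutOfRangeException seen in the broker logs above.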

Hence, re-opening, as I think these three items need more discussion.

otto wrote:

  • will kafkacat's use die out

I don't think we will officially support kafkacat for regular users. kafkatee is much better and more useful. kafkacat is really nice for debugging. And, if it doesn't work, we can always use the kafka console-consumer that ships with Kafka. Ellery was just using it because kafkatee hadn't been available. kafkacat isn't even officially installed anywhere; he was just using a statically compiled binary in my homedir on stat1002.
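
For reference, the console consumer mentioned above can be run like this (the ZooKeeper address is an illustrative assumption; Kafka 0.8's console consumer locates brokers via ZooKeeper):

# Tail webrequest_text using the console consumer that ships with Kafka:
kafka-console-consumer.sh --zookeeper zookeeper1001.eqiad.wmnet:2181 \
    --topic webrequest_text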

  • How comes it happened right after the leader election?

I am likely to be wrong, but I would think it is a coincidence. I had been corresponding with Ellery last week about how he can use kafkatee, and I think he started to use it around the same time. You said the logs started to appear around 10 minutes or so after the election, right? Again, I have not investigated this, and I could be wrong.

Hence, re-opening, as I think these three items need more discussion.

I think we can close this. Unless we intend to vet and support kafkacat in production, we don't need to track issues here.

Thanks for the follow-up.

I guess you're right about closing, if we don't intend to support kafkacat.