Today we saw a surge of tracked connections on all the Kafka hosts, leading to:
12:54 <icinga-wm> PROBLEM - Check size of conntrack table on kafka1013 is CRITICAL: CRITICAL: nf_conntrack is 92 % full
The problem was that kafka1013 was the only host with net.netfilter.nf_conntrack_max set to 256k instead of 524k like the other brokers, so it started dropping packets once it reached ~300k tracked connections. After raising the limit and following up with sysctl rules in Puppet, we noticed huge holes in the Kafka metrics in Grafana for all the Kafka brokers. A quick check on kafka1013 solved the mystery:
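For reference, the puppetized fix amounts to a sysctl drop-in along these lines (the file path and comments are illustrative, not the actual Puppet-managed file; 524288 is the 524k value the other brokers already had):

```
# /etc/sysctl.d/70-conntrack.conf  (hypothetical path)
# Raise the conntrack table size to match the other Kafka brokers,
# so ~300k tracked connections no longer trigger packet drops.
net.netfilter.nf_conntrack_max = 524288
```

This can be applied immediately with `sysctl -w net.netfilter.nf_conntrack_max=524288`, or loaded from the drop-in with `sysctl --system`.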
elukey@kafka1013:/var/log/jmxtrans$ tail -n 30 jmxtrans.log
        at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
[27 May 2016 11:25:20] [ServerScheduler_Worker-4] 252945958 ERROR (com.googlecode.jmxtrans.jobs.ServerJob:41) - Error
java.nio.BufferOverflowException
        at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:183)
        at java.nio.ByteBuffer.put(ByteBuffer.java:832)
        at com.googlecode.jmxtrans.model.output.StatsDWriter.doSend(StatsDWriter.java:174)
        at com.googlecode.jmxtrans.model.output.StatsDWriter.doWrite(StatsDWriter.java:152)
        at com.googlecode.jmxtrans.util.JmxUtils.runOutputWritersForQuery(JmxUtils.java:336)
        at com.googlecode.jmxtrans.util.JmxUtils.processQuery(JmxUtils.java:206)
        at com.googlecode.jmxtrans.util.JmxUtils.processQueriesForServer(JmxUtils.java:120)
        at com.googlecode.jmxtrans.util.JmxUtils.processServer(JmxUtils.java:470)
        at com.googlecode.jmxtrans.jobs.ServerJob.execute(ServerJob.java:39)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
[27 May 2016 11:25:35] [ServerScheduler_Worker-5] 252960958 ERROR (com.googlecode.jmxtrans.jobs.ServerJob:41) - Error
java.nio.BufferOverflowException
        at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:183)
        at java.nio.ByteBuffer.put(ByteBuffer.java:832)
        at com.googlecode.jmxtrans.model.output.StatsDWriter.doSend(StatsDWriter.java:174)
        at com.googlecode.jmxtrans.model.output.StatsDWriter.doWrite(StatsDWriter.java:152)
        at com.googlecode.jmxtrans.util.JmxUtils.runOutputWritersForQuery(JmxUtils.java:336)
        at com.googlecode.jmxtrans.util.JmxUtils.processQuery(JmxUtils.java:206)
        at com.googlecode.jmxtrans.util.JmxUtils.processQueriesForServer(JmxUtils.java:120)
        at com.googlecode.jmxtrans.util.JmxUtils.processServer(JmxUtils.java:470)
        at com.googlecode.jmxtrans.jobs.ServerJob.execute(ServerJob.java:39)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
[27 May 2016 11:25:42] [main] 0 INFO (com.googlecode.jmxtrans.JmxTransformer:134) - Starting Jmxtrans on : /etc/jmxtrans
[27 May 2016 11:27:30] [main] 0 INFO (com.googlecode.jmxtrans.JmxTransformer:134) - Starting Jmxtrans on : /etc/jmxtrans
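The BufferOverflowException above is thrown by ByteBuffer.put() when a write exceeds the buffer's remaining capacity, which the stack trace shows happening inside StatsDWriter's fixed-size send buffer. A minimal standalone sketch of that failure mode (the buffer size and metric payload below are made up for illustration, not jmxtrans's actual values):

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferOverflowDemo {
    public static void main(String[] args) {
        // Hypothetical small fixed-size buffer standing in for the writer's send buffer.
        ByteBuffer buf = ByteBuffer.allocate(16);
        // A statsd-style metric line larger than the buffer's capacity (28 bytes > 16).
        byte[] payload = "kafka.broker.metric:12345|g\n".getBytes(StandardCharsets.UTF_8);
        try {
            // put() does not flush or grow the buffer; it throws once capacity is exceeded.
            buf.put(payload);
        } catch (BufferOverflowException e) {
            System.out.println("BufferOverflowException: payload " + payload.length
                    + " bytes, buffer capacity " + buf.capacity());
        }
    }
}
```

Once the buffer is in this state, subsequent metric batches keep failing the same way, which matches the repeated ERROR entries and the metric holes in Grafana until the jmxtrans restart at 11:25:42.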
After a chat with @Gehel a couple of things came up:
- Why do we push to statsd rather than directly to Graphite, given that jmxtrans already buffers for us?
- The jmxtrans version installed is a bit ancient; we should upgrade to a newer one (a new release with a lot of fixes should hopefully land in the next weeks :)