
Jmxtrans failures on Kafka hosts caused metric holes in grafana
Closed, Resolved · Public

Description

Today we saw a surge in connections tracked by all the Kafka hosts, leading to:

12:54  <icinga-wm> PROBLEM - Check size of conntrack table on kafka1013 is CRITICAL: CRITICAL: nf_conntrack is 92 % full

The problem was that kafka1013 was the only host with net.netfilter.nf_conntrack_max set to 256k instead of 524k like the other brokers, triggering packet drops at ~300k tracked connections. After fixing the value and following up with sysctl rules in puppet, we noticed huge holes in the Kafka metrics in Grafana for all the Kafka brokers. A quick check on kafka1013 solved the mystery:

elukey@kafka1013:/var/log/jmxtrans$ tail -n 30 jmxtrans.log
	at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
[27 May 2016 11:25:20] [ServerScheduler_Worker-4] 252945958 ERROR (com.googlecode.jmxtrans.jobs.ServerJob:41) - Error
java.nio.BufferOverflowException
	at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:183)
	at java.nio.ByteBuffer.put(ByteBuffer.java:832)
	at com.googlecode.jmxtrans.model.output.StatsDWriter.doSend(StatsDWriter.java:174)
	at com.googlecode.jmxtrans.model.output.StatsDWriter.doWrite(StatsDWriter.java:152)
	at com.googlecode.jmxtrans.util.JmxUtils.runOutputWritersForQuery(JmxUtils.java:336)
	at com.googlecode.jmxtrans.util.JmxUtils.processQuery(JmxUtils.java:206)
	at com.googlecode.jmxtrans.util.JmxUtils.processQueriesForServer(JmxUtils.java:120)
	at com.googlecode.jmxtrans.util.JmxUtils.processServer(JmxUtils.java:470)
	at com.googlecode.jmxtrans.jobs.ServerJob.execute(ServerJob.java:39)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
[27 May 2016 11:25:35] [ServerScheduler_Worker-5] 252960958 ERROR (com.googlecode.jmxtrans.jobs.ServerJob:41) - Error
java.nio.BufferOverflowException
	at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:183)
	at java.nio.ByteBuffer.put(ByteBuffer.java:832)
	at com.googlecode.jmxtrans.model.output.StatsDWriter.doSend(StatsDWriter.java:174)
	at com.googlecode.jmxtrans.model.output.StatsDWriter.doWrite(StatsDWriter.java:152)
	at com.googlecode.jmxtrans.util.JmxUtils.runOutputWritersForQuery(JmxUtils.java:336)
	at com.googlecode.jmxtrans.util.JmxUtils.processQuery(JmxUtils.java:206)
	at com.googlecode.jmxtrans.util.JmxUtils.processQueriesForServer(JmxUtils.java:120)
	at com.googlecode.jmxtrans.util.JmxUtils.processServer(JmxUtils.java:470)
	at com.googlecode.jmxtrans.jobs.ServerJob.execute(ServerJob.java:39)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
[27 May 2016 11:25:42] [main] 0      INFO  (com.googlecode.jmxtrans.JmxTransformer:134) - Starting Jmxtrans on : /etc/jmxtrans
[27 May 2016 11:27:30] [main] 0      INFO  (com.googlecode.jmxtrans.JmxTransformer:134) - Starting Jmxtrans on : /etc/jmxtrans
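
The stack trace points at the StatsDWriter filling a fixed-capacity java.nio.ByteBuffer: ByteBuffer.put() throws BufferOverflowException as soon as the payload no longer fits in the buffer's remaining space, and the metrics for that run are lost. A minimal sketch of the same failure mode (the buffer size and metric line below are made up for illustration and are not taken from the jmxtrans code):

import java.nio.ByteBuffer;

public class BufferOverflowSketch {
    public static void main(String[] args) {
        // Hypothetical small buffer standing in for the writer's fixed-size send buffer.
        ByteBuffer buf = ByteBuffer.allocate(32);

        // A statsd-style metric line longer than the buffer's remaining capacity.
        byte[] payload = "kafka1013.kafka.server.BrokerTopicMetrics.MessagesInPerSec:12345|c".getBytes();

        // put() throws java.nio.BufferOverflowException when payload.length
        // exceeds buf.remaining() -- the same exception seen in jmxtrans.log above.
        buf.put(payload);
    }
}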

After a chat with @Gehel a couple of things came up:

  1. Why do we need to push to statsd rather than directly to graphite, since jmxtrans does the buffering for us?
  2. The jmxtrans version installed is a bit ancient; we should upgrade to a newer one (a new release with a lot of fixes is hopefully coming in the next weeks :)

Event Timeline

We didn't upgrade to a newer JMXtrans because of a verbose logging bug.

Buuut! It looks like it has been fixed?

https://github.com/jmxtrans/jmxtrans/issues/215

Why do we need to push to statsd rather than directly to graphite, since jmxtrans does the buffering for us?

Good question! Perhaps we don't!

The typical (only?) reason for pushing to statsd is aggregation across the machines sending the metrics, or when aggregation for a particular type of metric is needed. If all metrics already include the machine name and don't need aggregation, then yes, they could be pushed to graphite instead.
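
For context, "pushing directly to graphite" just means writing Graphite's plaintext protocol (one "metric.path value unix-timestamp" line per datapoint) to the carbon TCP port; since each metric name already embeds the broker hostname, no statsd aggregation step is needed. A rough sketch, with a placeholder graphite host, port and metric name rather than our real configuration:

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class GraphitePushSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; carbon's plaintext listener conventionally runs on port 2003.
        try (Socket socket = new Socket("graphite.example.org", 2003);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            long now = System.currentTimeMillis() / 1000L;
            // The host name is part of the metric path, so no cross-host aggregation is required.
            out.write("kafka.kafka1013.server.BrokerTopicMetrics.MessagesInPerSec.Count 12345 " + now + "\n");
            out.flush();
        }
    }
}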

mforns subscribed.

@elukey
Can you clarify what is the action to do in this task?
Thanks!

@mforns: Sure! I think that we should follow up on the items in the task's description, namely:

  1. Follow up with upstream and package/test/deploy the new version of jmxtrans, since the one we have is ancient. @Gehel is of course our main point of contact; I'll sync with him to figure out which version is recommended (there are still some critical bugs to be closed, so we could wait for the upcoming release).
  2. Test and evaluate whether we could remove statsd support, since jmxtrans should theoretically handle the writing out of the box. The failure that triggered this ticket happened in jmxtrans' statsd component.

I'm doing a release of jmxtrans right now. This comes with a few fixes to the stability of the graphite and statsd writers, including moving to a different resource pool implementation. A few notes:

The old graphite / statsd writer implementations are still there and untouched; new ones have been added (GraphiteWriterFactory / StatsdWriterFactory), briefly documented on the jmxtrans wiki. Our jmxtrans configuration should use those new writers. Note that I have not had the chance to use them myself in production environments, so things might break.
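
For reference, switching a query over to the new writer in the jmxtrans JSON configuration should look roughly like the sketch below; the MBean name, graphite host and port are placeholders, and the exact keys accepted by GraphiteWriterFactory should be taken from the jmxtrans wiki rather than from here:

{
  "servers": [
    {
      "host": "localhost",
      "port": "9999",
      "queries": [
        {
          "obj": "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec",
          "attr": ["Count"],
          "outputWriters": [
            {
              "@class": "com.googlecode.jmxtrans.model.output.GraphiteWriterFactory",
              "host": "graphite.example.org",
              "port": 2003
            }
          ]
        }
      ]
    }
  ]
}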

jmxtrans has been publishing .deb packages for some time. They are probably not up to any standard, but I'm happy to merge any PR that improves them, or to look at any bug report.

elukey claimed this task.

All the next steps are outlined in https://phabricator.wikimedia.org/T73322, so we can close this task.