
Jmxtrans failures on Kafka hosts caused metric holes in grafana
Closed, Resolved · Public

Description

Today we saw a surge in connections tracked by all the Kafka hosts, leading to:

12:54  <icinga-wm> PROBLEM - Check size of conntrack table on kafka1013 is CRITICAL: CRITICAL: nf_conntrack is 92 % full

The problem was that kafka1013 was the only host with net.netfilter.nf_conntrack_max set to 256k instead of 524k like the other brokers, triggering packet drops at ~300k tracked connections. After fixing the value and following up with sysctl rules in puppet, we noticed huge holes in the Kafka metrics in Grafana for all the Kafka brokers. A quick check on kafka1013 solved the mystery:

elukey@kafka1013:/var/log/jmxtrans$ tail -n 30 jmxtrans.log
	at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
[27 May 2016 11:25:20] [ServerScheduler_Worker-4] 252945958 ERROR (com.googlecode.jmxtrans.jobs.ServerJob:41) - Error
java.nio.BufferOverflowException
	at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:183)
	at java.nio.ByteBuffer.put(ByteBuffer.java:832)
	at com.googlecode.jmxtrans.model.output.StatsDWriter.doSend(StatsDWriter.java:174)
	at com.googlecode.jmxtrans.model.output.StatsDWriter.doWrite(StatsDWriter.java:152)
	at com.googlecode.jmxtrans.util.JmxUtils.runOutputWritersForQuery(JmxUtils.java:336)
	at com.googlecode.jmxtrans.util.JmxUtils.processQuery(JmxUtils.java:206)
	at com.googlecode.jmxtrans.util.JmxUtils.processQueriesForServer(JmxUtils.java:120)
	at com.googlecode.jmxtrans.util.JmxUtils.processServer(JmxUtils.java:470)
	at com.googlecode.jmxtrans.jobs.ServerJob.execute(ServerJob.java:39)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
[27 May 2016 11:25:35] [ServerScheduler_Worker-5] 252960958 ERROR (com.googlecode.jmxtrans.jobs.ServerJob:41) - Error
java.nio.BufferOverflowException
	at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:183)
	at java.nio.ByteBuffer.put(ByteBuffer.java:832)
	at com.googlecode.jmxtrans.model.output.StatsDWriter.doSend(StatsDWriter.java:174)
	at com.googlecode.jmxtrans.model.output.StatsDWriter.doWrite(StatsDWriter.java:152)
	at com.googlecode.jmxtrans.util.JmxUtils.runOutputWritersForQuery(JmxUtils.java:336)
	at com.googlecode.jmxtrans.util.JmxUtils.processQuery(JmxUtils.java:206)
	at com.googlecode.jmxtrans.util.JmxUtils.processQueriesForServer(JmxUtils.java:120)
	at com.googlecode.jmxtrans.util.JmxUtils.processServer(JmxUtils.java:470)
	at com.googlecode.jmxtrans.jobs.ServerJob.execute(ServerJob.java:39)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
[27 May 2016 11:25:42] [main] 0      INFO  (com.googlecode.jmxtrans.JmxTransformer:134) - Starting Jmxtrans on : /etc/jmxtrans
[27 May 2016 11:27:30] [main] 0      INFO  (com.googlecode.jmxtrans.JmxTransformer:134) - Starting Jmxtrans on : /etc/jmxtrans
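
The stack trace points at the StatsDWriter filling a fixed-capacity java.nio.ByteBuffer: ByteBuffer.put() throws BufferOverflowException as soon as the payload no longer fits in the buffer's remaining space, and the metrics for that run are lost. A minimal sketch of the same failure mode (the buffer size and metric line below are made up for illustration and are not taken from the jmxtrans code):

import java.nio.ByteBuffer;

public class BufferOverflowSketch {
    public static void main(String[] args) {
        // Hypothetical small buffer standing in for the writer's fixed-size send buffer.
        ByteBuffer buf = ByteBuffer.allocate(32);

        // A statsd-style metric line longer than the buffer's remaining capacity.
        byte[] payload = "kafka1013.kafka.server.BrokerTopicMetrics.MessagesInPerSec:12345|c".getBytes();

        // put() throws java.nio.BufferOverflowException when payload.length
        // exceeds buf.remaining() -- the same exception seen in jmxtrans.log above.
        buf.put(payload);
    }
}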

After a chat with @Gehel a couple of things came up:

  1. Why do we need to push to statsd rather than directly to graphite, since jmxtrans does the buffering for us?
  2. The jmxtrans version installed is a bit ancient; we should upgrade to a newer one (a new release with a lot of fixes is hopefully coming in the next weeks :)

Event Timeline

We didn't upgrade to a newer JMXtrans because of a verbose logging bug.

Buuut! It looks like it has been fixed?

https://github.com/jmxtrans/jmxtrans/issues/215

Why do we need to push to statsd rather than directly to graphite, since jmxtrans does the buffering for us?

Good question! Perhaps we don't!

The typical (only?) reason for pushing to statsd is aggregation across the machines sending the metrics, or when aggregation for a particular type of metric is needed. If all metrics already include the machine name and don't need aggregation, then yes, they could be pushed to graphite instead.
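
For context, "pushing directly to graphite" just means writing Graphite's plaintext protocol (one "metric.path value unix-timestamp" line per datapoint) to the carbon TCP port; since each metric name already embeds the broker hostname, no statsd aggregation step is needed. A rough sketch, with a placeholder graphite host, port and metric name rather than our real configuration:

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class GraphitePushSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; carbon's plaintext listener conventionally runs on port 2003.
        try (Socket socket = new Socket("graphite.example.org", 2003);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            long now = System.currentTimeMillis() / 1000L;
            // The host name is part of the metric path, so no cross-host aggregation is required.
            out.write("kafka.kafka1013.server.BrokerTopicMetrics.MessagesInPerSec.Count 12345 " + now + "\n");
            out.flush();
        }
    }
}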

mforns subscribed.

@elukey
Can you clarify what is the action to do in this task?
Thanks!

@mforns: Sure! I think that we should follow up on the items in the task's description, namely:

  1. Follow up with upstream and package/test/deploy the new version of jmxtrans, since the one we have is ancient. @Gehel is of course our main point of contact; I'll sync with him to figure out which version is recommended (there are still some critical bugs to be closed, so we could wait for the upcoming release).
  2. Test and evaluate whether we could remove statsd support, since jmxtrans should theoretically handle the writing out of the box. The failure that triggered this ticket happened in jmxtrans' statsd component.

I'm doing a release of jmxtrans right now. This comes with a few fixes to the stability of the graphite and statsd writers, including moving to a different resource pool implementation. A few notes:

The old graphite / statsd writer implementations are still there and untouched; new ones have been added (GraphiteWriterFactory / StatsdWriterFactory), briefly documented on the jmxtrans wiki. Our jmxtrans configuration should use those new writers. Note that I have not had the chance to use them myself in production environments, so things might break.
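
For reference, switching a query over to the new writer in the jmxtrans JSON configuration should look roughly like the sketch below; the MBean name, graphite host and port are placeholders, and the exact keys accepted by GraphiteWriterFactory should be taken from the jmxtrans wiki rather than from here:

{
  "servers": [
    {
      "host": "localhost",
      "port": "9999",
      "queries": [
        {
          "obj": "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec",
          "attr": ["Count"],
          "outputWriters": [
            {
              "@class": "com.googlecode.jmxtrans.model.output.GraphiteWriterFactory",
              "host": "graphite.example.org",
              "port": 2003
            }
          ]
        }
      ]
    }
  ]
}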

jmxtrans has been publishing .deb packages for some time. They are probably not up to any standard, but I'm happy to merge any PR that improves them, or to look at any bug report.

elukey claimed this task.

All the next steps are outlined in https://phabricator.wikimedia.org/T73322, so we can close this task.