Move away from jmxtrans in favor of prometheus jmx_exporter
Closed, Resolved · Public · 0 Estimated Story Points

Assigned To

None

Authored By

	elukey
	Sep 8 2017, 10:07 AM

Description

We currently use jmxtrans, deployed on most of our nodes, to export JMX metrics to Graphite, but our configuration is in poor shape:

We are far behind upstream, so updating jmxtrans to a more recent version would take some work. The last issue was https://gerrit.wikimedia.org/r/#/c/376663 with the Kafka Jumbo cluster.

Ops considers Prometheus the way to go for the future, so it makes sense to plan the move sooner rather than later.

The prometheus jmx exporter is already used in Wikimedia (deployed via scap and puppetized).

Details

	Subject	Repo	Branch	Lines +/-
	Add default_prometheus_jmx_exporter.yaml	operations/puppet	production	+29 -5

Related Objects

Status	Assigned	Task
Resolved	None	T175344 Move away from jmxtrans in favor of prometheus jmx_exporter
Resolved	elukey	T177458 Add the prometheus jmx exporter to all the Hadoop daemons
Resolved	elukey	T177459 Add a prometheus metric exporter to all the Druid daemons
Resolved	elukey	T177460 Add the prometheus jmx exporter to all the Zookeeper daemons
Resolved	elukey	T175922 Use Prometheus for Kafka JMX metrics instead of jmxtrans
Resolved	Ottomata	T175923 Port Kafka alerts from check_graphite to check_prometheus
Resolved	elukey	T177078 Decide on casing convention for JMX metrics in Prometheus
Resolved	elukey	T184794 Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie
Resolved	elukey	T184795 Add the prometheus jmx agent to AQS Cassandra
Resolved	elukey	T189529 Test/upload new cassandra 2.2.6 package (wmf3)

Event Timeline

elukey created this task. Sep 8 2017, 10:07 AM

Restricted Application added a subscriber: Aklapper. Sep 8 2017, 10:07 AM

• Suhadakashter closed this task as a duplicate of T175367: Page wikipedia. Sep 8 2017, 2:07 PM


Reedy reopened this task as Open. Sep 8 2017, 2:12 PM

elukey added a project: User-Elukey. Sep 11 2017, 5:07 PM

elukey moved this task from Backlog to Analytics Backlog on the User-Elukey board. Sep 12 2017, 10:44 AM

Services to cover: Cassandra, Zookeeper, Druid, Hadoop, Kafka

• fdans moved this task from Incoming to Backlog (Later) on the Analytics board. Sep 21 2017, 4:42 PM

elukey created subtask T177458: Add the prometheus jmx exporter to all the Hadoop daemons. Oct 5 2017, 8:34 AM

elukey created subtask T177459: Add a prometheus metric exporter to all the Druid daemons. Oct 5 2017, 8:40 AM

elukey created subtask T177460: Add the prometheus jmx exporter to all the Zookeeper daemons.

elukey added a subtask: T175922: Use Prometheus for Kafka JMX metrics instead of jmxtrans.

• Nuria closed subtask T175922: Use Prometheus for Kafka JMX metrics instead of jmxtrans as Resolved. Oct 9 2017, 4:38 PM

Change 386190 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add default_prometheus_jmx_exporter.yaml

https://gerrit.wikimedia.org/r/386190

gerritbot added a project: Patch-For-Review. Oct 24 2017, 2:13 PM

Heya @fgiunchedi, https://gerrit.wikimedia.org/r/#/c/386190/ adds a default jmx exporter config file, which lets jmx exporter scrape and transform mbeans without requiring extra configs. It actually does a pretty good job!

Here's what it does for Kafka MirrorMaker:
https://gist.github.com/ottomata/6920c6efbf58483223ad27c5672f3025

@elukey said I should check with you to see if you think the above metrics are ok (labels, etc.) before proceeding. Whatcha think?

@Ottomata the metrics look good generally! A couple of things I noticed:

These, for example, are the same metric name repeated with and without topic; I'm assuming the one without topic is the sum of all topics. For a given metric name there should always be the same set of labels, or things get confusing. I think you can either rename the per-topic metric to, say, kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate_topic, or drop it if you are not interested in per-topic values (or, conversely, drop the other one if you are not interested in the sum).

kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate{client_id="kafka-mirror-k1_to_k2-0",topic="test1",} 4.820083023470853
kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate{client_id="kafka-mirror-k1_to_k2-0",} 4.819324507740425

These seem to have a label with potentially high cardinality, do you know how node_id changes over time?

kafka_consumer_consumer_node_metrics_incoming_byte_rate{client_id="kafka-mirror-k1_to_k2-0",node_id="node--1",} 0.0
kafka_consumer_consumer_node_metrics_incoming_byte_rate{client_id="kafka-mirror-k1_to_k2-0",node_id="node-11",} 137.8430494146978
kafka_consumer_consumer_node_metrics_incoming_byte_rate{client_id="kafka-mirror-k1_to_k2-0",node_id="node-2147483636",} 6.765667106243908

Re: the default jmx config, I think it is better to be explicit rather than implicit. We could provide an out-of-the-box config to be extended, but I suspect each piece of software will have to tweak it slightly anyway.

These seem to have a label with potentially high cardinality, do you know how node_id changes over time?

Ah, k, had to look this up. These are per-broker metrics, so the cardinality should remain low. I've seen -1 mean bootstrap metrics, so these will likely always be 0. The 11 is actually the broker.id of the one broker I'm consuming from here. I dunno what 2147483636 is. But, in any case, there should only be about the same number of node-ids as there are brokers, so, in our case, 6(+2?)! https://docs.confluent.io/current/kafka/monitoring.html#id3
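One possible reading of the unexplained node-2147483636 label, offered as an assumption rather than something confirmed in this task: Kafka's Java consumer is understood to register its group coordinator connection under an id of Integer.MAX_VALUE minus the broker id, and 2147483647 - 11 is exactly 2147483636. The sketch below only does that arithmetic on the three labels from the scrape; the helper names are hypothetical.

```python
# Assumption (not confirmed in this task): the Kafka Java consumer labels its
# coordinator connection with Integer.MAX_VALUE - broker_id.
JAVA_INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE, 2147483647


def parse_node_id(label: str) -> int:
    """Turn a node_id label such as 'node--1' into the integer -1."""
    prefix = "node-"
    if not label.startswith(prefix):
        raise ValueError(f"unexpected node_id label: {label!r}")
    return int(label[len(prefix):])


def describe(label: str) -> str:
    """Classify a node_id label under the assumption stated above."""
    node = parse_node_id(label)
    if node < 0:
        return "bootstrap connection"
    if node > JAVA_INT_MAX // 2:
        return f"coordinator connection to broker {JAVA_INT_MAX - node} (assumed)"
    return f"broker {node}"


# The three node_id values seen in the MirrorMaker scrape.
for label in ("node--1", "node-11", "node-2147483636"):
    print(label, "->", describe(label))
```

If that reading holds, the label set stays bounded by the broker count, which matches the expectation that cardinality remains low.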

For a given metric name there should be always the same set of labels or things get confusing.

Just curious, in what way do they get confusing?

We can probably just rename the sum on all topics to kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_total_rate.

Change 386190 abandoned by Ottomata:
Add default_prometheus_jmx_exporter.yaml

Reason:
Godog wants explicit oh welllll

https://gerrit.wikimedia.org/r/386190

Ottomata mentioned this in T177216: Mirror topics from main Kafka clusters (from main-eqiad) into jumbo-eqiad. Oct 25 2017, 7:45 PM

In T175344#3709806, @Ottomata wrote:

These seem to have a label with potentially high cardinality, do you know how node_id changes over time?

Ah, k, had to look this up. These are per-broker metrics, so the cardinality should remain low. I've seen -1 mean bootstrap metrics, so these will likely always be 0. The 11 is actually the broker.id of the one broker I'm consuming from here. I dunno what 2147483636 is. But, in any case, there should only be about the same number of node-ids as there are brokers, so, in our case, 6(+2?)! https://docs.confluent.io/current/kafka/monitoring.html#id3

Ok! Seems to be manageable indeed.

For a given metric name there should be always the same set of labels or things get confusing.

Just curious, in what way do they get confusing?

In this particular case, for example, aggregations on the metric (e.g. sum()) will return wrong results: the result is each per-topic value plus the sum of all topics, so the traffic is counted twice.
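A minimal sketch of that double counting, using the two samples quoted from the MirrorMaker scrape. The tuple layout and helper names are hypothetical, and sum_by_name only mimics what a sum() over the metric name would add up; it is not how Prometheus evaluates queries. The second half applies the rename to ..._bytes_consumed_total_rate suggested earlier in this thread.

```python
# Samples copied from the scrape quoted in this task: (metric name, labels, value).
METRIC = "kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate"
TOTAL = "kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_total_rate"

samples = [
    (METRIC, {"client_id": "kafka-mirror-k1_to_k2-0", "topic": "test1"}, 4.820083023470853),
    (METRIC, {"client_id": "kafka-mirror-k1_to_k2-0"}, 4.819324507740425),
]


def sum_by_name(samples, name):
    """Add up every series that shares the given metric name."""
    return sum(value for metric, _labels, value in samples if metric == name)


def rename_topicless(samples, name, new_name):
    """Move series without a 'topic' label to a separate metric name."""
    return [
        (new_name if metric == name and "topic" not in labels else metric, labels, value)
        for metric, labels, value in samples
    ]


# With mixed label sets, the all-topics series is added on top of the per-topic one.
mixed = sum_by_name(samples, METRIC)

# After the rename, the original name carries only per-topic series.
fixed = sum_by_name(rename_topicless(samples, METRIC, TOTAL), METRIC)

print("mixed label sets:", mixed)
print("after rename:", fixed)
```

With a single topic the mixed result is roughly twice the real rate; with more topics it stays roughly double the true total.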

We can probably just rename the sum on all topics to kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_total_rate.

Yep that's an option as well

• Nuria closed subtask T177459: Add a prometheus metric exporter to all the Druid daemons as Resolved. Nov 28 2017, 6:22 PM

• Nuria edited projects, added Analytics-Kanban; removed Analytics. Jan 3 2018, 10:48 PM

• Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.

• Nuria moved this task from In Progress to Paused on the Analytics-Kanban board.

• fdans moved this task from Paused to Parent Tasks on the Analytics-Kanban board. Jan 4 2018, 5:34 PM

elukey moved this task from Analytics Backlog to Keep an eye on it on the User-Elukey board. Jan 12 2018, 3:16 PM

• Nuria closed subtask T177458: Add the prometheus jmx exporter to all the Hadoop daemons as Resolved. Feb 12 2018, 3:57 PM

• Nuria edited projects, added Analytics; removed Analytics-Kanban. Mar 8 2018, 6:38 PM

• Nuria moved this task from Backlog (Later) to Incoming on the Analytics board.

• Nuria edited projects, added Analytics-Kanban; removed Analytics.

Ryan.etree moved this task from Parent Tasks to Next Up on the Analytics-Kanban board. Mar 15 2018, 2:43 AM

• fdans moved this task from Next Up to Parent Tasks on the Analytics-Kanban board. Mar 22 2018, 4:50 PM

• Nuria closed subtask T184795: Add the prometheus jmx agent to AQS Cassandra as Resolved. Mar 29 2018, 11:00 PM

elukey closed subtask T184794: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie as Resolved. Apr 4 2018, 11:25 AM

• Nuria closed subtask T177460: Add the prometheus jmx exporter to all the Zookeeper daemons as Resolved. Apr 12 2018, 10:07 PM

mforns set the point value for this task to 0. May 7 2018, 4:05 PM

elukey closed this task as Resolved. Apr 16 2019, 11:01 AM
