Move away from jmxtrans in favor of prometheus jmx_exporter
Open, Needs Triage · Public · 0 Story Points

Description

We currently use jmxtrans, deployed on most of our nodes, to export JMX metrics to Graphite, but our configuration is in poor shape because:

  1. We are far behind upstream, so updating jmxtrans to a more recent version would take some work. The last issue was https://gerrit.wikimedia.org/r/#/c/376663 with the Kafka Jumbo cluster.
  2. Ops considers Prometheus the way to go for the future, so it might make sense to plan the move sooner rather than later.

The Prometheus jmx_exporter is already used at Wikimedia (deployed via scap and puppetized).

elukey created this task. Sep 8 2017, 10:07 AM
Restricted Application added a subscriber: Aklapper. Sep 8 2017, 10:07 AM
Suhadakashter closed this task as a duplicate of T175367: Page wikipedia.
Reedy reopened this task as Open. Sep 8 2017, 2:12 PM
Nuria added a subscriber: Nuria. Sep 14 2017, 3:46 PM

Cassandra, zookeeper, druid, hadoop, kafka

fdans moved this task from Incoming to Backlog (Later) on the Analytics board. Sep 21 2017, 4:42 PM

Change 386190 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add default_prometheus_jmx_exporter.yaml

https://gerrit.wikimedia.org/r/386190

Ottomata added subscribers: fgiunchedi, Ottomata. Edited Oct 24 2017, 2:15 PM

Heya @fgiunchedi, https://gerrit.wikimedia.org/r/#/c/386190/ adds a default jmx exporter config file, which lets jmx exporter scrape and transform MBeans without requiring extra configs. It actually does a pretty good job!

Here's what it does for Kafka MirrorMaker:
https://gist.github.com/ottomata/6920c6efbf58483223ad27c5672f3025

@elukey said I should check with you to see if you think the above metrics are ok (labels, etc.) before proceeding. Whatcha think?

@Ottomata the metrics look good generally! A couple of things I noticed:

These, for example, are the same metric name repeated with and without topic; I'm assuming the one without topic is the sum of all topics. For a given metric name there should be always the same set of labels or things get confusing. I think you can either rename the per-topic metric to, say, kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate_topic, or drop one of the two if you are not interested in per-topic values (or, conversely, not interested in the sum).

kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate{client_id="kafka-mirror-k1_to_k2-0",topic="test1",} 4.820083023470853
kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate{client_id="kafka-mirror-k1_to_k2-0",} 4.819324507740425
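
A minimal sketch of how this kind of mismatch could be spotted automatically in exporter output. The helper names (`parse_line`, `inconsistent_label_sets`) are hypothetical, and the regex-based parser is a simplification that only handles lines shaped like the two samples above; it is not a full parser for the Prometheus text format.

```python
import re
from collections import defaultdict

# Sample lines copied from the comment above (Prometheus text exposition format).
SAMPLE = '''
kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate{client_id="kafka-mirror-k1_to_k2-0",topic="test1",} 4.820083023470853
kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate{client_id="kafka-mirror-k1_to_k2-0",} 4.819324507740425
'''

# Simplified patterns: assume labels are always present and values contain no '}' or '"'.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_line(line):
    """Return (metric_name, labels_dict, value) or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def inconsistent_label_sets(text):
    """Map each metric name to its label-name sets, keeping only names with more than one set."""
    seen = defaultdict(set)
    for line in text.splitlines():
        parsed = parse_line(line)
        if parsed is not None:
            name, labels, _value = parsed
            seen[name].add(frozenset(labels))  # frozenset of label names only
    return {name: sets for name, sets in seen.items() if len(sets) > 1}


if __name__ == "__main__":
    for name, sets in inconsistent_label_sets(SAMPLE).items():
        print(name, sorted(sorted(s) for s in sets))
```

Any metric name reported here is exposed with more than one combination of label names, which is the situation described in the comment.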

These seem to have a label with potentially high cardinality, do you know how node_id changes over time?

kafka_consumer_consumer_node_metrics_incoming_byte_rate{client_id="kafka-mirror-k1_to_k2-0",node_id="node--1",} 0.0
kafka_consumer_consumer_node_metrics_incoming_byte_rate{client_id="kafka-mirror-k1_to_k2-0",node_id="node-11",} 137.8430494146978
kafka_consumer_consumer_node_metrics_incoming_byte_rate{client_id="kafka-mirror-k1_to_k2-0",node_id="node-2147483636",} 6.765667106243908

Re: the default JMX config, I think it is better to be explicit than implicit. We could provide an out-of-the-box config to be extended, but I suspect each piece of software will have to tweak it slightly anyway.

These seem to have a label with potentially high cardinality, do you know how node_id changes over time?

Ah, K had to look this up. These are per-broker metrics, so it should remain low. I've seen -1 mean bootstrap metrics, so these will likely always be 0. The 11 is actually the broker.id of the one broker I'm consuming from here. I dunno what 2147483636 is. But, in any case, there should only be about the same number of node-ids as there are brokers, so, in our case, 6(+2?)! https://docs.confluent.io/current/kafka/monitoring.html#id3
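
A rough way to check the cardinality question against real exporter output is to count the distinct values a label takes. The sketch below does that for `node_id` using the three sample lines from the thread; `label_values` is a hypothetical helper, not part of any exporter tooling. It also prints one arithmetic observation: 2147483636 equals 2147483647 - 11, i.e. Java's `Integer.MAX_VALUE` minus the broker id seen here. Whether that is how the id is actually derived is an assumption the thread does not confirm.

```python
import re

# Sample lines copied from the thread (Prometheus text exposition format).
SAMPLE = '''
kafka_consumer_consumer_node_metrics_incoming_byte_rate{client_id="kafka-mirror-k1_to_k2-0",node_id="node--1",} 0.0
kafka_consumer_consumer_node_metrics_incoming_byte_rate{client_id="kafka-mirror-k1_to_k2-0",node_id="node-11",} 137.8430494146978
kafka_consumer_consumer_node_metrics_incoming_byte_rate{client_id="kafka-mirror-k1_to_k2-0",node_id="node-2147483636",} 6.765667106243908
'''

NODE_ID_RE = re.compile(r'node_id="([^"]*)"')


def label_values(text, pattern=NODE_ID_RE):
    """Return the sorted distinct values captured by `pattern` in `text`."""
    return sorted({match.group(1) for match in pattern.finditer(text)})


JAVA_INT_MAX = 2**31 - 1  # 2147483647, the value of Java's Integer.MAX_VALUE

if __name__ == "__main__":
    values = label_values(SAMPLE)
    print(f"{len(values)} distinct node_id values: {values}")
    # Arithmetic observation only; the interpretation is an assumption.
    print(JAVA_INT_MAX - 11)
```

Running the same count against a scrape taken some hours later would show whether the set of `node_id` values stays bounded by the number of brokers, as expected above.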

For a given metric name there should be always the same set of labels or things get confusing.

Just curious, in what way do they get confusing?

We can probably just rename the sum on all topics to kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_total_rate.

Change 386190 abandoned by Ottomata:
Add default_prometheus_jmx_exporter.yaml

Reason:
Godog wants explicit oh welllll

https://gerrit.wikimedia.org/r/386190

These seem to have a label with potentially high cardinality, do you know how node_id changes over time?

Ah, K had to look this up. These are per-broker metrics, so it should remain low. I've seen -1 mean bootstrap metrics, so these will likely always be 0. The 11 is actually the broker.id of the one broker I'm consuming from here. I dunno what 2147483636 is. But, in any case, there should only be about the same number of node-ids as there are brokers, so, in our case, 6(+2?)! https://docs.confluent.io/current/kafka/monitoring.html#id3

Ok! Seems to be manageable indeed.

For a given metric name there should be always the same set of labels or things get confusing.

Just curious, in what way do they get confusing?

In this particular case, for example, aggregations on the metric (e.g. sum()) will return wrong results: each topic plus the sum of all topics, summed together, so the total is double-counted.
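
To make the double-counting concrete, here is a small sketch using the two sample values posted earlier in the thread. It models an aggregation over the metric name as a plain sum over every series with that name; the function names are hypothetical and this is an illustration, not Prometheus code.

```python
# Series for one metric name, as (labels, value) pairs, taken from the thread.
SERIES = [
    ({"client_id": "kafka-mirror-k1_to_k2-0", "topic": "test1"}, 4.820083023470853),
    ({"client_id": "kafka-mirror-k1_to_k2-0"}, 4.819324507740425),  # assumed to be the all-topics sum
]


def naive_sum(series):
    """Sum every series of the metric, regardless of which labels it carries."""
    return sum(value for _labels, value in series)


def per_topic_sum(series):
    """Sum only the series that carry a `topic` label."""
    return sum(value for labels, value in series if "topic" in labels)


if __name__ == "__main__":
    print("naive sum:    ", naive_sum(SERIES))
    print("per-topic sum:", per_topic_sum(SERIES))
```

With a single topic, the naive sum comes out at roughly twice the per-topic sum, because the all-topics series is added on top of the topic it already includes. Renaming or dropping one of the two series removes the need for the label filter.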

We can probably just rename the sum on all topics to kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_total_rate.

Yep that's an option as well

Nuria edited projects, added Analytics-Kanban; removed Analytics. Jan 3 2018, 10:48 PM
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.
Nuria moved this task from In Progress to Paused on the Analytics-Kanban board.
fdans moved this task from Paused to Parent Tasks on the Analytics-Kanban board. Jan 4 2018, 5:34 PM
Nuria edited projects, added Analytics; removed Analytics-Kanban. Mar 8 2018, 6:38 PM
Nuria moved this task from Backlog (Later) to Incoming on the Analytics board.
Nuria edited projects, added Analytics; removed Analytics-Kanban.
Nuria moved this task from Backlog (Later) to Incoming on the Analytics board.
Nuria edited projects, added Analytics-Kanban; removed Analytics.
fdans moved this task from Next Up to Parent Tasks on the Analytics-Kanban board. Mar 22 2018, 4:50 PM
mforns set the point value for this task to 0. May 7 2018, 4:05 PM