Use Prometheus for Kafka JMX metrics instead of jmxtrans
Closed, Resolved | Public | 8 Estimated Story Points

Description

We use a very old version of jmxtrans, and it needs to be rebuilt for Debian Stretch. We might as well take this opportunity to use Prometheus instead.

Event Timeline

Change 378040 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/prometheus-jmx-exporter@master] Initial debian commit

https://gerrit.wikimedia.org/r/378040

Change 377753 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::kafka::jumbo::broker: enable Prometheus JMX monitoring

https://gerrit.wikimedia.org/r/377753

Change 378040 abandoned by Ottomata:
Initial debian commit

https://gerrit.wikimedia.org/r/378040

Change 378037 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/prometheus-jmx-exporter@debian] Initial debian commit

https://gerrit.wikimedia.org/r/378037

Change 378037 merged by Ottomata:
[operations/debs/prometheus-jmx-exporter@debian] Initial debian commit

https://gerrit.wikimedia.org/r/378037

Change 377753 merged by Elukey:
[operations/puppet@production] role::kafka::jumbo::broker: enable Prometheus JMX monitoring

https://gerrit.wikimedia.org/r/377753

Change 378716 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add analytics instance

https://gerrit.wikimedia.org/r/378716

Hm, why are we making an 'analytics' Prometheus instance for this? kafka-jumbo is not in the Analytics VLAN, nor is it dedicated to Analytics purposes.

Change 379290 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Include jmx_exporter_config to make prometheus query Kafka jmx exporter

https://gerrit.wikimedia.org/r/379290

Dunno if I'm stepping on y'all's toes with this, but I couldn't understand why I didn't see any metrics in Prometheus, and figured this was why:

https://gerrit.wikimedia.org/r/#/c/379290/

> Hm, why are we making an 'analytics' Prometheus instance for this? kafka-jumbo is not in the Analytics VLAN, nor is it dedicated to Analytics purposes.

The new analytics instance is meant for all the new metrics that will come with next quarter's migration to Prometheus, but it does make sense not to include Kafka metrics in it. Either we use the regular operations namespace, or maybe we can come up with a new instance only for Kafka (like we probably do with Cassandra?).

@fgiunchedi what do you think?

> Dunno if I'm stepping on y'all's toes with this, but I couldn't understand why I didn't see any metrics in Prometheus, and figured this was why:
>
> https://gerrit.wikimedia.org/r/#/c/379290/

+1, let's first decide the final naming for metrics (I saw some comments on the related code review) and then we'll start polling them from the master.

Change 379720 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kafka::broker: add the cluster label to the prometheus metrics

https://gerrit.wikimedia.org/r/379720

Change 379720 merged by Elukey:
[operations/puppet@production] profile::kafka::broker: add the cluster label to the prometheus metrics

https://gerrit.wikimedia.org/r/379720

# elukey@kafka-jumbo1001:~$ curl http://10.64.0.175:7800/metrics -s | grep -i jumbo

[..]
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="Heartbeat",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="ApiVersions",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="DeleteTopics",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="OffsetFetch",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="JoinGroup",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="LeaderAndIsr",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="OffsetCommit",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="UpdateMetadata",} 1.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="AddPartitionsToTxn",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="FetchConsumer",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="LeaveGroup",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="EndTxn",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="Fetch",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="FindCoordinator",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="Produce",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="AlterConfigs",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="DeleteAcls",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="SyncGroup",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="ControlledShutdown",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="DescribeConfigs",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="Offsets",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="FetchFollower",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="AddOffsetsToTxn",} 0.0
[..]
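
For anyone who wants to check this output with a script rather than by eye, here is a minimal sketch of parsing such lines into a metric name, a label dictionary and a value. The function name `parse_sample` and both regular expressions are illustrative assumptions, not part of any patch in this task; the parser is deliberately simplified (it assumes no `}` inside label values) and tolerates the trailing comma that the exporter prints inside the braces.

```python
import re

# One sample per line: name{k="v",k2="v2",} value
# Simplification (assumption): label values never contain '}'.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)
# key="value" pairs; a trailing comma after the last pair is simply ignored.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (name, labels, value), or None for comments, blanks and non-samples."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


if __name__ == "__main__":
    line = ('kafka_network_requestmetrics_requestqueuetimems'
            '{cluster="jumbo",request="UpdateMetadata",} 1.0')
    print(parse_sample(line))
```

Feeding it the `curl` output line by line and filtering on `labels.get("cluster") == "jumbo"` would give the same selection as the `grep` above, with the `request` label available as a key.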

Change 379734 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kafka::broker: remove graphite metrics config

https://gerrit.wikimedia.org/r/379734

Change 379734 merged by Elukey:
[operations/puppet@production] profile::kafka::broker: remove graphite metrics config

https://gerrit.wikimedia.org/r/379734

>> Hm, why are we making an 'analytics' Prometheus instance for this? kafka-jumbo is not in the Analytics VLAN, nor is it dedicated to Analytics purposes.
>
> The new analytics instance is meant for all the new metrics that will come with next quarter's migration to Prometheus, but it does make sense not to include Kafka metrics in it. Either we use the regular operations namespace, or maybe we can come up with a new instance only for Kafka (like we probably do with Cassandra?).
>
> @fgiunchedi what do you think?

Yeah, the idea is to have dedicated Prometheus instances roughly per team, in this case "analytics", to collect e.g. Hadoop, Kafka, etc. metrics in it. When there are useful aggregated metrics, we can collect them in the global Prometheus instance too.

> # elukey@kafka-jumbo1001:~$ curl http://10.64.0.175:7800/metrics -s | grep -i jumbo
>
> [..]
> kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="Heartbeat",} 0.0
> kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="ApiVersions",} 0.0

See my comments on https://gerrit.wikimedia.org/r/#/c/377753/ re: cluster usage

Metric names look good overall! I found some that could be turned into key/values, but it shouldn't be a blocker.

The delayedoperation below could be moved to something like operation=deleterecords, operation=fetch, etc.

kafka_server_delayedoperationpurgatory_numdelayedoperations_delayedoperation_deleterecords{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations_delayedoperation_fetch{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations_delayedoperation_heartbeat{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations_delayedoperation_produce{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations_delayedoperation_rebalance{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations_delayedoperation_topic{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations_delayedoperation_txn_marker_purgatory{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_deleterecords{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_fetch{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_heartbeat{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_produce{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_rebalance{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_topic{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_txn_marker_purgatory{cluster="jumbo",} 0.0
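
To make the suggestion concrete, here is a small sketch of the renaming applied to already-exported lines. This is only an illustration in Python: the real fix belongs in the exporter configuration, which is not shown in this thread. The function name `fold_operation`, the regular expression and the label name `operation` (taken from the wording of the comment above) are all assumptions.

```python
import re

# Assumed shape of the flattened names listed above:
#   <base>_delayedoperation_<operation>{labels} value
FLAT_RE = re.compile(
    r'^(kafka_server_delayedoperationpurgatory_'
    r'(?:numdelayedoperations|purgatorysize))'
    r'_delayedoperation_(\w+)\{(.*)\}\s+(\S+)$'
)


def fold_operation(line):
    """Move the operation suffix of the metric name into an 'operation' label."""
    match = FLAT_RE.match(line.strip())
    if match is None:
        return line  # unrelated lines are returned unchanged
    name, operation, labels, value = match.groups()
    labels = labels.rstrip(",")  # drop the exporter's trailing comma
    parts = [p for p in (labels, f'operation="{operation}"') if p]
    return f'{name}{{{",".join(parts)}}} {value}'


if __name__ == "__main__":
    flat = ('kafka_server_delayedoperationpurgatory_purgatorysize'
            '_delayedoperation_fetch{cluster="jumbo",} 0.0')
    print(fold_operation(flat))
```

With the operation as a label, one query on `kafka_server_delayedoperationpurgatory_purgatorysize` covers every operation, instead of one metric name per operation. Note that this sketch can only recover the lowercased, underscore-joined operation names present in the flattened metric names.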

Change 380509 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kafka::broker_prometheus_exp: update delayed op metric

https://gerrit.wikimedia.org/r/380509

Tested the patch in labs:

kafka_server_delayedoperationpurgatory_numdelayedoperations{delayedoperation="DeleteRecords",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations{delayedoperation="Fetch",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations{delayedoperation="Heartbeat",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations{delayedoperation="Produce",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations{delayedoperation="Rebalance",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations{delayedoperation="topic",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations{delayedoperation="txn-marker-purgatory",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize{delayedoperation="DeleteRecords",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize{delayedoperation="Fetch",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize{delayedoperation="Heartbeat",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize{delayedoperation="Produce",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize{delayedoperation="Rebalance",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize{delayedoperation="topic",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize{delayedoperation="txn-marker-purgatory",} 0.0

I am a bit unsure about the delayedoperation values txn-marker-purgatory and topic; I'll investigate their meaning.

Change 380509 merged by Elukey:
[operations/puppet@production] profile::kafka::broker_prometheus_exp: update delayed op metric

https://gerrit.wikimedia.org/r/380509

> Yeah, the idea is to have dedicated Prometheus instances roughly per team, in this case "analytics", to collect e.g. Hadoop, Kafka, etc. metrics in it. When there are useful aggregated metrics, we can collect them in the global Prometheus instance too.

@Ottomata, @elukey and I chatted a bit on IRC yesterday about where to put Kafka metrics. My takeaway is that the metrics for all clusters should live in ops, since there is a desire to not associate Kafka with analytics and to treat it as a shared responsibility.

Change 380744 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::prometheus::ops: add kafka metrics

https://gerrit.wikimedia.org/r/380744

Change 380763 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::kafka::jumbo::broker: rename cluster hiera variable

https://gerrit.wikimedia.org/r/380763

Change 380763 merged by Elukey:
[operations/puppet@production] role::kafka::jumbo::broker: rename cluster hiera variable

https://gerrit.wikimedia.org/r/380763

Change 380744 merged by Elukey:
[operations/puppet@production] role::prometheus::ops: add kafka metrics

https://gerrit.wikimedia.org/r/380744

Change 381177 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::kafka::jumbo::broker: allow ganglia configuration

https://gerrit.wikimedia.org/r/381177

Change 381177 merged by Elukey:
[operations/puppet@production] role::kafka::jumbo::broker: allow ganglia configuration

https://gerrit.wikimedia.org/r/381177

Change 381178 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add kafka-jumbo to the list of Ganglia clusters

https://gerrit.wikimedia.org/r/381178

Change 381178 merged by Elukey:
[operations/puppet@production] Add kafka-jumbo to the list of Ganglia clusters

https://gerrit.wikimedia.org/r/381178

Change 381412 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kafka::broker: allow prometheus masters for port 7800

https://gerrit.wikimedia.org/r/381412

Change 381412 merged by Elukey:
[operations/puppet@production] profile::kafka::broker: allow prometheus masters for port 7800

https://gerrit.wikimedia.org/r/381412

elukey moved this task from In Progress to Done on the Analytics-Kanban board.

I just verified that all the metrics we had in the Kafka dashboard are now shown by the new Prometheus-only dashboard. There is an ongoing discussion on metric naming etc., but the purpose of this task is met.

Change 379290 abandoned by Ottomata:
Include jmx_exporter_config to make prometheus query Kafka jmx exporter

https://gerrit.wikimedia.org/r/379290

Change 378716 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add analytics instance

https://gerrit.wikimedia.org/r/378716