We use a very old version of jmxtrans, and it needs to be rebuilt for Debian Stretch. We might as well take this opportunity to use Prometheus instead.
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Declined | | elukey | T166833 Produce webrequests from varnishkafka to Kafka with Kafka message timestamp set to configurable content field |
| Resolved | | Ottomata | T152015 Provision new Kafka cluster(s) with security features |
| Resolved | | None | T175344 Move away from jmxtrans in favor of prometheus jmx_exporter |
| Resolved | | elukey | T175922 Use Prometheus for Kafka JMX metrics instead of jmxtrans |
| Resolved | | Ottomata | T175923 Port Kafka alerts from check_graphite to check_prometheus |
| Resolved | | elukey | T177078 Decide on casing convention for JMX metrics in Prometheus |
Event Timeline
Change 378040 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/prometheus-jmx-exporter@master] Initial debian commit
Change 377753 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::kafka::jumbo::broker: enable Prometheus JMX monitoring
Change 378037 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/prometheus-jmx-exporter@debian] Initial debian commit
Change 378037 merged by Ottomata:
[operations/debs/prometheus-jmx-exporter@debian] Initial debian commit
Change 377753 merged by Elukey:
[operations/puppet@production] role::kafka::jumbo::broker: enable Prometheus JMX monitoring
Change 378716 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add analytics instance
HM, why are we making an 'analytics' prometheus instance for this? kafka-jumbo is not in the Analytics VLAN, nor is it dedicated for Analytics purposes.
Change 379290 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Include jmx_exporter_config to make prometheus query Kafka jmx exporter
Dunno if I'm stepping on yall's toes with this, but I couldn't understand why I didn't see any metrics in prometheus, and figured this was why:
The new analytics instance is meant for all the new metrics that will come with next quarter's migration to Prometheus, but it does make sense not to include Kafka metrics in it. Either we use the regular operations namespace, or maybe we can come up with a new instance only for Kafka (like we probably do with Cassandra?).
@fgiunchedi what do you think?
+1, let's first decide the final naming for metrics (I saw some comments on the related code review) and then we'll start polling them from the master.
Change 379720 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kafka::broker: add the cluster label to the prometheus metrics
Change 379720 merged by Elukey:
[operations/puppet@production] profile::kafka::broker: add the cluster label to the prometheus metrics
# elukey@kafka-jumbo1001:~$ curl http://10.64.0.175:7800/metrics -s | grep -i jumbo
[..]
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="Heartbeat",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="ApiVersions",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="DeleteTopics",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="OffsetFetch",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="JoinGroup",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="LeaderAndIsr",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="OffsetCommit",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="UpdateMetadata",} 1.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="AddPartitionsToTxn",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="FetchConsumer",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="LeaveGroup",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="EndTxn",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="Fetch",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="FindCoordinator",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="Produce",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="AlterConfigs",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="DeleteAcls",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="SyncGroup",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="ControlledShutdown",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="DescribeConfigs",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="Offsets",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="FetchFollower",} 0.0
kafka_network_requestmetrics_requestqueuetimems{cluster="jumbo",request="AddOffsetsToTxn",} 0.0
[..]
Change 379734 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kafka::broker: remove graphite metrics config
Change 379734 merged by Elukey:
[operations/puppet@production] profile::kafka::broker: remove graphite metrics config
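The curl check above (every exported sample should carry the cluster label) could also be scripted. This is a minimal sketch, not something used in this task: `parse_sample` and `samples_missing_label` are hypothetical helper names, and the regular expressions only cover the simple `name{label="value",} number` lines shown in the output above (no escaped quotes, no label-less samples).

```python
import re

# Simplified pattern for one sample line: name{labels} value
SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
# One label="value" pair inside the braces (trailing commas are ignored)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Return (name, labels, value) for a sample line, or None otherwise."""
    match = SAMPLE.match(line.strip())
    if match is None:
        return None  # comments, blank lines, "[..]" markers, etc.
    labels = dict(LABEL.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def samples_missing_label(lines, label="cluster"):
    """Return the names of all parsed samples that lack the given label."""
    missing = []
    for line in lines:
        parsed = parse_sample(line)
        if parsed is not None and label not in parsed[1]:
            missing.append(parsed[0])
    return missing
```

Feeding it the lines returned by the exporter on port 7800 would give an empty list when every sample has the label.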
Yeah the idea is to have dedicated Prometheus instances roughly per-team, in this case "analytics" to collect e.g. hadoop, kafka, etc metrics in it. When there are useful aggregated metrics we can collect them in the global prometheus instance too.
See my comments on https://gerrit.wikimedia.org/r/#/c/377753/ re: cluster usage
Metric names look good overall! I found some that could be turned into key/values, but it shouldn't be a blocker.
The delayedoperation below could be moved to something like operation=deleterecords, operation=fetch, etc.
kafka_server_delayedoperationpurgatory_numdelayedoperations_delayedoperation_deleterecords{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations_delayedoperation_fetch{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations_delayedoperation_heartbeat{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations_delayedoperation_produce{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations_delayedoperation_rebalance{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations_delayedoperation_topic{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations_delayedoperation_txn_marker_purgatory{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_deleterecords{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_fetch{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_heartbeat{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_produce{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_rebalance{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_topic{cluster="jumbo",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_txn_marker_purgatory{cluster="jumbo",} 0.0
Change 380509 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kafka::broker_prometheus_exp: update delayed op metric
Tested the patch in labs:
kafka_server_delayedoperationpurgatory_numdelayedoperations{delayedoperation="DeleteRecords",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations{delayedoperation="Fetch",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations{delayedoperation="Heartbeat",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations{delayedoperation="Produce",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations{delayedoperation="Rebalance",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations{delayedoperation="topic",} 0.0
kafka_server_delayedoperationpurgatory_numdelayedoperations{delayedoperation="txn-marker-purgatory",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize{delayedoperation="DeleteRecords",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize{delayedoperation="Fetch",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize{delayedoperation="Heartbeat",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize{delayedoperation="Produce",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize{delayedoperation="Rebalance",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize{delayedoperation="topic",} 0.0
kafka_server_delayedoperationpurgatory_purgatorysize{delayedoperation="txn-marker-purgatory",} 0.0
I am a bit unsure about the delayedoperation txn-marker-purgatory and topic, I'll investigate their meaning.
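To illustrate the renaming discussed above, here is a rough sketch of the mapping from the flattened metric names to a base name plus a `delayedoperation` label. It is not the jmx_exporter rule from the patch (that config is not shown in this task); `split_delayed_operation` and `render` are hypothetical names. Because the flattened names were already lowercased, the label values produced here are lowercase, unlike the exporter output tested in labs.

```python
import re

# Matches the flattened names reported earlier in this task, e.g.
# kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_fetch
FLATTENED = re.compile(
    r"^(kafka_server_delayedoperationpurgatory_"
    r"(?:numdelayedoperations|purgatorysize))"
    r"_delayedoperation_(\w+)$"
)


def split_delayed_operation(name):
    """Return (metric_name, labels) with the operation moved into a label."""
    match = FLATTENED.match(name)
    if match is None:
        return name, {}  # unrelated metric: leave the name untouched
    return match.group(1), {"delayedoperation": match.group(2)}


def render(name, labels, value):
    """Format one sample line (without the exporter's trailing comma)."""
    label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_text}}} {value}"


old = "kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_fetch"
name, labels = split_delayed_operation(old)
print(render(name, labels, 0.0))
```

With the operation in a label, a single query on `kafka_server_delayedoperationpurgatory_purgatorysize` can aggregate or filter across all operations instead of needing one metric name per operation.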
Change 380509 merged by Elukey:
[operations/puppet@production] profile::kafka::broker_prometheus_exp: update delayed op metric
Change 380744 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::prometheus::ops: add kafka metrics
Change 380763 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::kafka::jumbo::broker: rename cluster hiera variable
Change 380763 merged by Elukey:
[operations/puppet@production] role::kafka::jumbo::broker: rename cluster hiera variable
Change 380744 merged by Elukey:
[operations/puppet@production] role::prometheus::ops: add kafka metrics
Change 381177 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::kafka::jumbo::broker: allow ganglia configuration
Change 381177 merged by Elukey:
[operations/puppet@production] role::kafka::jumbo::broker: allow ganglia configuration
Change 381178 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add kafka-jumbo to the list of Ganglia clusters
Change 381178 merged by Elukey:
[operations/puppet@production] Add kafka-jumbo to the list of Ganglia clusters
Change 381412 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kafka::broker: allow prometheus masters for port 7800
Change 381412 merged by Elukey:
[operations/puppet@production] profile::kafka::broker: allow prometheus masters for port 7800
I just verified that all the metrics we had in the Kafka dashboard are now shown by the new Prometheus-only dashboard. There is an ongoing discussion on metric naming etc., but the purpose of this task is met.
Change 379290 abandoned by Ottomata:
Include jmx_exporter_config to make prometheus query Kafka jmx exporter
Change 378716 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add analytics instance