Enable Prometheus metrics export for Cassandra
Open, Low, Public

Description

Prometheus would offer several benefits over Graphite. Most significantly, it would give us a more straightforward way to share dashboards across clusters: in the Prometheus data model, the cluster is an attribute (label) of each metric, and can easily be templated in Grafana.

The easiest way to set this up appears to be jmx_exporter, a JVM agent that starts its own in-process HTTP server to export metrics. I have (lightly) tested this in deployment-prep, and it seems to work well. An added benefit of this approach is that we could eliminate cassandra-metrics-collector (one less application to maintain, and one less moving part on each host).
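
As a rough illustration of what loading the agent involves: jmx_exporter attaches through the JVM's -javaagent flag, which takes the jar path followed by =<port>:<config file>. The sketch below builds that option and appends it to JVM_OPTS. The jar and config paths are placeholders, the helper name is made up for the example, and port 7800 is simply the one used for the canary test later in this task. None of these are taken from the actual Puppet change.

```shell
#!/bin/sh
# Sketch only: the paths below are hypothetical placeholders, not the deployed ones.
EXPORTER_JAR="${EXPORTER_JAR:-/path/to/jmx_prometheus_javaagent.jar}"
EXPORTER_CONFIG="${EXPORTER_CONFIG:-/path/to/prometheus_jmx_exporter.yaml}"
EXPORTER_PORT="${EXPORTER_PORT:-7800}"

# Print the -javaagent option in the form jmx_exporter expects:
#   -javaagent:<jar>=<port>:<config.yaml>
# (javaagent_opt is a hypothetical helper name.)
javaagent_opt() {
    printf '%s\n' "-javaagent:$1=$2:$3"
}

# Append the option to JVM_OPTS, as one might do from cassandra-env.sh.
JVM_OPTS="${JVM_OPTS:-} $(javaagent_opt "$EXPORTER_JAR" "$EXPORTER_PORT" "$EXPORTER_CONFIG")"
echo "$JVM_OPTS"
```

Once the JVM is restarted with this option, the agent should answer on http://<host>:<port>/metrics.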

Next steps:

  • Fork https://github.com/prometheus/jmx_exporter to the Wikimedia account, and tag a release
  • Upload a build to Archiva
  • Get a deployment repository set up
  • Puppetize the loading of the agent, and Ferm rules
  • Push to Staging and deployment-prep for further evaluation

See also: https://github.com/prometheus/jmx_exporter/issues/113

Eevans created this task. Jan 11 2017, 8:21 PM
Restricted Application added a subscriber: Aklapper. Jan 11 2017, 8:21 PM
Eevans triaged this task as "Low" priority. Jan 11 2017, 8:22 PM
Eevans added a subscriber: fgiunchedi.

Change 331911 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: add jmx_exporter to Cassandra in deployment-prep

https://gerrit.wikimedia.org/r/331911

A request for a deployment repository (operations/software/prometheus_jmx_exporter) has been submitted: https://www.mediawiki.org/wiki/Git/New_repositories/Requests

Change 331911 merged by Filippo Giunchedi:
cassandra: add jmx_exporter to Cassandra in deployment-prep

https://gerrit.wikimedia.org/r/331911

Change 332535 had a related patch set uploaded (by Eevans):
WIP: Enable Prometheus JMX exporter on Cassandra nodes

https://gerrit.wikimedia.org/r/332535

Change 332542 had a related patch set uploaded (by Eevans):
Prometheus JMX exporter deploy repository

https://gerrit.wikimedia.org/r/332542

Volans added a subscriber: Gehel. Jan 17 2017, 7:35 PM
Eevans edited the task description. Jan 17 2017, 10:12 PM

Change 332682 had a related patch set uploaded (by Eevans):
fix incorrect port in ferm rule

https://gerrit.wikimedia.org/r/332682

Change 332682 merged by Filippo Giunchedi:
fix incorrect port in ferm rule

https://gerrit.wikimedia.org/r/332682

Change 332542 merged by Eevans:
Prometheus JMX exporter deploy repository

https://gerrit.wikimedia.org/r/332542

Change 332535 merged by Filippo Giunchedi:
Enable Prometheus JMX exporter on Cassandra nodes

https://gerrit.wikimedia.org/r/332535

Eevans edited the task description. Thu, Feb 2, 8:31 PM

Change 335826 had a related patch set uploaded (by Eevans):
Enable JMX exporter on RESTBase Staging nodes in eqiad

https://gerrit.wikimedia.org/r/335826

Change 336831 had a related patch set uploaded (by Eevans):
Update path of exporter jar to currently deployed version

https://gerrit.wikimedia.org/r/336831

Change 336831 merged by Filippo Giunchedi:
Update path of exporter jar to currently deployed version

https://gerrit.wikimedia.org/r/336831

Change 335826 merged by Filippo Giunchedi:
Enable JMX exporter on RESTBase Staging nodes in eqiad

https://gerrit.wikimedia.org/r/335826

Eevans edited the task description. Fri, Feb 10, 3:13 PM

Change 337034 had a related patch set uploaded (by Eevans):
Fix broken path to Prometheus exporter config

https://gerrit.wikimedia.org/r/337034

Change 337034 merged by Filippo Giunchedi:
Fix broken path to Prometheus exporter config

https://gerrit.wikimedia.org/r/337034

Change 337493 had a related patch set uploaded (by Eevans):
Enable Prometheus exporter on restbase1007 (canary)

https://gerrit.wikimedia.org/r/337493

Change 337493 merged by Filippo Giunchedi:
Enable Prometheus exporter on restbase1007 (canary)

https://gerrit.wikimedia.org/r/337493

Eevans moved this task from Backlog to In-Progress on the Cassandra board. Wed, Feb 15, 4:15 PM

Mentioned in SAL (#wikimedia-operations) [2017-02-15T16:16:35Z] <urandom> T155120: restarting Cassandra on restbase1007-a to enable Prometheus exporter (canary)

This is now deployed to restbase1007, and the 1007-a instance has been restarted to serve as a canary. I have the following running from two screen sessions (to approximate having two Prometheus collectors polling the agent):

while true; do curl http://10.64.0.202:7800/metrics 2>/dev/null && (echo; sleep `shuf -i 45-60 -n 1`; echo 'times up!!'); done

Unfortunately, this seems to result in a non-trivial increase in GC time: https://grafana.wikimedia.org/dashboard/snapshot/7HJrqkejCweP2x5WE76JFL0cKj1footD
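
For anyone repeating the canary test, a slightly more defensive variant of the polling loop might look like the sketch below. Unlike the one-liner above, it sleeps even when a request fails, and it totals a GC metric from each scrape so that successive values can be compared from the terminal. The function names are hypothetical. The metric name jvm_gc_collection_seconds_sum is an assumption: it is the name used by the Prometheus Java client's default JVM exports, but whether this exporter build and configuration expose it was not checked here.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical; the URL is the canary from this task.
METRICS_URL="${METRICS_URL:-http://10.64.0.202:7800/metrics}"

# Pick a random whole number of seconds in [min, max] (same shuf call as the one-liner).
jitter() {
    shuf -i "$1-$2" -n 1
}

# Sum the sample values of one metric (across all label sets) from
# Prometheus text-format input on stdin. Assumes no trailing timestamps,
# so the value is the last whitespace-separated field on each line.
sum_metric() {
    awk -v name="$1" '
        /^#/ { next }                    # skip HELP/TYPE comment lines
        {
            metric = $1
            sub(/\{.*/, "", metric)      # strip the {label="..."} part
            if (metric == name) total += $NF
        }
        END { printf "%.3f\n", total }
    '
}

# Scrape forever; sleep between scrapes whether or not the request succeeded.
poll_forever() {
    while true; do
        if body=$(curl -s --max-time 30 "$METRICS_URL"); then
            # Metric name is an assumption; see the note above.
            echo "gc_seconds_total=$(printf '%s\n' "$body" | sum_metric jvm_gc_collection_seconds_sum)"
        else
            echo "scrape failed" >&2
        fi
        sleep "$(jitter 45 60)"
    done
}
```

Sourcing this file and calling poll_forever from two terminals would approximate the two-collector setup described above.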

Change 338010 had a related patch set uploaded (by Eevans):
Revert "Enable Prometheus exporter on restbase1007 (canary)"

https://gerrit.wikimedia.org/r/338010

Change 338010 merged by Filippo Giunchedi:
Revert "Enable Prometheus exporter on restbase1007 (canary)"

https://gerrit.wikimedia.org/r/338010

Mentioned in SAL (#wikimedia-operations) [2017-02-17T15:26:08Z] <urandom> T155120: Restarting Cassandra on restbase1007-a.eqiad.wmnet to disable Prometheus exporter agent

Eevans edited the task description. Fri, Feb 17, 4:06 PM