
Enable Prometheus metrics export for Cassandra
Closed, Resolved · Public

Description

Prometheus would provide a number of benefits for us over Graphite; most significantly, a more straightforward way to share dashboards across clusters (in the Prometheus data model, the cluster is a label on each metric, which can easily be templated in Grafana).

Plan A

The easiest way to set this up would seem to be jmx_exporter, a JVM agent that spins up its own in-process HTTP server to export metrics. I have (lightly) tested this in deployment-prep, and it seems to work well. One added benefit of this approach would be that we could eliminate cassandra-metrics-collector (one less application to maintain, and one less moving part on each host).

Next steps:

  • Fork https://github.com/prometheus/jmx_exporter to the Wikimedia account, and tag a release
  • Upload a build to Archiva
  • Get a deployment repository set up
  • Puppetize the loading of the agent, and Ferm rules
  • Push to Staging and deployment-prep for further evaluation
  • Evaluate the impact of cmcd (cassandra-metrics-collector) collection in isolation

See also: https://github.com/prometheus/jmx_exporter/issues/113
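For reference, loading the agent amounts to appending a -javaagent flag to Cassandra's JVM options. A minimal sketch, with hypothetical paths (the actual jar and config locations would come from the deploy repository and puppetization described above):

```shell
# Hypothetical paths; the real locations are determined by the deploy
# repository layout and the puppetization above.
JAR=/srv/deployment/prometheus/jmx_exporter/jmx_prometheus_javaagent.jar
CONFIG=/etc/cassandra/prometheus_jmx_exporter.yaml
PORT=7800  # the port scraped later in this task

# The agent starts its own HTTP server inside the Cassandra JVM,
# serving /metrics on the given port.
JVM_OPTS="$JVM_OPTS -javaagent:${JAR}=${PORT}:${CONFIG}"
echo "$JVM_OPTS"
```

The agent argument format is port:config, where the config YAML controls which MBeans are exported and how they are renamed.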

Plan B

Another approach would be to add Prometheus support to cassandra-metrics-collector, allowing it to simultaneously ship metrics to Graphite and export Prometheus metrics via HTTP. This would provide two main benefits: a) it would run outside of the Cassandra process (potentially saving some GC pressure), and b) it could export a copy of the metrics cached from the last Graphite collection, sparing Cassandra the load of any additional polling.
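Concretely, the HTTP endpoint in either plan serves the Prometheus text exposition format. The metric name and labels below are invented for illustration, not actual Cassandra metric names:

```shell
# Illustrative only: the text exposition format such an endpoint serves.
# The metric name and labels here are made up for the example.
METRICS=$(cat <<'EOF'
# HELP cassandra_read_latency_seconds Read latency (hypothetical example)
# TYPE cassandra_read_latency_seconds gauge
cassandra_read_latency_seconds{cluster="restbase",instance="restbase1007-a"} 0.0042
EOF
)
echo "$METRICS"
```

Note the cluster label: this is what makes per-cluster dashboard templating in Grafana straightforward, as described above.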

While the amount of development effort needed to add Prometheus support would be quite small, this further entrenches a piece of software that we are required to maintain; it should therefore be pursued only in the event that Plan A is unsuccessful.

Event Timeline

Eevans added a subscriber: fgiunchedi.

Change 331911 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: add jmx_exporter to Cassandra in deployment-prep

https://gerrit.wikimedia.org/r/331911

A request for a deployment repository (operations/software/prometheus_jmx_exporter) has been submitted: https://www.mediawiki.org/wiki/Git/New_repositories/Requests

Change 331911 merged by Filippo Giunchedi:
cassandra: add jmx_exporter to Cassandra in deployment-prep

https://gerrit.wikimedia.org/r/331911

Change 332535 had a related patch set uploaded (by Eevans):
WIP: Enable Prometheus JMX exporter on Cassandra nodes

https://gerrit.wikimedia.org/r/332535

Change 332542 had a related patch set uploaded (by Eevans):
Prometheus JMX exporter deploy repository

https://gerrit.wikimedia.org/r/332542

Change 332682 had a related patch set uploaded (by Eevans):
fix incorrect port in ferm rule

https://gerrit.wikimedia.org/r/332682

Change 332682 merged by Filippo Giunchedi:
fix incorrect port in ferm rule

https://gerrit.wikimedia.org/r/332682

Change 332542 merged by Eevans:
Prometheus JMX exporter deploy repository

https://gerrit.wikimedia.org/r/332542

Change 332535 merged by Filippo Giunchedi:
Enable Prometheus JMX exporter on Cassandra nodes

https://gerrit.wikimedia.org/r/332535

Change 335826 had a related patch set uploaded (by Eevans):
Enable JMX exporter on RESTBase Staging nodes in eqiad

https://gerrit.wikimedia.org/r/335826

Change 336831 had a related patch set uploaded (by Eevans):
Update path of exporter jar to currently deployed version

https://gerrit.wikimedia.org/r/336831

Change 336831 merged by Filippo Giunchedi:
Update path of exporter jar to currently deployed version

https://gerrit.wikimedia.org/r/336831

Change 335826 merged by Filippo Giunchedi:
Enable JMX exporter on RESTBase Staging nodes in eqiad

https://gerrit.wikimedia.org/r/335826

Change 337034 had a related patch set uploaded (by Eevans):
Fix broken path to Prometheus exporter config

https://gerrit.wikimedia.org/r/337034

Change 337034 merged by Filippo Giunchedi:
Fix broken path to Prometheus exporter config

https://gerrit.wikimedia.org/r/337034

Change 337493 had a related patch set uploaded (by Eevans):
Enable Prometheus exporter on restbase1007 (canary)

https://gerrit.wikimedia.org/r/337493

Change 337493 merged by Filippo Giunchedi:
Enable Prometheus exporter on restbase1007 (canary)

https://gerrit.wikimedia.org/r/337493

Mentioned in SAL (#wikimedia-operations) [2017-02-15T16:16:35Z] <urandom> T155120: restarting Cassandra on restbase1007-a to enable Prometheus exporter (canary)

This is now deployed to restbase1007, and the 1007-a instance has been restarted to serve as a canary. I have the following running from two screen sessions (to approximate having two Prometheus collectors polling the agent):

while true; do
  curl http://10.64.0.202:7800/metrics 2>/dev/null \
    && (echo; sleep "$(shuf -i 45-60 -n 1)"; echo 'times up!!')
done

Unfortunately, this seems to result in a non-trivial increase in GC collection time: https://grafana.wikimedia.org/dashboard/snapshot/7HJrqkejCweP2x5WE76JFL0cKj1footD

Change 338010 had a related patch set uploaded (by Eevans):
Revert "Enable Prometheus exporter on restbase1007 (canary)"

https://gerrit.wikimedia.org/r/338010

Change 338010 merged by Filippo Giunchedi:
Revert "Enable Prometheus exporter on restbase1007 (canary)"

https://gerrit.wikimedia.org/r/338010

Mentioned in SAL (#wikimedia-operations) [2017-02-17T15:26:08Z] <urandom> T155120: Restarting Cassandra on restbase1007-a.eqiad.wmnet to disable Prometheus exporter agent

The plan here was to enable the exporter alongside the existing metrics collection, and slowly (incrementally) transition to it. Once we had production-ready scraping and storage, and dashboards in place, we could consider deprecating our graphite metrics. On the single node tested, I did not observe any noticeable increase in utilization or latency, but I'm not sure I'm comfortable moving forward like this knowing that it's adding GC pressure.

One thing worth considering is that what we're seeing here may simply be the cost of iterating over all of these MBeans and serializing the results. Before abandoning this approach, it might be worth testing the impact of our existing JMX metrics collection in isolation. If the impact is similar, then we could weigh the option of moving forward, perhaps with a better coordinated and more aggressive timeline for the deprecation of cmcd collection. In other words, if replacing cmcd with the Prometheus exporter is a net-zero change, then perhaps we could live with the higher GC pressure for the duration of the migration (and we could always consider lowering the scrape frequency during this period).
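Lowering the scrape frequency is a per-job setting on the Prometheus server side. A hypothetical fragment (the job name, interval, and target are illustrative, not our actual configuration):

```shell
# Hypothetical Prometheus scrape config illustrating a lowered scrape
# frequency during a migration window; values are examples only.
SCRAPE_CONFIG=$(cat <<'EOF'
scrape_configs:
  - job_name: cassandra
    scrape_interval: 120s   # lowered to reduce the per-scrape MBean-walk cost
    static_configs:
      - targets: ['restbase1007.eqiad.wmnet:7800']
EOF
)
echo "$SCRAPE_CONFIG"
```

Each scrape walks the full MBean tree, so halving the scrape frequency roughly halves the exporter's amortized cost on the Cassandra JVM.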

Change 342825 had a related patch set uploaded (by Elukey):
[operations/puppet] Update prometheus jmx_exporter path in deployment-prep

https://gerrit.wikimedia.org/r/342825

Change 342825 merged by Elukey:
[operations/puppet] Update prometheus jmx_exporter path in deployment-prep

https://gerrit.wikimedia.org/r/342825

Change 342829 had a related patch set uploaded (by Elukey):
[operations/puppet] Update Cassandra jmx_exporter config path in deployment-prep

https://gerrit.wikimedia.org/r/342829

Change 342829 merged by Elukey:
[operations/puppet] Update Cassandra jmx_exporter config path in deployment-prep

https://gerrit.wikimedia.org/r/342829

Based on the additional overhead observed in T164093 (there, the result of collecting against both the Table and ColumnFamily MBeans), I'm reasonably convinced that there isn't anything out of the ordinary about the overhead observed here; I think this is simply the cost associated with collecting and serializing all of these metrics.

If we are concerned about incurring this cost twice (once for graphite, and again for prometheus), we could consider piggy-backing on the plans to deploy Cassandra 3.x and redesigned storage to a separate cluster, and make the transition to prometheus as part of that rollout. This would allow Ops the opportunity to gradually expand prometheus capacity as well.

GWicke edited projects, added Services (doing); removed Services.

@Eevans, is there anything left to do here?

I think we can call it done.