Prometheus would provide a number of benefits for us over Graphite. Most significantly, it offers a more straightforward way to share dashboards across clusters: in the Prometheus data model, the cluster is an attribute (label) of each metric, which can be easily templated in Grafana.
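To illustrate what "cluster as an attribute" means in practice, here is a minimal sketch that renders one sample in the Prometheus text exposition format with the cluster carried as a label. The metric name, label names, and values are hypothetical, chosen only for illustration; they are not what any exporter here actually emits.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline, as the text format requires."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample line: name{label="value",...} value."""
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}" if pairs else f"{name} {value}"


# Hypothetical metric and label values, for illustration only.
line = render_sample(
    "cassandra_example_read_latency_count",
    {"cluster": "example-cluster", "instance": "host1"},
    42,
)
print(line)
```

Because the cluster is just another label, a single Grafana dashboard could select on something like `cluster="$cluster"` with a template variable, rather than needing one dashboard per cluster-specific Graphite path.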
== Plan A ==
The easiest way to set this up would seem to be [[ https://github.com/prometheus/jmx_exporter | jmx_exporter ]], a JVM agent that spins up its own in-process HTTP server to export metrics. I have (lightly) tested this in deployment-prep, and it seems to work well. One added benefit of this approach is that we could eliminate [[ https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/cassandra-metrics-collector | cassandra-metrics-collector ]] (one fewer application to maintain, and one fewer moving part on each host).
Next steps:
- [x] Fork https://github.com/prometheus/jmx_exporter to the Wikimedia account, and tag a release
- [x] Upload a build to Archiva
- [x] Get a deployment repository set up
- [x] Puppetize the loading of the agent, and Ferm rules
- [x] Push to Staging and deployment-prep for further evaluation
- [ ] Evaluate impact of [[ https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/cassandra-metrics-collector | cmcd ]] collection in isolation
See also: https://github.com/prometheus/jmx_exporter/issues/113
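For the evaluation steps above, one rough way to gauge how much the agent exports is to count the samples per metric name in a single scrape. The sketch below is only an assumption about how one might do that, not part of jmx_exporter: `count_samples` is a hypothetical helper that takes the text of a scrape (however it was fetched) and skips comment lines.

```python
from collections import Counter


def count_samples(exposition_text):
    """Count sample lines per metric name in Prometheus text-format output."""
    counts = Counter()
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or whitespace (value).
        name = line.split("{", 1)[0].split(None, 1)[0]
        counts[name] += 1
    return counts


# Hypothetical scrape output, for illustration only.
example = (
    "# HELP example_metric An example.\n"
    'example_metric{cluster="a"} 1\n'
    'example_metric{cluster="b"} 2\n'
    "other_metric 3\n"
)
counts = count_samples(example)
print(counts.most_common())
```

Comparing these counts before and after changing the agent's configuration would give a crude sense of how much each scrape asks of Cassandra.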
== Plan B ==
Another approach would be to add Prometheus support to [[ https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/cassandra-metrics-collector | cassandra-metrics-collector ]], allowing it to support both the transport of metrics to Graphite and the export of Prometheus metrics via HTTP at the same time. This would provide two main benefits: a) it would run outside of the Cassandra process (potentially saving some GC pressure), and b) it could export a copy of the metrics cached from the last Graphite collection, saving Cassandra from the load of any additional polling.
While the development effort needed to add Prometheus support would be quite small, it would further entrench a piece of software that we are required to maintain, so this approach should be used only if Plan A is unsuccessful.
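The caching idea in Plan B could look roughly like the following. This is a sketch in Python for brevity, not a proposal for how cassandra-metrics-collector would actually implement it; `MetricsCache`, `make_handler`, the listen address, and the sample name are all hypothetical. The collection loop would call `update()` after each Graphite run, and scrapes would only ever read the cached copy, so they add no polling load on Cassandra.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetricsCache:
    """Holds the samples from the most recent collection run."""

    def __init__(self):
        self._lock = threading.Lock()
        self._samples = {}

    def update(self, samples):
        """Replace the cached samples (called after each collection)."""
        with self._lock:
            self._samples = dict(samples)

    def render(self):
        """Render the cached samples as Prometheus text-format lines."""
        with self._lock:
            samples = dict(self._samples)
        return "".join(f"{name} {value}\n" for name, value in sorted(samples.items()))


def make_handler(cache):
    """Build a request handler that serves the cache at /metrics."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = cache.render().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # Keep the sketch quiet; real code would log requests.

    return Handler


if __name__ == "__main__":
    cache = MetricsCache()
    # Hypothetical sample; the real collector would fill this after each run.
    cache.update({'cassandra_example_read_latency_count{cluster="example"}': 42})
    # Hypothetical listen address and port.
    HTTPServer(("127.0.0.1", 8000), make_handler(cache)).serve_forever()
```

The trade-off is staleness: a scrape returns whatever the last Graphite collection saw, so the effective resolution in Prometheus would be bounded by the collection interval.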