alternative Cassandra metrics reporting
Closed, ResolvedPublic

Description

Upgrading to Cassandra 2.1.7 will leave us entirely without Cassandra-based metrics. Alternatives (both short- and long-term) need to be investigated and, if applicable, implemented.

Event Timeline

Eevans assigned this task to fgiunchedi.
Eevans raised the priority of this task from to Medium.
Eevans updated the task description.
Eevans added subscribers: Matanya, Aklapper, GWicke and 2 others.

One option would be to implement a JMX-based collector, written in Java, that writes to Graphite. I believe such a collector could be written in an afternoon (maybe two). The advantages of this approach are:

  • 100% compatibility with existing graphs, thresholds, etc.
  • Uses Cassandra's JMX agent (the most heavily tested, reliable, and supported mechanism)
  • Minimizes complexity (at least when compared to the Jolokia/Diamond approach)

I banged something together in this vein a couple of days ago, and we've been using it on an ad-hoc basis when the usual metrics reporting fails.

It is very simple/crude (pretty much the minimum viable product). Each invocation collects metrics from Cassandra's JMX interface and writes them out to Graphite using metric names compatible with the integrated reporter (i.e. compatible with current graphs/thresholds). I have been running it in a screen session as needed, in a loop with a 60-second sleep, but cron would be a better option. It does not seem to hurt to have both this and the integrated reporter running simultaneously.
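To make the collect-and-write cycle described above concrete, here is a minimal sketch of the general technique. It is not the code from the repo linked in this thread. The class name, method names, and metric path are hypothetical, and the demo reads a standard JVM MBean from the local platform MBean server so that it runs without a Cassandra node. A real collector would instead connect to Cassandra's JMX agent (port 7199 by default) and use the integrated reporter's metric names, which are not reproduced here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.lang.management.ManagementFactory;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

/** Hypothetical sketch: read a numeric JMX attribute and emit it in Graphite's plaintext format. */
class JmxToGraphiteSketch {

    /** Reads one numeric attribute from an MBean; wraps any JMX failure in an unchecked exception. */
    static Number readNumber(MBeanServerConnection conn, String objectName, String attribute) {
        try {
            return (Number) conn.getAttribute(new ObjectName(objectName), attribute);
        } catch (Exception e) {
            throw new IllegalStateException("cannot read " + objectName + "/" + attribute, e);
        }
    }

    /** Graphite plaintext protocol: one "<path> <value> <unix-seconds>" line per data point. */
    static String graphiteLine(String path, Number value, long epochSeconds) {
        return String.format(Locale.ROOT, "%s %s %d\n", path, value, epochSeconds);
    }

    /** Sends already-formatted lines to a carbon plaintext listener (conventionally TCP port 2003). */
    static void send(String host, int port, String payload) throws IOException {
        try (Socket socket = new Socket(host, port);
             OutputStream out = socket.getOutputStream()) {
            out.write(payload.getBytes(StandardCharsets.US_ASCII));
            out.flush();
        }
    }

    public static void main(String[] args) {
        // Assumption for the demo: query this JVM's own MBean server instead of a remote Cassandra node.
        Number threads = readNumber(ManagementFactory.getPlatformMBeanServer(),
                "java.lang:type=Threading", "ThreadCount");
        long now = System.currentTimeMillis() / 1000L;
        // Print rather than send, so the sketch runs without a Graphite host.
        System.out.print(graphiteLine("example.jvm.threading.ThreadCount", threads, now));
    }
}
```

Running one such invocation per minute, whether from a shell loop or from cron, gives the behaviour described above. The metric path is the part that has to match the integrated reporter for existing graphs to keep working.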

Longer term, this could be made into a daemon with a scheduler, either as something that continues to run on the node, or centrally where a single instance can collect from each of the nodes.
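As a rough illustration of the daemon-with-scheduler idea, the sketch below uses the JDK's ScheduledExecutorService to run a collection task at a fixed interval. The class and method names are hypothetical and the collection task is a placeholder; nothing here is taken from the actual collector. One detail worth noting is that scheduleAtFixedRate stops scheduling a task once a run throws, so the sketch wraps the task so that a single failed collection does not end reporting.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: run a metrics collection task periodically inside a long-lived process. */
class CollectorSchedulerSketch {

    /** Schedules collect to run immediately and then once per period, surviving failed runs. */
    static ScheduledFuture<?> schedule(ScheduledExecutorService ses, Runnable collect,
                                       long period, TimeUnit unit) {
        Runnable guarded = () -> {
            try {
                collect.run();
            } catch (RuntimeException e) {
                // Swallow and log: an uncaught exception would cancel all future runs.
                System.err.println("collection failed: " + e);
            }
        };
        return ses.scheduleAtFixedRate(guarded, 0, period, unit);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        // Demo values only: a real daemon would use a 60-second period and never shut down.
        schedule(ses, () -> System.out.println("collecting at " + System.currentTimeMillis()),
                1, TimeUnit.SECONDS);
        Thread.sleep(2500);
        ses.shutdownNow();
    }
}
```

For the centralized variant, the same scheduler could run one task per node, each connecting to that node's JMX agent; whether one thread is enough would depend on how long each collection takes.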

My GitHub repo is here:

https://github.com/eevans/cassandra-metrics-collector

A dependency-bundled jar is here:

This is currently running on restbase100{1,2,6}.eqiad in infinite loops, at 60-second intervals, in screen sessions under my user.

After more failures of the integrated Graphite reporter, this is now running on all 6 nodes, again from a shell loop in a screen session, as my user.

I believe @fgiunchedi is planning to deploy this properly. If that happens before I return on July 7, feel free to hijack my screen sessions and kill the processes running there.

@fgiunchedi with this changeset, a mvn deploy will push to the Archiva snapshot repo (the current build is already there: https://archiva.wikimedia.org/#artifact/org.wikimedia/cassandra-metrics-collector/1.0.0-SNAPSHOT). Hope that helps.

Change 223041 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: alternative metrics collector

https://gerrit.wikimedia.org/r/223041

Thanks @Eevans! See the related code review: https://gerrit.wikimedia.org/r/#/c/223041/. I think the only things still missing are importing the jar into the git repo and deploying via Trebuchet.

Change 223041 merged by Filippo Giunchedi:
cassandra: alternative metrics collector

https://gerrit.wikimedia.org/r/223041

Change 223570 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: use lock file with flock

https://gerrit.wikimedia.org/r/223570

Change 223570 merged by Filippo Giunchedi:
cassandra: use lock file with flock

https://gerrit.wikimedia.org/r/223570

This is complete; metrics are being pushed every minute from cron.

Change 223795 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: don't cronspam cassandra-metrics-collector output

https://gerrit.wikimedia.org/r/223795

Change 223795 merged by Filippo Giunchedi:
cassandra: don't cronspam cassandra-metrics-collector output

https://gerrit.wikimedia.org/r/223795