Upgrading to Cassandra 2.1.7 will leave us entirely without Cassandra-based metrics. Alternatives (both short- and long-term) need to be investigated and, if applicable, implemented.
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Invalid | | None | T93751 RFC: Next steps for long-term revision storage -- space needs, storage hierarchies |
| Resolved | | RobH | T93790 Expand RESTBase cluster capacity |
| Resolved | | MoritzMuehlenhoff | T100773 Switch to Linux 3.19 by default on jessie hosts |
| Resolved | | MoritzMuehlenhoff | T102234 Update restbase100[1-6] to the 3.19 kernel |
| Resolved | | fgiunchedi | T102015 put new restbase servers in service |
| Resolved | | fgiunchedi | T104208 alternative Cassandra metrics reporting |
Event Timeline
One option would be to implement a JMX-based collector in Java that writes to Graphite. I believe such a collector could be written in an afternoon (maybe two). The advantages of this approach are:
- 100% compatibility with existing graphs, thresholds, etc
- Uses Cassandra's JMX agent (the most heavily tested, reliable, and supported mechanism)
- Minimizes complexity (at least when compared to the Jolokia/Diamond approach)
I banged something together in this vein a couple of days ago, and we've been using it on an ad-hoc basis when the usual metrics reporting fails.
It is very simple and crude (pretty much the minimum viable product). Each invocation collects metrics from Cassandra's JMX interface and writes them out to Graphite using metric names compatible with the integrated reporter (i.e. compatible with current graphs/thresholds). I have been running it in a screen session as needed, in a loop with a 60-second sleep, but cron would be a better option. It does not seem to hurt to have both this and the integrated reporter running simultaneously.
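For readers unfamiliar with the collect-then-write flow described above, here is a minimal sketch of the idea. It is not the code from the cassandra-metrics-collector repo. The class name `JmxToGraphiteSketch`, the metric prefix `example.jvm.threading`, and the choice of the JVM's own `Threading` MBean are all assumptions made so the example runs without a Cassandra node. A real collector would presumably obtain its `MBeanServerConnection` from a remote JMX connector pointed at Cassandra's JMX port and send the resulting lines to Graphite's plaintext listener rather than printing them.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical illustration: read numeric JMX attributes and format them
// as Graphite plaintext lines ("<path> <value> <epoch-seconds>\n").
class JmxToGraphiteSketch {

    /** Formats a single sample in Graphite's plaintext protocol. */
    static String graphiteLine(String path, Number value, long epochSeconds) {
        return path + " " + value + " " + epochSeconds + "\n";
    }

    /** Reads the named attributes from one MBean; non-numeric values are skipped. */
    static List<String> collect(MBeanServerConnection conn, String objectName,
                                String prefix, long epochSeconds, String... attributes) {
        List<String> lines = new ArrayList<>();
        try {
            ObjectName name = new ObjectName(objectName);
            for (String attr : attributes) {
                Object value = conn.getAttribute(name, attr);
                if (value instanceof Number) {
                    lines.add(graphiteLine(prefix + "." + attr, (Number) value, epochSeconds));
                }
            }
        } catch (JMException | IOException e) {
            throw new IllegalStateException("JMX read failed for " + objectName, e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for a remote Cassandra JMX connection (assumption):
        // the local JVM's platform MBean server.
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        long now = System.currentTimeMillis() / 1000L;
        List<String> lines = collect(conn, "java.lang:type=Threading",
                "example.jvm.threading", now, "ThreadCount", "PeakThreadCount");
        for (String line : lines) {
            // A real collector would write these to a socket connected to Graphite.
            System.out.print(line);
        }
    }
}
```

Keeping the line formatting separate from the JMX read makes it easy to check that the emitted metric names match whatever the existing graphs and thresholds expect.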
Longer term, this could be made into a daemon with a scheduler, either as something that continues to run on each node, or centrally, where a single instance collects from all of the nodes.
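As a rough illustration of the daemon-with-a-scheduler idea, the sketch below schedules one collection task per node on a fixed interval using the JDK's `ScheduledExecutorService`. Everything here is hypothetical: the class name `CollectorDaemonSketch`, the node names, and the `collectFromNode` callback stand in for whatever the real collection step would be. It is one possible shape for the centralized variant, not a design that has been agreed on.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical illustration of a central collector that polls every node
// on a fixed interval.
class CollectorDaemonSketch {

    /**
     * Schedules one repeating collection task per node and returns the
     * scheduler so the caller can shut it down.
     */
    static ScheduledExecutorService start(List<String> nodes, long period, TimeUnit unit,
                                          Consumer<String> collectFromNode) {
        ScheduledExecutorService scheduler =
                Executors.newScheduledThreadPool(Math.max(1, nodes.size()));
        for (String node : nodes) {
            scheduler.scheduleAtFixedRate(() -> {
                try {
                    collectFromNode.accept(node);
                } catch (RuntimeException e) {
                    // An exception escaping the task would cancel all of its
                    // future runs, so report the failure and keep going.
                    System.err.println("collection failed for " + node + ": " + e);
                }
            }, 0, period, unit);
        }
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder host names (assumption), not real cluster members.
        List<String> nodes = Arrays.asList("node-a.example", "node-b.example");
        ScheduledExecutorService scheduler = start(nodes, 60, TimeUnit.SECONDS,
                node -> System.out.println("collecting from " + node));
        Thread.sleep(1000);      // let the first round run, then stop the demo
        scheduler.shutdownNow();
    }
}
```

Catching exceptions inside the task matters here because `scheduleAtFixedRate` suppresses subsequent executions once a run throws, which would silently stop metrics for that node.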
My github repo is here:
https://github.com/eevans/cassandra-metrics-collector
A dependency-bundled jar is here:
This is currently running on restbase100{1,2,6}.eqiad in infinite loops, on 60 second intervals, in screen sessions under my user.
After more failures of the integrated graphite reporter, this is now running on all 6 nodes, again from a shell loop in a screen session, as my user.
I believe @fgiunchedi is planning to properly deploy this; if that happens before I return on July 7, feel free to hijack my screen sessions and kill the processes running there.
@fgiunchedi with this changeset, a mvn deploy will push to the Archiva snapshot repo (current build already there: https://archiva.wikimedia.org/#artifact/org.wikimedia/cassandra-metrics-collector/1.0.0-SNAPSHOT). Hope that helps.
Change 223041 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: alternative metrics collector
Thanks @Eevans! See the related code review: https://gerrit.wikimedia.org/r/#/c/223041/. I think all that's missing is importing the jar into the git repo and deploying via Trebuchet.
Change 223570 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: use lock file with flock
Change 223795 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: don't cronspam cassandra-metrics-collector output
Change 223795 merged by Filippo Giunchedi:
cassandra: don't cronspam cassandra-metrics-collector output