Prometheus metrics storage for RESTBase dev environment
Closed, Resolved · Public

Description

With Ops moving in the direction of Prometheus, our transition to a new cluster running Cassandra 3.x is an ideal time to make the cut-over. We should start by enabling the Prometheus JMX exporter in the dev environment, and creating the initial dashboard(s).

  • Enable exporter
  • Disable cassandra-metrics-collector when jmx_exporter is enabled
  • Enable -XX:+PerfDisableSharedMem in /etc/cassandra-{i}/jvm.options (an optimization that breaks cmcd)
  • Build preliminary dashboards
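
The jvm.options step above can be sketched as an idempotent edit. This is only an illustration in Python, not the Puppet change used for this task: the helper name `ensure_jvm_option` is hypothetical, and it assumes the usual jvm.options layout of one option per line with `#` marking comments.

```python
def ensure_jvm_option(text: str, option: str) -> str:
    """Return jvm.options content with `option` present as an active line.

    Assumption: one JVM option per line, and lines starting with '#'
    are comments. A commented-out copy of the option does not count.
    """
    active = [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    if option in active:
        return text  # already enabled, nothing to change
    if text and not text.endswith("\n"):
        text += "\n"  # keep the appended option on its own line
    return text + option + "\n"


if __name__ == "__main__":
    sample = "# GC settings\n-Xms4G\n#-XX:+PerfDisableSharedMem\n"
    print(ensure_jvm_option(sample, "-XX:+PerfDisableSharedMem"))
```

Running the helper a second time on its own output returns the text unchanged, so it would be safe to apply repeatedly to each instance's jvm.options.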

Event Timeline

Eevans created this task. Jul 26 2017, 7:17 PM

Change 367952 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Enable prometheus jmx exporter in dev environment

https://gerrit.wikimedia.org/r/367952

Change 367952 merged by Dzahn:
[operations/puppet@production] restbase: Enable prometheus jmx exporter in dev environment

https://gerrit.wikimedia.org/r/367952

The JMX exporter is now running on all hosts in the dev environment.

Eevans updated the task description. Sep 7 2017, 3:57 PM

Change 377418 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: increase jmx_exporter timeout

https://gerrit.wikimedia.org/r/377418

Change 377418 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: increase jmx_exporter timeout

https://gerrit.wikimedia.org/r/377418

Change 378100 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] WIP: Disable cassandra-metrics-collector when Prometheus agent is enabled

https://gerrit.wikimedia.org/r/378100

A strawman for disabling/uninstalling cassandra-metrics-collector is here (not working). It turns out that doing it properly is a little more disruptive than I imagined: the existing code assumes we want cmc everywhere, and configures it via cassandra::metrics in the profile shared by all Cassandra clusters. The approach in r/378100 is to set jmx_exporter_enabled for a cluster, and then use that setting to disable cmc. I'm not a fan of this approach, though; I'd hoped to use Graphite and Prometheus side by side in the dev cluster, at least while we build out the new dashboards.

I wonder if the easiest thing wouldn't be to just teach cassandra-metrics-collector how to pause collection (presence of a file, an arg, an env var, etc.). Presumably, once we've tested the exporter and have some dashboards in place, we'll convert the rest of the production clusters to Prometheus, so we really just need something in the meantime.
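
A minimal sketch of the pause idea, assuming a marker file or an environment variable is checked before each collection cycle. This is illustrative Python, not the collector's actual code; the names `collection_paused`, `run_cycle`, and `CMC_PAUSED` are hypothetical.

```python
import os
from pathlib import Path


def collection_paused(marker: Path, env_var: str = "CMC_PAUSED", environ=None) -> bool:
    """Report whether collection should be skipped.

    Assumption: collection is paused when the marker file exists or when
    the (hypothetical) environment variable is set to "1".
    """
    env = os.environ if environ is None else environ
    return marker.exists() or env.get(env_var, "") == "1"


def run_cycle(collect, marker: Path, environ=None):
    """Run one collection cycle unless paused.

    Returns whatever `collect()` returns, or None when paused. The daemon
    keeps running either way; only the collection step is skipped.
    """
    if collection_paused(marker, environ=environ):
        return None
    return collect()
```

Checking the marker on every cycle means the daemon never needs a restart: creating the file pauses collection from the next cycle, and removing it resumes collection.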

Indeed, this approach seems the simplest. I agree it makes sense to run Graphite and Prometheus side by side during the migration.

Change 379302 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/software/cassandra-metrics-collector@master] Update collector to 4.1.0

https://gerrit.wikimedia.org/r/379302

Change 379302 merged by Eevans:
[operations/software/cassandra-metrics-collector@master] Update collector to 4.1.0

https://gerrit.wikimedia.org/r/379302

Change 379305 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Link-in upgraded cassandra-metrics-collector jar

https://gerrit.wikimedia.org/r/379305

Change 379305 merged by Dzahn:
[operations/puppet@production] cassandra: Link-in upgraded cassandra-metrics-collector jar

https://gerrit.wikimedia.org/r/379305

Eevans updated the task description. Sep 20 2017, 9:37 PM
Eevans updated the task description.

v4.1.0 of cassandra-metrics-collector is deployed to nodes running 3.x, and the restbase-ng nodes are paused (cmcd is running, but it is not collecting metrics). I found the services re-enabled before getting this pushed out, so I suspect that there are metrics in graphite that need cleaning up.

v4.1.0 of cassandra-metrics-collector is deployed to nodes running 3.x, and the restbase-ng nodes are paused (cmcd is running, but it is not collecting metrics). I found the services re-enabled before getting this pushed out, so I suspect that there are metrics in graphite that need cleaning up.

I enabled it yesterday on prod_ng nodes given that we are starting to use Cass 3 in production and need metrics for it. I will do so again today (@fgiunchedi OKed it).

Eevans added a comment. Edited · Sep 21 2017, 2:01 PM

OK, yeah, I missed that at the time; re-enabling is just a matter of rm'ing /etc/cassandra-metrics-collector/disable (which I can do if you haven't already).

EDIT: Looks like you did.

Mentioned in SAL (#wikimedia-operations) [2017-09-21T18:41:58Z] <urandom> T171772: Restarting Cassandra restbase-dev1004-a to apply locally hacked Prometheus exporter config

Mentioned in SAL (#wikimedia-operations) [2017-09-21T18:56:36Z] <urandom> T171772: Applying locally hacked Prometheus exporter config to RESTBase dev Cassandra instances

Change 379610 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Configure agent to export Cassandra histogram metrics

https://gerrit.wikimedia.org/r/379610

Change 379610 merged by Filippo Giunchedi:
[operations/puppet@production] Configure agent to export Cassandra histogram metrics

https://gerrit.wikimedia.org/r/379610

Change 380515 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Follow links (copy the link destination)

https://gerrit.wikimedia.org/r/380515

Change 380515 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: Follow links (copy the link destination)

https://gerrit.wikimedia.org/r/380515

Eevans closed this task as Resolved. Oct 2 2017, 3:46 PM
Eevans updated the task description.

Preliminary dashboards are here and here; closing as done.