Prometheus metrics storage for RESTBase dev environment
Closed, Resolved, Public

Assigned To

Authored By

	Eevans
	Jul 26 2017, 7:17 PM

Description

With Ops moving in the direction of Prometheus, our transition to a new cluster running Cassandra 3.x is an ideal time to make the cut-over. We should start by enabling the Prometheus JMX exporter in the dev environment, and creating the initial dashboard(s).

Enable exporter
Disable cassandra-metrics-collector when jmx_exporter is enabled
Enable -XX:+PerfDisableSharedMem in /etc/cassandra-{i}/jvm.options (an optimization that breaks cmcd)
Build preliminary dashboards
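
As a rough illustration of the jvm.options step above, the sketch below adds the -XX:+PerfDisableSharedMem flag to the text of a jvm.options file without duplicating it. The function name ensure_jvm_flag is hypothetical and not part of any existing tooling. In practice this change would go through Puppet rather than a script, and per the note above, enabling the flag breaks cmcd.

```python
def ensure_jvm_flag(options_text, flag="-XX:+PerfDisableSharedMem"):
    """Return options_text with `flag` present as an active (uncommented) line.

    Hypothetical helper for illustration only; it works on the file's text
    so the caller decides which per-instance jvm.options path to read/write.
    """
    lines = options_text.splitlines()

    # Already enabled: leave the content untouched.
    for line in lines:
        if line.strip() == flag:
            return options_text

    # Re-enable a commented-out copy if one exists, otherwise append the flag.
    for i, line in enumerate(lines):
        if line.strip().lstrip("#").strip() == flag:
            lines[i] = flag
            break
    else:
        lines.append(flag)

    return "\n".join(lines) + "\n"
```

A caller would read the instance's jvm.options, pass its contents through ensure_jvm_flag, and write the result back only if it changed. A Cassandra restart is then needed for the JVM flag to take effect.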

Details

Subject	Repo	Branch	Lines +/-
cassandra: Follow links (copy the link destination)	operations/puppet	production	+1 -0
Configure agent to export Cassandra histogram metrics	operations/puppet	production	+25 -2
cassandra: Link-in upgraded cassandra-metrics-collector jar	operations/puppet	production	+1 -1
Update collector to 4.1.0	operations/software/cassandra-metrics-collector	master	+1 -0
WIP: Disable cassandra-metrics-collector when Prometheus agent is enabled	operations/puppet	production	+26 -29
prometheus: increase jmx_exporter timeout	operations/puppet	production	+1 -0
restbase: Enable prometheus jmx exporter in dev environment	operations/puppet	production	+6 -0


Related Objects

Status	Assigned	Task
Resolved	Eevans	T160570 Cassandra 3.x Tracking
Resolved	Eevans	T169936 Services 2017/18 Q1 goal: Start gradual roll-out of Cassandra 3 & new schema to resolve storage scaling issues and OOM errors.
Resolved	Eevans	T171772 Prometheus metrics storage for RESTBase dev environment
Resolved	fgiunchedi	T173490 Provision prometheus instance for cassandra/services metrics collection

Event Timeline

Eevans created this task. Jul 26 2017, 7:17 PM

Restricted Application removed a project: Patch-For-Review. Jul 26 2017, 7:17 PM

Eevans removed a project: Wikimedia-Incident. Jul 26 2017, 7:17 PM

Change 367952 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Enable prometheus jmx exporter in dev environment

https://gerrit.wikimedia.org/r/367952

gerritbot added a project: Patch-For-Review. Jul 26 2017, 7:39 PM

Change 367952 merged by Dzahn:
[operations/puppet@production] restbase: Enable prometheus jmx exporter in dev environment

https://gerrit.wikimedia.org/r/367952

The JMX exporter is now running on all hosts in the dev environment.
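
One way to spot-check an exporter is to fetch its metrics page and parse the Prometheus text exposition format. The sketch below is a minimal, assumption-laden example: the URL passed to fetch_metrics is whatever host and port the exporter listens on (not stated in this task), and the function names are made up for illustration. The parser ignores HELP/TYPE comments and any trailing timestamps.

```python
import urllib.request


def parse_samples(text):
    """Parse Prometheus text-format output into {series: value}.

    `series` is the metric name plus its label set, exactly as exposed.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Label values may contain spaces, so split after the closing brace.
            cut = line.rindex("}") + 1
            series, rest = line[:cut], line[cut:]
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue  # malformed line without a value
        samples[series] = float(fields[0])  # fields[1], if present, is a timestamp
    return samples


def fetch_metrics(url, timeout=10):
    """Fetch and parse one exporter's metrics page (URL supplied by the caller)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_samples(resp.read().decode("utf-8"))
```

Calling fetch_metrics against each dev host and checking that the result is non-empty would be a quick sanity test that the agent is serving metrics.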

fgiunchedi created subtask T173490: Provision prometheus instance for cassandra/services metrics collection. Aug 17 2017, 9:48 AM

fgiunchedi closed subtask T173490: Provision prometheus instance for cassandra/services metrics collection as Resolved. Sep 7 2017, 9:43 AM

Eevans updated the task description. Sep 7 2017, 3:25 PM

Eevans added a parent task: T169936: Services 2017/18 Q1 goal: Start gradual roll-out of Cassandra 3 & new schema to resolve storage scaling issues and OOM errors.

Eevans mentioned this in T169939: End of August milestone: Cassandra 3 cluster in production. Sep 7 2017, 3:35 PM

Eevans updated the task description. Sep 7 2017, 3:57 PM

Change 377418 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: increase jmx_exporter timeout

https://gerrit.wikimedia.org/r/377418

Change 377418 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: increase jmx_exporter timeout

https://gerrit.wikimedia.org/r/377418

Change 378100 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] WIP: Disable cassandra-metrics-collector when Prometheus agent is enabled

https://gerrit.wikimedia.org/r/378100

A strawman for disabling/uninstalling cassandra-metrics-collector is here (not working). It turns out that doing it properly is a bit more disruptive than I imagined: the existing code assumes we want cmc everywhere, and configures it via cassandra::metrics in the profile shared by all Cassandra clusters. The approach in r/378100 is to set jmx_exporter_enabled for a cluster, and then use it to disable cmc. I'm not a fan of this approach though; I'd hoped to use Graphite and Prometheus side-by-side in the dev cluster, at least while we build out the new dashboards.

I wonder if the easiest thing wouldn't be to just teach cassandra-metrics-collector how to pause collection (presence of a file, an arg, env var, etc). Presumably once we've tested the exporter, and have some dashboards in place, we'll convert the rest of the production clusters to Prometheus, so we really just need something in the meantime.
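
To make the "presence of a file" idea concrete, here is a minimal sketch of a collection step that is skipped while a sentinel file exists. This is an illustration only, not how cassandra-metrics-collector implements it; the function name and the pause_file parameter are hypothetical.

```python
import os


def collect_once(collect, pause_file):
    """Run collect() unless pause_file exists.

    Returns True if collection ran, False if it was paused. The daemon keeps
    running either way, so removing the file resumes collection without a
    restart.
    """
    if os.path.exists(pause_file):
        return False
    collect()
    return True
```

A daemon would call collect_once on every interval. Creating the file pauses collection, and deleting it resumes on the next cycle.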

In reply to @Eevans (T171772#3609319):

Indeed, this approach seems the simplest. I agree it makes sense to run Graphite and Prometheus alongside each other during the migration.

Change 379302 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/software/cassandra-metrics-collector@master] Update collector to 4.1.0

https://gerrit.wikimedia.org/r/379302

Change 379302 merged by Eevans:
[operations/software/cassandra-metrics-collector@master] Update collector to 4.1.0

https://gerrit.wikimedia.org/r/379302

Change 379305 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Link-in upgraded cassandra-metrics-collector jar

https://gerrit.wikimedia.org/r/379305

Change 379305 merged by Dzahn:
[operations/puppet@production] cassandra: Link-in upgraded cassandra-metrics-collector jar

https://gerrit.wikimedia.org/r/379305

Eevans updated the task description. Sep 20 2017, 9:37 PM

Eevans updated the task description.

v4.1.0 of cassandra-metrics-collector is deployed to nodes running 3.x, and the restbase-ng nodes are paused (cmcd is running, but it is not collecting metrics). I found the services re-enabled before getting this pushed out, so I suspect that there are metrics in graphite that need cleaning up.

In reply to @Eevans (T171772#3622957), @mobrovac wrote:

I enabled it yesterday on prod_ng nodes, given that we are starting to use Cassandra 3 in production and need metrics for it. I will do so again today (@fgiunchedi OKed it).

In reply to @mobrovac (T171772#3623697):

OK, yeah, I missed that at the time. Re-enabling is just a matter of removing /etc/cassandra-metrics-collector/disable (which I can do if you haven't already).

EDIT: Looks like you did.

Mentioned in SAL (#wikimedia-operations) [2017-09-21T18:41:58Z] <urandom> T171772: Restarting Cassandra restbase-dev1004-a to apply locally hacked Prometheus exporter config

Mentioned in SAL (#wikimedia-operations) [2017-09-21T18:56:36Z] <urandom> T171772: Applying locally hacked Prometheus exporter config to RESTBase dev Cassandra instances

Change 379610 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Configure agent to export Cassandra histogram metrics

https://gerrit.wikimedia.org/r/379610

Change 379610 merged by Filippo Giunchedi:
[operations/puppet@production] Configure agent to export Cassandra histogram metrics

https://gerrit.wikimedia.org/r/379610

Change 380515 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Follow links (copy the link destination)

https://gerrit.wikimedia.org/r/380515

Change 380515 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: Follow links (copy the link destination)

https://gerrit.wikimedia.org/r/380515

Preliminary dashboards are here and here; closing as done.
