
Add the prometheus jmx agent to AQS Cassandra
Closed, Resolved | Public | 8 Estimated Story Points

Description

This work should be the same as what the services team did to migrate their metrics to Prometheus.

Event Timeline

elukey renamed this task from Move AQS Cassandra daemons to use the Prometheus JMX agent to Add the prometheus jmx agent to AQS Cassandra. Jan 12 2018, 11:59 AM
elukey created this task.

Change 413405 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::aqs: enable Cassandra JMX exporter

https://gerrit.wikimedia.org/r/413405

As suggested by Eric, I forced a git fat pull to update the Prometheus JMX exporter jar:

elukey@neodymium:~$ sudo cumin 'aqs*' 'ls -l /srv/deployment/prometheus/jmx_exporter/lib/jmx_prometheus_javaagent-0.8-20170117.190412-1.jar'
6 hosts will be targeted:
aqs[1004-1009].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(2) aqs[1007,1009].eqiad.wmnet
----- OUTPUT of 'ls -l /srv/deplo...117.190412-1.jar' -----
-rw-r--r-- 1 deploy-service deploy-service 1241757 Apr 19  2017 /srv/deployment/prometheus/jmx_exporter/lib/jmx_prometheus_javaagent-0.8-20170117.190412-1.jar
===== NODE GROUP =====
(4) aqs[1004-1006,1008].eqiad.wmnet
----- OUTPUT of 'ls -l /srv/deplo...117.190412-1.jar' -----
-rw-r--r-- 1 deploy-service deploy-service 1241757 Feb 23 08:03 /srv/deployment/prometheus/jmx_exporter/lib/jmx_prometheus_javaagent-0.8-20170117.190412-1.jar

I am wondering whether this step will still be needed after https://gerrit.wikimedia.org/r/#/c/402069/.
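
The listing above shows the same file size on every host but two different modification times, so `ls -l` alone cannot confirm that the jar content is identical. A minimal Python sketch of a checksum comparison follows, assuming it would be run locally on each host (for example through cumin). The function name `sha256_of` is illustrative and not part of any existing tooling.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large jars are never loaded fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If every host reports the same digest for the jar, the differing timestamps are only a side effect of when the file was fetched.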

Change 413405 merged by Elukey:
[operations/puppet@production] role::aqs: enable Cassandra JMX exporter

https://gerrit.wikimedia.org/r/413405

Mentioned in SAL (#wikimedia-operations) [2018-02-27T16:53:14Z] <elukey> restart cassandra-a on aqs1004 to test the prometheus jmx agent before complete rollout - T184795

Sad news: we had to roll back due to an issue with the Cassandra 2.2.x startup script:

https://issues.apache.org/jira/browse/CASSANDRA-7254

https://github.com/apache/cassandra/blob/cassandra-2.2.6/bin/cassandra#L261

Because of JVM_OPTS, the above line also starts the new JMX javaagent, which binds to port 7800 and waits for data. As a result, the Cassandra startup gets stuck as well :)
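
To illustrate the failure mode: any helper JVM that a startup script launches with the full JVM_OPTS also loads the `-javaagent` option and tries to serve on the same port. One generic way to avoid this class of problem is to drop `-javaagent` options before launching short-lived helper JVMs. This is only a sketch of the idea, not what was done in this task and not code from the Cassandra scripts. The function name `strip_javaagent` is hypothetical.

```python
import shlex


def strip_javaagent(jvm_opts):
    """Return `jvm_opts` with every -javaagent:... option removed."""
    kept = [
        opt for opt in shlex.split(jvm_opts)
        if not opt.startswith("-javaagent:")
    ]
    # Re-quote each option so the result is safe to pass back to a shell.
    return " ".join(shlex.quote(opt) for opt in kept)
```

For example, with a made-up option string such as `-Xms1g -javaagent:/opt/agent.jar=7800:/etc/agent.yaml -Dfoo=bar` (the paths are placeholders), the helper would keep only `-Xms1g -Dfoo=bar`.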

@MoritzMuehlenhoff: would it be worthwhile, in your opinion, to create a cassandra 2.2 component rather than relying on thirdparty? As far as I can see, cassandra 2.2.6 is in jessie-wikimedia/thirdparty.

thirdparty/foo is for packages we sync from external repositories (as such the packages in jessie-wikimedia are misplaced, we have some cruft there) while the cassandra packages are built by Eric, so creating a component/cassandra22 makes sense.

Change 421241 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] cassandra: upgrade version 2.2 package settings

https://gerrit.wikimedia.org/r/421241

Change 421878 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::aqs: enable jmx agent

https://gerrit.wikimedia.org/r/421878

Change 421241 merged by Elukey:
[operations/puppet@production] cassandra: upgrade version 2.2 package settings for aqs

https://gerrit.wikimedia.org/r/421241

Change 421878 merged by Elukey:
[operations/puppet@production] role::aqs: enable jmx agent

https://gerrit.wikimedia.org/r/421878

Change 422103 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::prometheus::analytics: poll cassandra aqs metrics

https://gerrit.wikimedia.org/r/422103

elukey set the point value for this task to 8. Mar 27 2018, 9:57 AM

Change 422103 merged by Elukey:
[operations/puppet@production] role::prometheus::analytics: poll cassandra aqs metrics

https://gerrit.wikimedia.org/r/422103
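
Once Prometheus polls the agent, a quick sanity check is to read the agent's plain-text output by hand. Below is a minimal Python sketch of a parser for the Prometheus text exposition format, assuming the text has already been fetched. The function name `parse_metrics` and the sample metric names in the usage note are hypothetical, not actual AQS Cassandra metrics.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into a {series: value} dict."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            series, *rest = line.split()
        if rest:
            # rest[0] is the value; an optional trailing timestamp is ignored.
            samples[series] = float(rest[0])
    return samples
```

Feeding it a line such as `example_total 3` would produce the key `example_total` with the value `3.0`. An empty result for a host suggests the agent is not exporting anything.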

elukey moved this task from In Progress to Done on the Analytics-Kanban board.
238482n375 triaged this task as Lowest priority.
238482n375 moved this task from Done to In Code Review on the Analytics-Kanban board.
238482n375 edited subscribers, added: 238482n375; removed: Aklapper.
