
Prometheus metrics missing for some hosts
Closed, Resolved | Public

Description

On some nodes of the RESTBase Cassandra cluster, the Prometheus metrics that come from the JMX exporter are missing.

Event Timeline

Eevans triaged this task as High priority. Apr 18 2018, 2:36 PM

Mentioned in SAL (#wikimedia-operations) [2018-04-18T14:37:00Z] <urandom> restarting Cassandra, restbase1011-a -- T192456

Mentioned in SAL (#wikimedia-operations) [2018-04-18T14:55:19Z] <urandom> restarting Cassandra, restbase1011-a to test v 0.8 of Prometheus JMX exporter -- T192456

On the affected machines, running curl against the exporter URL hangs indefinitely. I restarted 1011-a, to no avail. I then live-hacked cassandra-env.sh to roll the exporter jar back to the 0.8 version we used before, and it is now working. More investigation is needed.
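For anyone checking other hosts for the same symptom, a probe with an explicit timeout avoids blocking on a hung exporter the way a bare curl does. The following is only a minimal sketch: the function name `probe_exporter` and the 5 second default timeout are my own choices, and the exporter URL has to be supplied by the caller, since the real host and port are not recorded in this task.

```python
import http.client
import socket
import urllib.error
import urllib.request


def probe_exporter(url, timeout=5.0):
    """Classify an exporter endpoint as 'ok', 'timeout', or 'error'.

    'timeout' corresponds to the symptom described in this task: the
    connection is accepted but no response arrives within `timeout` seconds.
    The URL is an assumption supplied by the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
            # A healthy exporter answers 200 with a non-empty metrics page.
            return "ok" if resp.status == 200 and body else "error"
    except socket.timeout:
        # Raised directly when waiting for, or reading, the response times out.
        return "timeout"
    except urllib.error.URLError as exc:
        # Connect-phase failures are wrapped in URLError; unwrap timeouts.
        if isinstance(exc.reason, socket.timeout):
            return "timeout"
        return "error"
    except (OSError, http.client.HTTPException):
        # Connection resets, malformed responses, and similar failures.
        return "error"
```

Calling `probe_exporter("http://<host>:<port>/metrics")` against each Cassandra instance, with the placeholders filled in for the local setup, would distinguish a hung exporter ("timeout") from one that is down or refusing connections ("error").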

Mentioned in SAL (#wikimedia-operations) [2018-04-19T20:48:12Z] <urandom> restarting cassandra to (temporarily) rollback prometheus jmx exporter -- T189822, T192456

Mentioned in SAL (#wikimedia-operations) [2018-04-19T20:48:24Z] <urandom> restarting cassandra to (temporarily) rollback prometheus jmx exporter, restbase1010-a -- T189822, T192456

Mentioned in SAL (#wikimedia-operations) [2018-04-19T21:11:56Z] <urandom> restarting cassandra to (temporarily) rollback prometheus jmx exporter, restbase1010-c -- T189822, T192456

This was resolved by the upgrade to 1:0.3.0.