Prometheus metrics missing for some hosts
Closed, Resolved, Public

Description

On some nodes of the RESTBase Cassandra cluster, the Prometheus metrics that come from the JMX exporter are missing.

Event Timeline

Eevans created this task. Apr 18 2018, 2:36 PM
Restricted Application added a subscriber: Aklapper. Apr 18 2018, 2:36 PM
Eevans triaged this task as High priority. Apr 18 2018, 2:36 PM

Mentioned in SAL (#wikimedia-operations) [2018-04-18T14:37:00Z] <urandom> restarting Cassandra, restbase1011-a -- T192456

Mentioned in SAL (#wikimedia-operations) [2018-04-18T14:55:19Z] <urandom> restarting Cassandra, restbase1011-a to test v 0.8 of Prometheus JMX exporter -- T192456

Eevans added a subscriber: elukey. Apr 18 2018, 3:08 PM

On the affected machines, running curl against the exporter URL hangs indefinitely. Restarting restbase1011-a did not help. I then live-hacked cassandra-env.sh to roll the exporter jar back to the 0.8 version we used before, and the exporter on that instance is now working. More investigation is needed.
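
For anyone wanting to check other instances without a curl that never returns, here is a minimal sketch of a probe with a timeout. It is not what was run on this task; the function name `probe_exporter`, the default timeout, and the URL in the comment are all hypothetical placeholders.

```python
import socket
import urllib.error
import urllib.request


def probe_exporter(url: str, timeout: float = 5.0) -> str:
    """Classify a metrics endpoint as 'ok', 'hung', or 'error'.

    'hung' means the connection or the response timed out, which is the
    symptom described above (a request that never returns).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # force the body to be read within the timeout
            return "ok" if resp.status == 200 else "error"
    except (socket.timeout, TimeoutError):
        # Timeout while waiting for or reading the response.
        return "hung"
    except urllib.error.URLError as exc:
        # urlopen wraps failures from the connect/send phase in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "hung"
        return "error"
    except OSError:
        return "error"


# Hypothetical usage; host, port, and path are assumptions, not from this task:
#   print(probe_exporter("http://cassandra-host.example:7800/metrics"))
```

Running this against each instance's exporter URL would separate instances that are hung from those that are simply down or refusing connections.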

Mentioned in SAL (#wikimedia-operations) [2018-04-19T20:48:12Z] <urandom> restarting cassandra to (temporarily) rollback prometheus jmx exporter -- T189822, T192456

Mentioned in SAL (#wikimedia-operations) [2018-04-19T20:48:24Z] <urandom> restarting cassandra to (temporarily) rollback prometheus jmx exporter, restbase1010-a -- T189822, T192456

Mentioned in SAL (#wikimedia-operations) [2018-04-19T21:11:56Z] <urandom> restarting cassandra to (temporarily) rollback prometheus jmx exporter, restbase1010-c -- T189822, T192456

Eevans closed this task as Resolved. Jul 3 2018, 10:09 PM

This was resolved by the upgrade of the Prometheus JMX exporter to 1:0.3.0.