JMX latency metrics (e.g. o.a.c.metrics:type=ColumnFamily,keyspace={keyspace},scope={table},name=ReadLatency) present in previous versions of Cassandra are missing in 3.7. They now use type=Table (e.g. o.a.c.metrics:type=Table,keyspace={keyspace},scope={table},name=ReadLatency).
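As a rough illustration of the rename, here is a minimal sketch that rewrites an old-style MBean name into the new form. The helper name `to_table_mbean` and the keyspace/table values are hypothetical, and `o.a.c` is written out as `org.apache.cassandra`; this is not the code used by cassandra-metrics-collector.

```python
def to_table_mbean(name: str) -> str:
    """Rewrite a type=ColumnFamily metric MBean name to the type=Table form."""
    domain, _, props = name.partition(":")
    # Key properties are comma-separated key=value pairs; only `type` changes.
    pairs = [p.split("=", 1) for p in props.split(",")]
    rewritten = [
        (key, "Table" if key == "type" and value == "ColumnFamily" else value)
        for key, value in pairs
    ]
    return domain + ":" + ",".join(f"{key}={value}" for key, value in rewritten)


# Hypothetical keyspace and table names, for illustration only.
old = (
    "org.apache.cassandra.metrics:"
    "type=ColumnFamily,keyspace=my_keyspace,scope=my_table,name=ReadLatency"
)
new = to_table_mbean(old)
print(new)
```

Names that already use type=Table pass through unchanged, so the rewrite is safe to apply more than once.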
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Eevans | T160570 Cassandra 3.x Tracking
Resolved | | Eevans | T163936 Latency metrics missing
Event Timeline
Created https://github.com/wikimedia/cassandra-metrics-collector/pull/19 (and merged)
Change 350485 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Blacklist type=Table metrics
Change 350503 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/software/cassandra-metrics-collector@master] Update collector version to 3.1.4
Change 350485 merged by Filippo Giunchedi:
[operations/puppet@production] Blacklist type=Table metrics
Mentioned in SAL (#wikimedia-operations) [2017-04-27T18:21:56Z] <urandom> T163936: restarting cassandra-metrics-collector, restbase staging
Mentioned in SAL (#wikimedia-operations) [2017-04-27T18:23:43Z] <urandom> T163936: restarting cassandra-metrics-collector, restbase production
Change 350632 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Link-in upgraded cassandra-metrics-collector jar
Change 350503 merged by Eevans:
[operations/software/cassandra-metrics-collector@master] Update collector version to 3.1.4
Change 350632 merged by Elukey:
[operations/puppet@production] Link-in upgraded cassandra-metrics-collector jar
Mentioned in SAL (#wikimedia-operations) [2017-04-28T13:44:21Z] <urandom> T163936: forcing puppet run on restbase1007
Mentioned in SAL (#wikimedia-operations) [2017-04-28T13:46:19Z] <urandom> T163936: restarting cassandra-metrics-collector on restbase1007
Mentioned in SAL (#wikimedia-operations) [2017-04-28T13:56:14Z] <urandom> T163936: restarting cassandra-metrics-collector, restbase production
Mentioned in SAL (#wikimedia-operations) [2017-04-28T18:14:02Z] <urandom> T163936: disabling puppet on restbase-dev1001 (t-shooting c-m-c)
Mentioned in SAL (#wikimedia-operations) [2017-04-28T19:08:59Z] <urandom> T163936: reenabling puppet on restbase-dev1001
This is now in place, but there seem to be issues with some of the new metrics. For example, o.a.c.metrics.Table.krv.data.LiveSSTableCount.value was initially missing entirely. After some time (and through no action on my part), metrics for 1002-b began to arrive, and then eventually for 1002-a as well.
At the time of this writing, 4 instances are still missing that particular metric. I sampled traffic from one of the affected hosts, 1001, and found it to be sending the metric for both of its instances.
The situation is similar for SSTables/read and columnfamily latency and rate.
@fgiunchedi I don't understand why this would only pertain to the new type=Table metrics, but it seems like a Graphite-side issue; can you see anything on that end that would be helpful?
Odd; the instances all show up now, but there is what looks like intermittent collection on several metrics.
The current leading theory is that Table metrics sometimes don't make it to the whisper files in time (to be confirmed).
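One way to quantify that kind of intermittent collection is to count the empty datapoints in a series. The following is only a sketch, assuming the data is in the shape returned by Graphite's render API with `format=json` (a list of objects with `target` and `datapoints`, where each datapoint is a `[value, timestamp]` pair and a missing value is `null`). The function name `missing_ratio` and the sample series are made up for illustration.

```python
import json


def missing_ratio(series: dict) -> float:
    """Return the fraction of datapoints in a series that have no value."""
    points = series["datapoints"]
    if not points:
        # An empty series is treated as entirely missing.
        return 1.0
    missing = sum(1 for value, _timestamp in points if value is None)
    return missing / len(points)


# Made-up sample in the render API's JSON shape; not real data from this task.
sample = json.loads(
    '[{"target": "example.metric", "datapoints": '
    "[[1.0, 1493300000], [null, 1493300060], "
    "[3.0, 1493300120], [null, 1493300180]]}]"
)

for series in sample:
    print(series["target"], missing_ratio(series))
```

A ratio well above zero for the type=Table series only, and near zero for the others, would be consistent with the theory above.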
I feel confident that we have confirmed this (it only ever manifested on Cassandra 3.x, see T164093: Increased cassandra-metrics-collector utilization w/ Cassandra 3.x), and it has since gone away. We can reopen this issue if the problem recurs.