Latency metrics missing
Closed, Resolved (Public)

Description

JMX latency metrics (e.g. o.a.c.metrics:type=ColumnFamily,keyspace={keyspace},scope={table},name=ReadLatency) present in previous versions of Cassandra are missing in 3.7. They now use type=Table (e.g. o.a.c.metrics:type=Table,keyspace={keyspace},scope={table},name=ReadLatency).
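To illustrate the rename, here is a minimal sketch of rewriting an old-style metric name into the new form. It is not taken from cassandra-metrics-collector; the function name `to_table_metric_name` is hypothetical, and the sketch assumes `o.a.c.metrics` abbreviates `org.apache.cassandra.metrics` and that only the `type` key property changed.

```python
import re

# Matches the "type=ColumnFamily" key property of a JMX ObjectName string,
# wherever it appears in the comma-separated property list.
_OLD_TYPE = re.compile(r"(?<=[:,])type=ColumnFamily(?=,|$)")


def to_table_metric_name(object_name: str) -> str:
    """Rewrite a type=ColumnFamily ObjectName to the type=Table form.

    Hypothetical helper for illustration only; names that do not use
    type=ColumnFamily are returned unchanged.
    """
    return _OLD_TYPE.sub("type=Table", object_name)


if __name__ == "__main__":
    old = ("org.apache.cassandra.metrics:"
           "type=ColumnFamily,keyspace=krv,scope=data,name=ReadLatency")
    print(to_table_metric_name(old))
```

A collector or dashboard that still queries the `type=ColumnFamily` names would need an equivalent translation, or would need to be pointed at the `type=Table` names directly.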

Event Timeline

Change 350485 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Blacklist type=Table metrics

https://gerrit.wikimedia.org/r/350485

greg subscribed.

(doesn't appear to be an incident that caused any outage)

Change 350503 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/software/cassandra-metrics-collector@master] Update collector version to 3.1.4

https://gerrit.wikimedia.org/r/350503

Change 350485 merged by Filippo Giunchedi:
[operations/puppet@production] Blacklist type=Table metrics

https://gerrit.wikimedia.org/r/350485

Mentioned in SAL (#wikimedia-operations) [2017-04-27T18:21:56Z] <urandom> T163936: restarting cassandra-metrics-collector, restbase staging

Mentioned in SAL (#wikimedia-operations) [2017-04-27T18:23:43Z] <urandom> T163936: restarting cassandra-metrics-collector, restbase production

Change 350632 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Link-in upgraded cassandra-metrics-collector jar

https://gerrit.wikimedia.org/r/350632

Change 350503 merged by Eevans:
[operations/software/cassandra-metrics-collector@master] Update collector version to 3.1.4

https://gerrit.wikimedia.org/r/350503

Change 350632 merged by Elukey:
[operations/puppet@production] Link-in upgraded cassandra-metrics-collector jar

https://gerrit.wikimedia.org/r/350632

Mentioned in SAL (#wikimedia-operations) [2017-04-28T13:44:21Z] <urandom> T163936: forcing puppet run on restbase1007

Mentioned in SAL (#wikimedia-operations) [2017-04-28T13:46:19Z] <urandom> T163936: restarting cassandra-metrics-collector on restbase1007

Mentioned in SAL (#wikimedia-operations) [2017-04-28T13:56:14Z] <urandom> T163936: restarting cassandra-metrics-collector, restbase production

Mentioned in SAL (#wikimedia-operations) [2017-04-28T18:14:02Z] <urandom> T163936: disabling puppet on restbase-dev1001 (t-shooting c-m-c)

Mentioned in SAL (#wikimedia-operations) [2017-04-28T19:08:59Z] <urandom> T163936: reenabling puppet on restbase-dev1001

This is now in place, but there seem to be issues with some of the new metrics. For example, o.a.c.metrics.Table.krv.data.LiveSSTableCount.value was initially missing entirely. After some time (and through no action on my part), metrics for 1002-b began to arrive, and then eventually for 1002-a as well.

Screenshot from 2017-04-28 14-27-24.png (966×2 px, 106 KB)

At the time of this writing, 4 instances are still missing that particular metric. I sampled traffic from one of the affected hosts, 1001, and found that it is sending the metric for both of its instances.

The situation is similar for SSTables/read and columnfamily latency and rate.

@fgiunchedi I don't understand why this would only pertain to the new type=Table metrics, but it seems like this is a Graphite-side issue; can you see anything that would be helpful on that end?

Odd; the instances all show up now, but you can see what looks like intermittent collection on several metrics:

screen.png (917×1 px, 263 KB)

https://grafana.wikimedia.org/dashboard/db/restbase-dev

Current leading theory is that Table metrics sometimes don't make it in time to the whisper files (to be confirmed)
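One way to put numbers on the intermittent collection is to measure the gaps in a series as Graphite returns it. The sketch below is illustrative only: it assumes datapoints shaped like the JSON render output, i.e. `[value, timestamp]` pairs with a null value where nothing was stored, and the helper names `missing_fraction` and `gap_runs` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# (value, unix timestamp); value is None where no datapoint was stored.
Datapoint = Tuple[Optional[float], int]


def missing_fraction(datapoints: Sequence[Datapoint]) -> float:
    """Return the share of datapoints whose value is None (0.0 if empty)."""
    if not datapoints:
        return 0.0
    missing = sum(1 for value, _ in datapoints if value is None)
    return missing / len(datapoints)


def gap_runs(datapoints: Sequence[Datapoint]) -> List[Tuple[int, int]]:
    """Return (first_ts, last_ts) for each consecutive run of None values."""
    runs: List[Tuple[int, int]] = []
    start: Optional[int] = None
    last: Optional[int] = None
    for value, ts in datapoints:
        if value is None:
            if start is None:
                start = ts  # a new gap begins here
            last = ts
        elif start is not None:
            runs.append((start, last))  # gap ended at the previous point
            start = None
    if start is not None:
        runs.append((start, last))  # series ends inside a gap
    return runs


if __name__ == "__main__":
    series = [(1.0, 0), (None, 60), (None, 120), (3.0, 180), (None, 240)]
    print(missing_fraction(series))
    print(gap_runs(series))
```

Comparing these figures for a type=Table metric against an unaffected metric from the same instance would help show whether the gaps are specific to the new metrics.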

Eevans edited projects, added Services (done); removed Services (next).

Current leading theory is that Table metrics sometimes don't make it in time to the whisper files (to be confirmed)

I feel confident that we have confirmed this (it only ever manifested on Cassandra 3.x, see: T164093: Increased cassandra-metrics-collector utilization w/ Cassandra 3.x), and it has since gone away; we can reopen this issue if the problem recurs.