
Latency metrics missing
Closed, Resolved · Public

Description

JMX latency metrics (e.g. o.a.c.metrics:type=ColumnFamily,keyspace={keyspace},scope={table},name=ReadLatency) that were present in previous versions of Cassandra are missing in 3.7. They are now exposed under type=Table instead (e.g. o.a.c.metrics:type=Table,keyspace={keyspace},scope={table},name=ReadLatency).
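To make the rename concrete, here is a minimal sketch of rewriting an old-style ObjectName into the type=Table form. It is only an illustration, not how cassandra-metrics-collector handles the change; the helper name `to_table_object_name` is hypothetical, `o.a.c` is expanded to `org.apache.cassandra`, and the keyspace and table values (`krv`, `data`) are used purely as examples.

```python
import re

# Matches the `type=ColumnFamily` key property, whether it follows the
# domain separator (:) or another key property (,).
_OLD_TYPE = re.compile(r"(?<=[:,])type=ColumnFamily(?=,|$)")


def to_table_object_name(object_name: str) -> str:
    """Rewrite a ColumnFamily-style ObjectName to the type=Table form.

    Hypothetical helper for illustration; names that do not contain
    type=ColumnFamily are returned unchanged.
    """
    return _OLD_TYPE.sub("type=Table", object_name)


# Example values only; any keyspace/table pair works the same way.
old = ("org.apache.cassandra.metrics:type=ColumnFamily,"
       "keyspace=krv,scope=data,name=ReadLatency")
print(to_table_object_name(old))
```

Only the type key property changes; the keyspace, scope, and name properties are carried over as-is.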

Details

Related Gerrit Patches:
[operations/puppet@production] Link-in upgraded cassandra-metrics-collector jar
[operations/software/cassandra-metrics-collector@master] Update collector version to 3.1.4
[operations/puppet@production] Blacklist `type=Table` metrics

Event Timeline

Eevans created this task. Apr 26 2017, 7:36 PM
Restricted Application removed a project: Patch-For-Review. Apr 26 2017, 7:36 PM
Eevans moved this task from Backlog to In-Progress on the Cassandra board. Apr 26 2017, 7:44 PM

Change 350485 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Blacklist type=Table metrics

https://gerrit.wikimedia.org/r/350485

greg added a subscriber: greg.

(doesn't appear to be an incident that caused any outage)

Change 350503 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/software/cassandra-metrics-collector@master] Update collector version to 3.1.4

https://gerrit.wikimedia.org/r/350503

Change 350485 merged by Filippo Giunchedi:
[operations/puppet@production] Blacklist type=Table metrics

https://gerrit.wikimedia.org/r/350485

Mentioned in SAL (#wikimedia-operations) [2017-04-27T18:21:56Z] <urandom> T163936: restarting cassandra-metrics-collector, restbase staging

Mentioned in SAL (#wikimedia-operations) [2017-04-27T18:23:43Z] <urandom> T163936: restarting cassandra-metrics-collector, restbase production

Change 350632 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Link-in upgraded cassandra-metrics-collector jar

https://gerrit.wikimedia.org/r/350632

Change 350503 merged by Eevans:
[operations/software/cassandra-metrics-collector@master] Update collector version to 3.1.4

https://gerrit.wikimedia.org/r/350503

Change 350632 merged by Elukey:
[operations/puppet@production] Link-in upgraded cassandra-metrics-collector jar

https://gerrit.wikimedia.org/r/350632

Mentioned in SAL (#wikimedia-operations) [2017-04-28T13:44:21Z] <urandom> T163936: forcing puppet run on restbase1007

Mentioned in SAL (#wikimedia-operations) [2017-04-28T13:46:19Z] <urandom> T163936: restarting cassandra-metrics-collector on restbase1007

Mentioned in SAL (#wikimedia-operations) [2017-04-28T13:56:14Z] <urandom> T163936: restarting cassandra-metrics-collector, restbase production

Mentioned in SAL (#wikimedia-operations) [2017-04-28T18:14:02Z] <urandom> T163936: disabling puppet on restbase-dev1001 (t-shooting c-m-c)

Mentioned in SAL (#wikimedia-operations) [2017-04-28T19:08:59Z] <urandom> T163936: reenabling puppet on restbase-dev1001

This is now in place, but there seem to be issues with some of the new metrics. For example, o.a.c.metrics.Table.krv.data.LiveSSTableCount.value was initially missing entirely. After some time (and through no action on my part), metrics for 1002-b began to arrive, and eventually for 1002-a as well.

At the time of this writing, 4 instances are still missing that particular metric. Of the affected hosts, I sampled traffic from 1001 and found that it is sending the metric for both of its instances.

The situation is similar for SSTables/read and columnfamily latency and rate.

@fgiunchedi I don't understand why this would only pertain to the new type=Table metrics, but it seems like a graphite-side issue. Can you see anything on that end that would be helpful?

Eevans added a comment. May 1 2017, 3:19 PM

Odd; the instances all show up now, but you can see what looks like intermittent collection on several metrics:

https://grafana.wikimedia.org/dashboard/db/restbase-dev

Current leading theory is that Table metrics sometimes don't make it in time to the whisper files (to be confirmed)
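One way to quantify this kind of intermittent collection is to count the empty slots in a series. The sketch below assumes datapoints shaped like those in Graphite's render API JSON output, i.e. (value, timestamp) pairs where a missing value is null (None in Python). The function names and the sample values are made up for illustration and are not taken from the dashboards above.

```python
from typing import List, Optional, Tuple

# (value, unix timestamp); value is None where no datapoint was stored.
Datapoint = Tuple[Optional[float], int]


def missing_ratio(datapoints: List[Datapoint]) -> float:
    """Return the fraction of slots in the series that have no value."""
    if not datapoints:
        return 0.0
    missing = sum(1 for value, _ in datapoints if value is None)
    return missing / len(datapoints)


def gap_timestamps(datapoints: List[Datapoint]) -> List[int]:
    """Return the timestamps of the slots that have no value."""
    return [ts for value, ts in datapoints if value is None]


# Hypothetical series with one-minute slots, two of which are empty.
sample = [
    (3.0, 1493650740),
    (None, 1493650800),
    (3.0, 1493650860),
    (None, 1493650920),
]
print(missing_ratio(sample))
print(gap_timestamps(sample))
```

Comparing the ratio for a type=Table metric against an unaffected metric from the same instance would show whether the gaps are specific to the new metrics.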

Eevans closed this task as Resolved. Jun 21 2017, 4:40 PM
Eevans edited projects, added Services (done); removed Services (next).

> Current leading theory is that Table metrics sometimes don't make it in time to the whisper files (to be confirmed)

I feel confident that we have confirmed this (it only ever manifested on Cassandra 3.x; see T164093: Increased cassandra-metrics-collector utilization w/ Cassandra 3.x), and it has since gone away. We can reopen this issue if the problem recurs.