
configure additional Cassandra metrics for graphing and/or alerts
Closed, Resolved, Public

Description

Proposed list of additional metrics.

| metric(s) | graph | alert | threshold | comments |
| --- | --- | --- | --- | --- |
| o.a.c.metrics.ClientRequestMetrics.ReadTimeouts | y | n | | count of timeout exceptions on read; no alert needed, RESTBase 5xx thresholds will capture this |
| o.a.c.metrics.ClientRequestMetrics.ReadUnavailables | y | n | | count of unavailable exceptions on read; no alert needed, RESTBase 5xx thresholds will capture this |
| o.a.c.metrics.ClientRequestMetrics.WriteTimeouts | y | n | | count of timeout exceptions on write; no alert needed, RESTBase 5xx thresholds will capture this |
| o.a.c.metrics.ClientRequestMetrics.WriteUnavailables | y | n | | count of unavailable exceptions on write; no alert needed, RESTBase 5xx thresholds will capture this |
| o.a.c.metrics.ColumnFamily.{keyspace}.{columnfamily}.EstimatedRowSizeHistogram | y | y | | broken (see: https://phabricator.wikimedia.org/T97024) |
| o.a.c.metrics.ColumnFamily.{keyspace}.{columnfamily}.EstimatedColumnCountHistogram | y | y | | broken (see: https://phabricator.wikimedia.org/T97024) |
| o.a.c.metrics.ColumnFamily.{keyspace}.{columnfamily}.ReadLatency | y | n | | no alert needed, RESTBase latency alerts will capture this |
| o.a.c.metrics.ColumnFamily.{keyspace}.{columnfamily}.WriteLatency | y | n | | no alert needed, RESTBase latency alerts will capture this |
| o.a.c.metrics.ColumnFamily.{keyspace}.{columnfamily}.SSTablesPerReadHistogram | y | y | warn=6, critical=10 | |
| o.a.c.metrics.ColumnFamily.{keyspace}.{columnfamily}.TombstoneScannedHistogram | y | y | warn=1000, critical=1500 | |
| o.a.c.metrics.ColumnFamily.{keyspace}.{columnfamily}.CompressionRatio | y | n | | |
| o.a.c.metrics.ColumnFamily.{keyspace}.{columnfamily}.KeyCacheHitRate | y | n | | |
| o.a.c.metrics.FileCache.HitRate | y | n | | |
| o.a.c.metrics.FileCache.Hits | y | n | | |
| o.a.c.metrics.FileCache.Requests | y | n | | |
| o.a.c.metrics.FileCache.Size | y | n | | |
| o.a.c.metrics.ReadRepair.Attempted | y | n | | |
| o.a.c.metrics.ReadRepair.RepairBackground | y | n | | |
| o.a.c.metrics.ReadRepair.RepairedBlocking | y | n | | |
| o.a.c.metrics.Storage.Exceptions | y | y | warn=5, critical=10 | |
| o.a.c.metrics.Storage.TotalHints | y | y | warn=1000, critical=2000 | |
| o.a.c.metrics.Storage.TotalHintsInProgress | y | n | | |
| o.a.c.metrics.KeyCache.Hits | y | n | | |
| o.a.c.metrics.KeyCache.Requests | y | n | | |
| o.a.c.metrics.RowCache.Hits | y | n | | |
| o.a.c.metrics.RowCache.Requests | y | n | | |
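
The warn/critical pairs proposed above can be read as a two-level check on each sampled value. The following is only a minimal sketch under stated assumptions: that a higher value is worse, and that a value at or above a threshold triggers that level. The `THRESHOLDS` mapping and the `classify` function are hypothetical names used for illustration; they are not the alert configuration from the actual patch.

```python
# Threshold values are copied from the proposed list; the dict layout and
# the shortened metric names are assumptions made for this illustration.
THRESHOLDS = {
    "SSTablesPerReadHistogram": (6, 10),
    "TombstoneScannedHistogram": (1000, 1500),
    "Storage.Exceptions": (5, 10),
    "Storage.TotalHints": (1000, 2000),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a sampled metric value.

    Assumes higher is worse and that thresholds are inclusive.
    """
    warn, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"
```

For example, under these assumptions `classify("Storage.TotalHints", 1200)` falls between the two thresholds and would be reported as a warning.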

Event Timeline

Eevans raised the priority of this task from to Needs Triage.
Eevans updated the task description.
Eevans added a subscriber: Eevans.
Eevans renamed this task from WIP: configure additional Cassandra metrics for graphing and/or alerts to configure additional Cassandra metrics for graphing and/or alerts. Jun 8 2015, 9:50 PM
Eevans updated the task description.

JSON exports of the new Grafana dashboards:

Eevans updated the task description.

In addition to these new dashboards, I would also suggest refactoring http://grafana.wikimedia.org/#/dashboard/db/cassandra-restbase-eqiad a bit.

I would like to:

My reasoning is that (a) with fewer graphs, that dashboard will become more responsive (particularly for the larger time ranges), and (b) renaming will make it consistent with the others.

@Eevans, I actually like having one dashboard with all important cassandra metrics. I think http://grafana.wikimedia.org/#/dashboard/db/cassandra-restbase-eqiad is actually pretty close to that.

Do you mind creating separate dashboards if you think that cassandra-restbase-eqiad has too much stuff on it?

> @Eevans, I actually like having one dashboard with all important cassandra metrics. I think http://grafana.wikimedia.org/#/dashboard/db/cassandra-restbase-eqiad is actually pretty close to that.

I would tend to agree, if that page were more usable (to me).

> Do you mind creating separate dashboards if you think that cassandra-restbase-eqiad has too much stuff on it?

No, I don't mind.

Also, unless there are objections, I will change all of these to UTC (they currently express date/time using the browser's locale). Given our distributed team, it'll make it easier for us to refer to time ranges in a common timezone, and it will be easier to compare with other graphs (e.g. Ganglia).

> Also, unless there are objections, I will change all of these to UTC (they currently express date/time using the browser's locale). Given our distributed team, it'll make it easier for us to refer to time ranges in a common timezone, and it will be easier to compare with other graphs (e.g. Ganglia).

On a related note, Grafana 2.x has a 'share' feature that provides a permanent link to the data currently displayed. Given how often we've attached screenshots to Phabricator issues, that would be a nice feature to have.

Note that I have updated https://grafana.wikimedia.org/#/dashboard/db/cassandra-restbase-eqiad to use wildcard metric patterns throughout. This makes sure that new hosts show up automatically as we add them.
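
To illustrate why wildcard patterns pick up new hosts without further dashboard edits, here is a minimal sketch. It assumes Graphite-style globbing, where `*` matches within a single dot-separated node of the metric path. The metric paths, the host names, and the `matches` helper are all hypothetical; the real naming scheme behind the dashboard may differ.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, metric: str) -> bool:
    """Match a dotted metric path against a pattern, one path node at a time.

    Splitting on '.' keeps '*' from spanning several nodes, which is the
    assumed Graphite-style behavior.
    """
    p_nodes, m_nodes = pattern.split("."), metric.split(".")
    return len(p_nodes) == len(m_nodes) and all(
        fnmatchcase(m, p) for p, m in zip(p_nodes, m_nodes)
    )


# Hypothetical metric paths, for illustration only.
PATTERN = "cassandra.restbase10*.Storage.TotalHints"
metrics = [
    "cassandra.restbase1001.Storage.TotalHints",
    "cassandra.restbase1002.Storage.TotalHints",
    "cassandra.restbase1007.Storage.TotalHints",  # a newly added host
    "cassandra.restbase1001.Storage.Exceptions",
]

# Every host whose name fits the pattern is selected, including new ones.
selected = [m for m in metrics if matches(PATTERN, m)]
```

With a hard-coded host list, the newly added host would be missing from the graph until someone edited the dashboard; with the pattern, it is selected as soon as it starts reporting.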

Change 218408 had a related patch set uploaded (by Eevans):
WIP: configure additional Cassandra metric alerts

https://gerrit.wikimedia.org/r/218408

> Also, unless there are objections, I will change all of these to UTC (they currently express date/time using the browser's locale). Given our distributed team, it'll make it easier for us to refer to time ranges in a common timezone, and it will be easier to compare with other graphs (e.g. Ganglia).

Done.

Added dashboards for dropped messages and thread-pool pending tasks, both of which have proven to be excellent problem indicators.

http://grafana.wikimedia.org/#/dashboard/db/restbase-cassandra-dropped-messages
http://grafana.wikimedia.org/#/dashboard/db/restbase-cassandra-thread-pools

Change 218408 merged by Filippo Giunchedi:
configure additional Cassandra metric alerts

https://gerrit.wikimedia.org/r/218408

Change 221815 had a related patch set uploaded (by Filippo Giunchedi):
restbase: drop illegal characters from alarm description

https://gerrit.wikimedia.org/r/221815

Change 221815 merged by Filippo Giunchedi:
restbase: drop illegal characters from alarm description

https://gerrit.wikimedia.org/r/221815

It seems the non-heap usage graph has started showing invalid values today (it was working fine yesterday).

shot-2015-07-01_10-40-24.jpg (1×2 px, 134 KB)

> It seems the non-heap usage graph has started showing invalid values today (it was working fine yesterday).

This behavior is limited to the nodes that are running JDK 8 (and these are in fact the values being returned).

Let's let the dust settle on this, wait until all nodes are on the same JVM version, and tackle anomalous metrics (if any) at that time.

Eevans updated the task description.

A preliminary dashboard of estimated row size and column count has been created: http://grafana.wikimedia.org/#/dashboard/db/restbase-cassandra-cf-rowcolumn-size

A patch to cassandra-metrics-collector to fix the incorrectly reported non-heap usage on JDK 8 can be found here: https://github.com/wikimedia/cassandra-metrics-collector/pull/2

Change 230582 had a related patch set uploaded (by Eevans):
update collector version

https://gerrit.wikimedia.org/r/230582

Change 230589 had a related patch set uploaded (by Eevans):
update cassandra-metrics-collector version

https://gerrit.wikimedia.org/r/230589

Change 230582 merged by Filippo Giunchedi:
update collector version

https://gerrit.wikimedia.org/r/230582

Change 230589 merged by Filippo Giunchedi:
update cassandra-metrics-collector version

https://gerrit.wikimedia.org/r/230589

Eevans claimed this task.

The non-heap usage graphs are now sane: http://grafana.wikimedia.org/#/dashboard/db/restbase-cassandra-gc?from=now-2h&to=now&var-node=All&panelId=35&fullscreen

I believe that addresses everything within the scope of this issue; closing.