Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts
Closed, Resolved · Public

Description

We should set up graphite monitoring for important cassandra metrics like JVM heap space, response times, errors etc.

This blog post mentions that reporters are already built in, so we might only need to drop a line into cassandra-env.sh and add a metric config similar to the one documented there. We should probably organize the metric hierarchy by cluster name and DC in graphite.

Event Timeline

Thank you @fgiunchedi for the very detailed metrics now available!

A dashboard to track heap metrics is now set up at http://grafana.wikimedia.org/#/dashboard/db/cassandra-heap

This dashboard is fairly performance-oriented. We should set up a cassandra health dashboard as well. Candidate metrics:

  • node uptime
  • request latency and error rates
  • disk space and bottom-line heap usage (to warn on impending OOM)
GWicke renamed this task from "Detailed cassandra monitoring" to "Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts". Mar 5 2015, 9:07 PM

Change 199264 had a related patch set uploaded (by Eevans):
send additional metrics to graphite

https://gerrit.wikimedia.org/r/199264

@Eevans, icinga alerts on graphite data are set up in puppet like this. We have a services contact group that mails services@, which we should probably configure as the contact for these alerts as a first step. Not sure where that is done, but @Dzahn or @yuvipanda might know.

I think we need to have graphs set up for most/all of the metrics we are collecting (including those to be added by https://gerrit.wikimedia.org/r/199264). I know that as it stands now you can build ad hoc plots in Graphite as needed, but having these predefined lowers the barrier to getting at the data. It's also not going to be obvious to everyone, just from the metric names/hierarchy, what these values mean. Preparing them ahead of time means that metrics can be grouped into graphs, and graphs into pages, in whatever way makes sense (CF statistics, caching, repair, etc.), with contextually appropriate labels, aggregation, and scaling.

Depending on the tooling available, this is potentially a lot of work (we're talking about quite a few graphs).

One option is Graphite, but I'm unable to log in due to T93158, so I'm not sure how easy this would be there. Grafana would probably work; its templating seems to be a good fit for this kind of thing (http://docs.grafana.org/reference/templating), but I have had many problems doing even the simplest of things (adding graph panels, deleting dashboards, etc.). It also seems to be wide open to vandalism (T93710).
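One thing that might help, regardless of the tool: since the metric hierarchy is regular, we could generate the dashboard JSON from a list of metric groups rather than clicking panels together one by one. A rough sketch of the idea only — the dashboard JSON schema varies between Grafana versions, and the node list and metric leaf names below are placeholders, not necessarily what the reporter actually emits:

import json

# Sketch: build one graph panel per metric group, one target per node.
# NODES, GROUPS and the leaf names are placeholders for illustration.
NODES = ['restbase1001', 'restbase1002', 'restbase1003']
GROUPS = {
    'Client request read latency':
        'cassandra.{node}.org.apache.cassandra.metrics.ClientRequest.Read.Latency.95percentile',
    'Pending compactions':
        'cassandra.{node}.org.apache.cassandra.metrics.Compaction.PendingTasks.value',
    'Dropped messages':
        'cassandra.{node}.org.apache.cassandra.metrics.DroppedMessage.MUTATION.Dropped.count',
}

def panel(title, targets):
    return {'type': 'graph', 'title': title,
            'targets': [{'target': t} for t in targets]}

dashboard = {
    'title': 'Cassandra health (generated)',
    'rows': [{'title': name,
              'panels': [panel(name, [path.format(node=n) for n in NODES])]}
             for name, path in sorted(GROUPS.items())],
}
print(json.dumps(dashboard, indent=2))

That would keep the grouping, labels and scaling decisions in one place instead of spread across dozens of hand-built panels.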

Does anyone have any suggestions for how to tackle this?

Also, I think we need availability monitoring of Cassandra that goes beyond monitoring the process. Though unlikely, it's possible for the process to be present even though the node is unable to answer a query. In short: a TCP check of port 9042 would be better, and a custom check that runs a simple CQL query (for example: select host_id from system.local limit 1) would be better still.

Is it enough to write an Icinga check script for this?
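Something along these lines, perhaps — a rough sketch only, using the python cassandra-driver; the host, port and output format are placeholders rather than what we'd actually deploy:

#!/usr/bin/env python
# Sketch of an Icinga/NRPE-style availability check: run a trivial CQL query
# and map the outcome to nagios exit codes. Host/port are placeholders.
import sys
from cassandra.cluster import Cluster


def main():
    try:
        cluster = Cluster(['127.0.0.1'], port=9042)
        session = cluster.connect()
        row = list(session.execute('SELECT host_id FROM system.local LIMIT 1'))[0]
        cluster.shutdown()
    except Exception as e:
        print('CRITICAL: CQL query failed: %s' % e)
        return 2
    print('OK: CQL reachable, host_id = %s' % row.host_id)
    return 0


if __name__ == '__main__':
    sys.exit(main())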

As for alerts, we also need a small number of threshold-generated notifications for some of these performance metrics. In particular, T93140 has demonstrated that we need to be made aware when compaction starts to back up.
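For the compaction case, even a naive threshold check against graphite's render API would do as an interim. A rough sketch — the metric path, graphite URL and thresholds are placeholders, and in practice this would presumably be expressed via the existing puppet/icinga graphite threshold checks instead:

#!/usr/bin/env python
# Sketch: warn/crit when pending compactions are high. Metric path, graphite
# URL and thresholds below are placeholders, not tuned values.
import json
import sys
import urllib2

GRAPHITE = 'http://graphite.wikimedia.org/render'
TARGET = 'cassandra.restbase1001.org.apache.cassandra.metrics.Compaction.PendingTasks.value'
WARN, CRIT = 50, 200

url = '%s?target=%s&from=-15min&format=json' % (GRAPHITE, TARGET)
datapoints = json.load(urllib2.urlopen(url))[0]['datapoints']
values = [v for v, _ts in datapoints if v is not None]
latest = values[-1] if values else 0

if latest >= CRIT:
    print('CRITICAL: %d pending compactions' % latest)
    sys.exit(2)
elif latest >= WARN:
    print('WARNING: %d pending compactions' % latest)
    sys.exit(1)
print('OK: %d pending compactions' % latest)
sys.exit(0)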

Change 199264 merged by Filippo Giunchedi:
send additional metrics to graphite

https://gerrit.wikimedia.org/r/199264

FYI, after this merge I roll-restarted the cluster and things are looking good.

btw that added a lot of metrics, consuming ~9% of disk space on graphite1001, e.g.

graphite1001:/var/lib/carbon/whisper$ du -hcs cassandra/restbase1001/org/apache/cassandra/metrics/*
5.6M	cassandra/restbase1001/org/apache/cassandra/metrics/CQL
46M	cassandra/restbase1001/org/apache/cassandra/metrics/Cache
173M	cassandra/restbase1001/org/apache/cassandra/metrics/ClientRequest
4.5M	cassandra/restbase1001/org/apache/cassandra/metrics/ClientRequestMetrics
11G	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily
37M	cassandra/restbase1001/org/apache/cassandra/metrics/CommitLog
9.0M	cassandra/restbase1001/org/apache/cassandra/metrics/Compaction
73M	cassandra/restbase1001/org/apache/cassandra/metrics/Connection
51M	cassandra/restbase1001/org/apache/cassandra/metrics/DroppedMessage
14M	cassandra/restbase1001/org/apache/cassandra/metrics/FileCache
948M	cassandra/restbase1001/org/apache/cassandra/metrics/IndexColumnFamily
17M	cassandra/restbase1001/org/apache/cassandra/metrics/ReadRepair
4.5M	cassandra/restbase1001/org/apache/cassandra/metrics/Storage
148M	cassandra/restbase1001/org/apache/cassandra/metrics/ThreadPools
13G	total
graphite1001:/var/lib/carbon/whisper$ du -hcs cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/*
132M	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/all
474M	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/local_group_phase0_T_parsoid_dataDVIsgzJSne8kBhf
474M	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/local_group_phase0_T_parsoid_dataW4ULtxs1oMqJeY1
474M	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/local_group_phase0_T_parsoid_html
474M	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/local_group_phase0_T_parsoid_wikitext
948M	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/local_group_phase0_T_title__revisions
474M	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/local_group_wikipedia_T_parsoid_dataDVIsgzJSne8k
474M	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ
474M	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/local_group_wikipedia_T_parsoid_html
474M	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/local_group_wikipedia_T_parsoid_wikitext
948M	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/local_group_wikipedia_T_title__revisions
4.0G	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/system
711M	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/system_auth
474M	cassandra/restbase1001/org/apache/cassandra/metrics/ColumnFamily/system_traces
11G	total
graphite1001:/var/lib/carbon/whisper$

I think we need to filter the column family metrics to the relevant ones (at least). Also not sure what's with the random suffix on some of the keyspace names (e.g. local_group_phase0_T_parsoid_dataDVIsgzJSne8kBhf).

I think we need to filter the column family metrics to the relevant ones (at least)

Agreed. IMHO, system could be dropped; we are really interested in the data keyspaces.

not sure what's with the random suffix for example

That's added by RESTBase when creating the keyspaces.

I think we need to filter the column family metrics to the relevant ones (at least)

Agreed. IMHO, system could be dropped; we are really interested in the data keyspaces.

ack, that will help too!

not sure what's with the random suffix for example

That's added by RESTBase when creating the keyspaces.

does it change and/or expire, or is it fixed?

btw that added a lot of metrics, consuming ~9% of disk space on graphite1001, e.g.

Is that too much?

I think we need to filter the column family metrics to the relevant ones (at least), not sure what's with the random suffix for example

We could probably live without system, but I don't think that's going to change things overall by much.

not sure what's with the random suffix for example

That's added by RESTBase when creating the keyspaces.

does it change and/or expire, or is it fixed?

It is fixed. Cassandra only allows a very limited character set for keyspace names, which does not even include a hyphen. Thus, we use some hashing tricks to generate a mostly-readable yet unique table name even if the original name included invalid characters (in this case 'data-parsoid' and 'data-mw').
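For illustration only — this is not RESTBase's actual code, just the general shape of the trick:

# Illustration of the general idea, not the actual RESTBase implementation:
# strip characters Cassandra won't accept and append a short, stable hash of
# the original name, so that e.g. 'data-parsoid' and 'data-mw' stay unique
# while the keyspace name remains mostly readable.
import hashlib
import re


def keyspace_component(name):
    safe = re.sub(r'[^A-Za-z0-9]', '', name)
    if safe == name:
        return name  # nothing stripped, no disambiguating suffix needed
    digest = hashlib.sha1(name.encode('utf-8')).hexdigest()
    return safe[:20] + digest[:12]


print(keyspace_component('data-parsoid'))  # same input always yields the same suffix
print(keyspace_component('data-mw'))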

btw that added a lot of metrics, consuming ~9% of disk space on graphite1001, e.g.

Is that too much?

Unfortunately, at the current rate it is; graphite has ~800GB with 90% used now. See also https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=graphite1001.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1426676089&v=85.0&m=part_max_used&vl=%25&ti=Maximum%20Disk%20Space%20Used&z=large

I think we need to filter the column family metrics to the relevant ones (at least), not sure what's with the random suffix for example

We could probably live without system, but I don't think that's going to change things overall by much.

yep, that will be great; looks like system takes up 30% of all cassandra metrics

@fgiunchedi: Since we depend on this & are on track to add more stats with each service -- what is the plan / timeline for scaling graphite? Can we plug in a large SSD to avoid storage space being the bottleneck in the short term?

Also created T93942 for cleaning up old restbase metrics that just take up disk space at this point.

Edit: Just found & subscribed to T85451.

@fgiunchedi: Since we depend on this & are on track to add more stats with each service -- what is the plan / timeline for scaling graphite? Can we plug in a large SSD to avoid storage space being the bottleneck in the short term?

Sadly, I think the machine is already maxed out with 4x SSD. How many metrics did you have in mind? An order-of-magnitude figure is fine; each metric takes ~1MB, with ~84G left on the filesystem and another ~200GB left in the VG.

Sadly, I think the machine is already maxed out with 4x SSD. How many metrics did you have in mind? An order-of-magnitude figure is fine; each metric takes ~1MB, with ~84G left on the filesystem and another ~200GB left in the VG.

For example, more Cassandra nodes will add more metrics. Once we restart the test cluster, that will add three nodes' worth of metrics, followed likely by three more production nodes. Each service will have its own set of metrics, although those are typically aggregated by service (not by node) & thus don't take up a lot of space.

Ganglia suggests that the graphite box is only lightly loaded, with the main bottleneck being disk space. Could we replace the current small SSDs with something larger, like 1TB Samsung 850 Pros (around $630)?
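For a rough sense of scale, using only numbers already in this thread (~13G of cassandra metrics per node from the du output above, ~84G free on the filesystem, ~200G left in the VG), the six additional nodes alone would roughly eat the remaining filesystem, which is why trimming the ColumnFamily metrics matters. Back-of-the-envelope:

# Back-of-the-envelope; all figures approximate and taken from this thread.
per_node_gb = 13.0   # cassandra metrics per node (du output above, pre-trim)
free_fs_gb = 84.0    # space left on graphite1001's filesystem
free_vg_gb = 200.0   # unallocated space left in the volume group

print('~%.0f more nodes fit on the current filesystem' % (free_fs_gb / per_node_gb))
print('~%.0f more nodes fit if the remaining VG space is allocated as well'
      % ((free_fs_gb + free_vg_gb) / per_node_gb))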

Change 199998 had a related patch set uploaded (by Eevans):
trim list of Cassandra metrics

https://gerrit.wikimedia.org/r/199998

To summarize the conversation on #wikimedia-operations earlier today, there isn't enough Graphite storage to accommodate the newly added metrics for all Cassandra hosts.

While there aren't any of these metrics I would classify as unneeded or optional (certainly not enough to address the deficit), if we don't have the space, we don't have the space.

https://gerrit.wikimedia.org/r/199998 should filter out org.apache.cassandra.metrics.ColumnFamily.{system, system_auth, system_traces}.

@fgiunchedi, let me know if this will be enough to bring us within bounds.

15:14 < godog> urandom gwicke the additional cassandra metrics from the test cluster are going to fill up graphite's disk btw
15:23 < gwicke> godog: maybe, urandom knows more about which we absolutely need & which are optional
15:28 < urandom> gwicke, godog: I'm not sure how to answer that question when it is framed as "need".
15:29 < urandom> if there are any metrics that are truly extraneous, it's so few it won't make a difference
15:31 < urandom> godog: but it sounds like there is only room to have so many, so we should get rid of some, whether they are needed or not

Change 199998 merged by Filippo Giunchedi:
trim list of Cassandra metrics

https://gerrit.wikimedia.org/r/199998

Change 215004 had a related patch set uploaded (by GWicke):
Add basic alerts on RESTBase error rates and storage latencies

https://gerrit.wikimedia.org/r/215004

Change 215004 merged by Filippo Giunchedi:
Add basic alerts on RESTBase error rates and storage latencies

https://gerrit.wikimedia.org/r/215004

Change 216863 had a related patch set uploaded (by Filippo Giunchedi):
restbase: add error rates and storage latencies alerts

https://gerrit.wikimedia.org/r/216863

Change 216863 merged by Filippo Giunchedi:
restbase: add error rates and storage latencies alerts

https://gerrit.wikimedia.org/r/216863

Change 216892 had a related patch set uploaded (by Filippo Giunchedi):
restbase: fix graphite description

https://gerrit.wikimedia.org/r/216892

Change 216892 merged by Filippo Giunchedi:
restbase: fix graphite description

https://gerrit.wikimedia.org/r/216892

Change 216893 had a related patch set uploaded (by Filippo Giunchedi):
icinga: add team-services contactgroup

https://gerrit.wikimedia.org/r/216893

Change 216893 merged by Filippo Giunchedi:
icinga: add team-services contactgroup

https://gerrit.wikimedia.org/r/216893

@fgiunchedi, thank you for working out the nagios issues!

Are the alerts on error rates and response times now active?

@GWicke yep they are, going to team-services (IOW emailing services@)

However, it looks like the 5xx alert gets some undefined datapoints:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=graphite1001&service=RESTBase+req%2Fs+returning+5xx

Hmm, that's odd. I wonder if it's related to selecting from the '10min' interval, although there are other alerts using that which I assume are working.

I lifted the corresponding graphite line from http://grafana.wikimedia.org/#/dashboard/db/restbase, and it doesn't use any functions that could complicate matters.

Change 217447 had a related patch set uploaded (by Filippo Giunchedi):
restbase: use transformNull for graphite metrics

https://gerrit.wikimedia.org/r/217447

Yup, that seems to work: http://graphite.wikimedia.org/render/?width=945&height=371&_salt=1433956794.562&target=transformNull%28restbase.v1_page_html_-title-_-revision--_tid-.GET.5xx.sample_rate,%200%29&from=-1hours

See the related review; note that this will also mask legitimate cases where the metric isn't being pushed anymore, but I'm not seeing any way around that ATM.

Change 217447 merged by Filippo Giunchedi:
restbase: use transformNull for graphite metrics

https://gerrit.wikimedia.org/r/217447

Metrics have been less than reliable recently. We have observed several instances of metric reporting stopping after encountering errors in the graphite connection. https://github.com/dropwizard/metrics/commit/dd0fca2f4d89497ee57284410c2db0c72a63296a looks like a fix for the same issue, but is in the newer dropwizard copy of the reporter.

I tried dropping the dropwizard core and graphite jars into lib, but it seems that Cassandra is built against the yammer 2.2.0 version. If it is too complex to upgrade the reporting to dropwizard, then we could try to back-port the fix to 2.2.0.

Metrics have been less than reliable recently. We have observed several instances of metric reporting stopping after encountering errors in the graphite connection. https://github.com/dropwizard/metrics/commit/dd0fca2f4d89497ee57284410c2db0c72a63296a looks like a fix for the same issue, but is in the newer dropwizard copy of the reporter.

I tried dropping the dropwizard core and graphite jars into lib, but it seems that Cassandra is built against the yammer 2.2.0 version. If it is too complex to upgrade the reporting to dropwizard, then we could try to back-port the fix to 2.2.0.

A lot has changed since 2.2.0 and the code change referenced here isn't applicable to what we have, TTBMK.

Have you actually observed a case where the reporter indefinitely stops after a Graphite error? I have observed cases where after a gap in the graphs, I was able to find connection errors in the logs, but the reporter always resumed activity when able to. This seems consistent with the 2.2.0 code, where a new connection is created for each report cycle, and where any connection-related exception should be handled. See:

https://github.com/dropwizard/metrics/blob/v2.2.0/metrics-graphite/src/main/java/com/yammer/metrics/reporting/GraphiteReporter.java#L203

In production, on 2.1.6, I have witnessed (consistent, predictable) behavior where the reporter reports exactly once after startup, and then ceases to ever report again, with nothing whatsoever logged.

We have seen reporting stop on 2.1.3, 2.1.6 and 2.1.7. It seems to happen more quickly on 2.1.7 (often after < 12 hours) than on 2.1.3 (where it was usually fine for weeks).

Maybe, can you elaborate?

..com/yammer/metrics/reporting/GraphiteReporter.java#L229 should call Socket#close so long as it is not null (and so long as it actually has a close method). Without the test for being connected / already closed it might throw, but any IOException from Socket#close is handled, and subsequent report cycles will create an entirely new socket.

Without the test for being connected / already closed it might throw, but any IOException from Socket#close is handled, and subsequent report cycles will create an entirely new socket.

Yeah, that does indeed look like it *ought* to work. Is there any chance of getting an exception that's not an IOException from .close()?

Without the test for being connected / already closed it might throw, but any IOException from Socket#close is handled, and subsequent report cycles will create an entirely new socket.

Yeah, that does indeed look like it *ought* to work. Is there any chance of getting an exception that's not an IOException from .close()?

It would have to be an unchecked exception (RuntimeException or one of its subclasses), which seems pretty unlikely coming from Socket. And, an unhandled exception here should still show up in the logs.

A stopgap for the metrics, suggested by @ori, is to whip up a diamond collector to extract essential cassandra metrics from either nodetool or JMX. That won't give us the complete set, but at least some basic troubleshooting indicators. Thoughts?
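To make it concrete, something like the sketch below — illustration only; the actual collector in the review that follows may look different, and parsing nodetool tpstats is just one candidate source:

# Sketch of a diamond collector pulling a few basic indicators out of
# `nodetool tpstats`. Metric names and the nodetool-vs-JMX choice are open.
import subprocess

import diamond.collector


class CassandraNodetoolCollector(diamond.collector.Collector):

    def collect(self):
        try:
            output = subprocess.check_output(['nodetool', 'tpstats'])
        except (OSError, subprocess.CalledProcessError):
            return
        for line in output.splitlines():
            fields = line.split()
            # thread pool rows look like: <PoolName> Active Pending Completed ...
            if len(fields) >= 3 and fields[1].isdigit() and fields[2].isdigit():
                pool = fields[0]
                self.publish('tpstats.%s.active' % pool, int(fields[1]))
                self.publish('tpstats.%s.pending' % pool, int(fields[2]))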

Change 220650 had a related patch set uploaded (by Filippo Giunchedi):
diamond: add cassandra collector for basic metrics

https://gerrit.wikimedia.org/r/220650

Change 220650 merged by Filippo Giunchedi:
diamond: add cassandra collector for basic metrics

https://gerrit.wikimedia.org/r/220650

Change 221773 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: enable diamond collector

https://gerrit.wikimedia.org/r/221773

Change 221773 merged by Filippo Giunchedi:
cassandra: enable diamond collector

https://gerrit.wikimedia.org/r/221773

We discussed this on IRC, but didn't mention it here yet: The histograms reported by the graphite reporter aren't compatible with our graphite install:

13/07/2015 21:59:04 :: [listener] invalid line received from client 127.0.0.1:58728, ignoring 'cassandra.praseodymium.org.apache.cassandra.metrics.ColumnFamily.local_group_wikipedia_T_parsoid_section_offsets.data.EstimatedColumnCountHistogram.value [J@79bcbc2 1436824731'

We discussed this on IRC, but didn't mention it here yet: The histograms reported by the graphite reporter aren't compatible with our graphite install:

13/07/2015 21:59:04 :: [listener] invalid line received from client 127.0.0.1:58728, ignoring 'cassandra.praseodymium.org.apache.cassandra.metrics.ColumnFamily.local_group_wikipedia_T_parsoid_section_offsets.data.EstimatedColumnCountHistogram.value [J@79bcbc2 1436824731'

This is because the value here is a long[], which is obviously something Graphite cannot use (I consider this a Cassandra bug). We could work around this fairly easily in https://github.com/eevans/cassandra-metrics-collector though (by expanding the array into separate metrics). It might also be worth pushing a fix upstream as well (same thing: add alternative metrics for the expanded array).
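For illustration, the expansion would look roughly like this — a sketch of the idea only, not the actual collector code (the real fix would of course live in the java collector):

# Instead of emitting the long[] as one bogus value, emit one graphite
# metric line per bucket. Names and layout here are hypothetical.
def expand_histogram(metric_name, buckets, timestamp):
    """buckets: the long[] estimated-histogram payload, as a python list."""
    return ['%s.bucket_%d %d %d' % (metric_name, i, count, timestamp)
            for i, count in enumerate(buckets)]

# e.g. ...EstimatedColumnCountHistogram.value would become
# ...EstimatedColumnCountHistogram.value.bucket_0, .bucket_1, ... each with
# a plain numeric value that graphite can store.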

We discussed this on IRC, but didn't mention it here yet: The histograms reported by the graphite reporter aren't compatible with our graphite install:

13/07/2015 21:59:04 :: [listener] invalid line received from client 127.0.0.1:58728, ignoring 'cassandra.praseodymium.org.apache.cassandra.metrics.ColumnFamily.local_group_wikipedia_T_parsoid_section_offsets.data.EstimatedColumnCountHistogram.value [J@79bcbc2 1436824731'

I reported this at the end of April in T97024: some cassandra metrics sent with invalid values; it is a bug in cassandra, as @Eevans mentioned.

Change 229387 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: remove obsolete diamond collector

https://gerrit.wikimedia.org/r/229387

Change 229387 merged by Filippo Giunchedi:
cassandra: remove obsolete diamond collector

https://gerrit.wikimedia.org/r/229387

Closing as resolved, considering our fairly good alert coverage at this point. We can track further improvements to our alerts in separate tasks.

Change 247910 had a related patch set uploaded (by Dzahn):
Disable AQS cassandra CQL interface check until AQS is production ready

https://gerrit.wikimedia.org/r/247910

Change 247910 abandoned by Ottomata:
Disable AQS cassandra CQL interface check until AQS is production ready

Reason:
not needed

https://gerrit.wikimedia.org/r/247910