Cassandra/CQL query interface monitoring
Closed, ResolvedPublic

Authored By

	Eevans
	Mar 25 2015, 2:31 PM

Description

We need availability monitoring of Cassandra that goes beyond monitoring the process. Though unlikely, it's possible for the process to be present, even though the node is unable to answer a query. If that were to happen now, we'd only see this via the errors piling up in RESTBase, or as a degradation of performance.

In short, a TCP port check of 9042 would be better; a custom check that ran a simple CQL query (for example: select host_id from system.local limit 1) would be better still.
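
A minimal sketch of the two probes described above, not the check that was eventually deployed. It assumes Node.js; `probeTcp`, `probeCql`, and `toPluginResult` are hypothetical names, and `probeCql` assumes a client object with promise-returning `connect()`, `execute()`, and `shutdown()` methods, as the DataStax `cassandra-driver` `Client` provides. Exit codes 0 and 2 follow the usual Nagios/Icinga plugin convention for OK and CRITICAL.

```javascript
const net = require('net');

// Plugin exit codes (Nagios/Icinga convention).
const OK = 0;
const CRITICAL = 2;

// Map a probe outcome to a status line and exit code.
// `err` is null on success, otherwise an Error describing the failure.
function toPluginResult(what, err) {
  if (err === null) {
    return { code: OK, line: `OK: ${what}` };
  }
  return { code: CRITICAL, line: `CRITICAL: ${what}: ${err.message}` };
}

// Shallow probe: can a TCP connection be opened at all?
// Resolves with null on success or with an Error on failure.
function probeTcp(host, port, timeoutMs) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (err) => { socket.destroy(); resolve(err); };
    socket.setTimeout(timeoutMs, () => finish(new Error('connection timeout')));
    socket.once('connect', () => finish(null));
    socket.once('error', (err) => finish(err));
  });
}

// Deeper probe: does the node actually answer a query?
// `client` is an assumption: any object with promise-returning
// connect(), execute(query) and shutdown() methods.
async function probeCql(client) {
  try {
    await client.connect();
    await client.execute('SELECT host_id FROM system.local LIMIT 1');
    return null;
  } catch (err) {
    return err;
  } finally {
    await client.shutdown();
  }
}

// Usage (not run here):
//   probeTcp('localhost', 9042, 5000).then((err) => {
//     const r = toPluginResult('CQL port localhost:9042', err);
//     console.log(r.line);
//     process.exitCode = r.code;
//   });
```

The TCP probe only shows that something is listening on the port; the CQL probe covers the case from the description where the process is present but cannot answer a query.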

Details

	Subject	Repo	Branch	Lines +/-
	cassandra: multi-instance aware CQL checks	operations/puppet	production	+7 -7
	cassandra: add monitoring for CQL interface	operations/puppet	production	+6 -0

Related Objects

Status	Assigned	Task
Resolved	fgiunchedi	T78514 Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts
Resolved	fgiunchedi	T93886 Cassandra/CQL query interface monitoring

Event Timeline

Eevans created this task. Mar 25 2015, 2:31 PM

Eevans raised the priority of this task from to Needs Triage.

Eevans updated the task description.

Eevans added projects: RESTBase, RESTBase-Cassandra, acl*sre-team.

Eevans added subscribers: Eevans, fgiunchedi, mobrovac, GWicke.

Restricted Application added a subscriber: Aklapper. Mar 25 2015, 2:31 PM

Eevans added a parent task: T78514: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts. Mar 25 2015, 2:35 PM

Eevans updated the task description. Mar 25 2015, 2:43 PM

Eevans set Security to None.

Change 200242 had a related patch set uploaded (by Dzahn):
cassandra: add monitoring for CQL interface

https://gerrit.wikimedia.org/r/200242

gerritbot added a project: Patch-For-Review. Mar 27 2015, 9:54 PM

Change 200242 merged by Dzahn:
cassandra: add monitoring for CQL interface

https://gerrit.wikimedia.org/r/200242

Dzahn mentioned this in rOPUP5dd979f68f53: cassandra: add monitoring for CQL interface. Mar 27 2015, 10:01 PM

added TCP port check to Icinga:

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=cql

Thanks @Dzahn! I'm tentatively closing this as resolved; we can revisit the deeper CQL query check if this isn't working well enough.

Eevans reopened this task as Open. Oct 8 2015, 5:50 PM

Restricted Application added a subscriber: Matanya. Oct 8 2015, 5:50 PM

Reopening to add a CQL-based check (since how we do Icinga service monitoring needs to be revisited in the wake of T95253: Finish conversion to multiple Cassandra instances per hardware node).

MVP here: https://github.com/eevans/icinga-checkcql, comments welcome.

In T93886#1714316, @Eevans wrote:

MVP here: https://github.com/eevans/icinga-checkcql, comments welcome.

+1, nice work @Eevans. One minor nit: our icinga checks do not usually contain the pipe between the status text and the description in the output message.

In T93886#1714316, @Eevans wrote:

MVP here: https://github.com/eevans/icinga-checkcql, comments welcome.

LGTM overall! Some points:

error messages should include a description of what went wrong
as @mobrovac pointed out above, "performance data" in icinga parlance is optional (IIRC we're not using it)
since there are dependencies to be included, deployment will need some thought

In T93886#1714965, @mobrovac wrote:

In T93886#1714316, @Eevans wrote:

MVP here: https://github.com/eevans/icinga-checkcql, comments welcome.

+1, nice work @Eevans. One minor nit: our icinga checks do not usually contain the pipe between the status text and the description in the output message.

According to the docs, the | separates the status text from performance data. @fgiunchedi has said we don't make use of that (I wasn't sure), but didn't think it could hurt either way.
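
As a rough illustration of the convention described here: in plugin output, text before the first `|` is the status text and text after it is performance data. The helper name `splitPluginOutput` is hypothetical, and this sketch only looks at the first line of output.

```javascript
// Split the first line of plugin output at the first pipe.
// Returns the status text and the (possibly empty) performance data.
function splitPluginOutput(output) {
  const firstLine = output.split('\n')[0];
  const idx = firstLine.indexOf('|');
  if (idx === -1) {
    // No pipe: the whole line is status text.
    return { text: firstLine.trim(), perfdata: '' };
  }
  return {
    text: firstLine.slice(0, idx).trim(),
    perfdata: firstLine.slice(idx + 1).trim(),
  };
}
```

A consumer that ignores performance data would simply use `text` and discard the rest.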

In T93886#1715162, @fgiunchedi wrote:

LGTM overall! some points:

error messages should include a description of what went wrong

Good point; I'll take care of that.

as @mobrovac pointed out above, "performance data" in icinga parlance is optional (IIRC we're not using it)

Do you think I should remove it?

since there are dependencies to be included deployment will need some thought

Yeah. :(

Eevans mentioned this in T108306: better cassandra process checks. Oct 9 2015, 3:24 PM

In T93886#1715162, @fgiunchedi wrote:

error messages should include a description of what went wrong

How about this? https://github.com/eevans/icinga-checkcql/commit/4ff15cd6cd334f5822ffeaf347b9580a1abcddc0

$ ./check.js -H localhost
CRITICAL | connect=NaN; execute=NaN;
connect(): Error: All host(s) tried for query failed. First host tried, localhost:9042: Error: Connection timeout. See innerErrors.
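
For readers unfamiliar with this output shape, here is a hedged sketch of how a check might assemble it: a status word, the timings after the pipe, and the error description on a following line. `formatOutput` is a hypothetical helper, not the code in the linked commit.

```javascript
// Build output in the shape shown above: "STATUS | label=value; ..."
// followed by an optional detail line describing what went wrong.
function formatOutput(status, timings, detail) {
  const perf = Object.entries(timings)
    .map(([label, value]) => `${label}=${value};`)
    .join(' ');
  const first = `${status} | ${perf}`;
  return detail ? `${first}\n${detail}` : first;
}

// Example: timings that were never measured stay NaN, as in the output above.
console.log(formatOutput(
  'CRITICAL',
  { connect: NaN, execute: NaN },
  'connect(): Error: Connection timeout.'
));
```

Keeping the description on its own line leaves the first line short for status views while still reporting the cause.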

Change 250439 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: multi-instance aware CQL checks

https://gerrit.wikimedia.org/r/250439

Change 250439 merged by Filippo Giunchedi:
cassandra: multi-instance aware CQL checks

https://gerrit.wikimedia.org/r/250439

In T93886#1716348, @Eevans wrote:

In T93886#1715162, @fgiunchedi wrote:

error messages should include a description of what went wrong

How about this? https://github.com/eevans/icinga-checkcql/commit/4ff15cd6cd334f5822ffeaf347b9580a1abcddc0

$ ./check.js -H localhost
CRITICAL | connect=NaN; execute=NaN;
connect(): Error: All host(s) tried for query failed. First host tried, localhost:9042: Error: Connection timeout. See innerErrors.

@Eevans and I have chatted about this a bit and decided that deepening the check from TCP-only to CQL is out of scope for now. The TCP check became multi-instance aware in https://gerrit.wikimedia.org/r/250439, thus resolving.
