Cassandra/CQL query interface monitoring
Closed, Resolved (Public)

Description

We need availability monitoring of Cassandra that goes beyond monitoring the process. Though unlikely, it's possible for the process to be present, even though the node is unable to answer a query. If that were to happen now, we'd only see this via the errors piling up in RESTBase, or as a degradation of performance.

In short, a TCP port check of 9042 would be better; a custom check that ran a simple CQL query (for example: select host_id from system.local limit 1) would be better still.
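As a rough illustration of the first option, here is a minimal sketch of a TCP port check written as a Nagios/Icinga-style plugin function. The function name `check_tcp`, the 5-second default timeout, and the message wording are assumptions made for this example; this is not the check that was eventually deployed. The deeper CQL-query variant would additionally need a Cassandra client driver and is not shown.

```python
import socket

# Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def check_tcp(host, port=9042, timeout=5.0):
    """Return (exit_code, message) for a plain TCP connect to host:port."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK: {host}:{port} accepts TCP connections"
    except OSError as exc:
        # Covers refused connections, timeouts and name-resolution failures.
        return CRITICAL, f"CRITICAL: {host}:{port} unreachable: {exc}"
```

A wrapper script would print the message and exit with the returned code. Note that a successful connect only shows the port is open; it does not show that the node can answer a query, which is the gap the CQL-based check is meant to close.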

Event Timeline

Eevans raised the priority of this task to Needs Triage.
Eevans updated the task description.
Eevans set Security to None.

Change 200242 had a related patch set uploaded (by Dzahn):
cassandra: add monitoring for CQL interface

https://gerrit.wikimedia.org/r/200242

Change 200242 merged by Dzahn:
cassandra: add monitoring for CQL interface

https://gerrit.wikimedia.org/r/200242

fgiunchedi claimed this task.

Thanks, @Dzahn! I'm tentatively closing this as resolved; we can revisit the deeper CQL query check if this isn't working well enough.

Reopening to add a CQL-based check (since how we do Icinga service monitoring needs to be revisited in the wake of T95253: Finish conversion to multiple Cassandra instances per hardware node).

MVP here: https://github.com/eevans/icinga-checkcql, comments welcome.

+1, nice work @Eevans. One minor nit: our icinga checks do not usually contain the pipe between the status text and the description in the output message.

MVP here: https://github.com/eevans/icinga-checkcql, comments welcome.

LGTM overall! some points:

  • error messages should include a description of what went wrong
  • as @mobrovac pointed out above, "performance data" in icinga parlance is optional (IIRC we're not using it)
  • since there are dependencies to be included, deployment will need some thought

MVP here: https://github.com/eevans/icinga-checkcql, comments welcome.

+1, nice work @Eevans. One minor nit: our icinga checks do not usually contain the pipe between the status text and the description in the output message.

According to the docs, the | separates the status text from performance data. @fgiunchedi has said we don't make use of that (I wasn't sure), but didn't think it could hurt either way.
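To make the role of the pipe concrete, here is a small sketch of how a plugin might build its first output line, adding the `|` and performance data only when there is something to report. The helper name `format_plugin_output`, the `STATUS: text` layout, and the use of seconds (`s`) as the unit are assumptions for illustration, not the format used by check.js.

```python
import math


def format_plugin_output(status, text="", perfdata=None):
    """Build the first line of plugin output: 'STATUS: text | label=value ...'."""
    line = status if not text else f"{status}: {text}"
    # Keep only measurements that are real numbers; NaN means "not measured".
    items = [
        f"{label}={value:g}s"
        for label, value in (perfdata or {}).items()
        if not math.isnan(value)
    ]
    if items:
        # Everything after the pipe is parsed as performance data.
        line += " | " + " ".join(items)
    return line
```

With this shape, a check that never got as far as measuring anything prints no pipe at all, which would also satisfy the nit about the usual output style.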

LGTM overall! some points:

  • error messages should include a description of what went wrong

Good point; I'll take care of that.

  • as @mobrovac pointed out above, "performance data" in icinga parlance is optional (IIRC we're not using it)

Do you think I should remove it?

  • since there are dependencies to be included, deployment will need some thought

Yeah. :(

  • error messages should include a description of what went wrong

How about this? https://github.com/eevans/icinga-checkcql/commit/4ff15cd6cd334f5822ffeaf347b9580a1abcddc0

$ ./check.js -H localhost
CRITICAL | connect=NaN; execute=NaN;
connect(): Error: All host(s) tried for query failed. First host tried, localhost:9042: Error: Connection timeout. See innerErrors.
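The output above names the failing phase (`connect()`) alongside the underlying error. A minimal sketch of that pattern follows, assuming the check is split into named phases supplied as callables. `run_phases` and the phase callables are hypothetical stand-ins; the real check.js uses a Cassandra driver and may be structured differently.

```python
import time

# Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def run_phases(phases):
    """Run (name, callable) phases in order; stop at the first failure.

    Returns (exit_code, message, timings), where timings maps each completed
    phase name to its duration in seconds.
    """
    timings = {}
    for name, action in phases:
        started = time.monotonic()
        try:
            action()
        except Exception as exc:
            # Name the failing phase and keep the underlying error text.
            return CRITICAL, f"CRITICAL: {name}(): {exc}", timings
        timings[name] = time.monotonic() - started
    return OK, "OK: all phases completed", timings
```

Because a phase that fails never gets a timing entry, the caller can tell "not measured" apart from a real value instead of reporting NaN.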

Change 250439 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: multi-instance aware CQL checks

https://gerrit.wikimedia.org/r/250439

Change 250439 merged by Filippo Giunchedi:
cassandra: multi-instance aware CQL checks

https://gerrit.wikimedia.org/r/250439

@Eevans and I have chatted about this a bit and decided that deepening the check from TCP-only to CQL was out of scope for now. The TCP check became multi-instance aware in https://gerrit.wikimedia.org/r/250439, thus resolving.