
Fix CirrusSearch monitoring
Closed, Resolved, Public

Description

July 18, 2014 16:00
------------------------------------------------------------------------------
Service Critical [2014-07-18 16:36:10] SERVICE ALERT: fluorine;Slow CirrusSearch query rate;CRITICAL;SOFT;1;CirrusSearch-slow.log_line_rate CRITICAL: 0.09
Service Ok [2014-07-18 13:36:11] SERVICE ALERT: fluorine;Slow CirrusSearch query rate;OK;SOFT;2;CirrusSearch-slow.log_line_rate OKAY: 0.0
Service Critical [2014-07-18 13:31:11] SERVICE ALERT: fluorine;Slow CirrusSearch query rate;CRITICAL;SOFT;1;CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333
July 18, 2014 11:00
------------------------------------------------------------------------------
Service Ok [2014-07-18 11:51:11] SERVICE ALERT: fluorine;Slow CirrusSearch query rate;OK;SOFT;2;CirrusSearch-slow.log_line_rate OKAY: 0.0
Service Critical [2014-07-18 11:46:11] SERVICE ALERT: fluorine;Slow CirrusSearch query rate;CRITICAL;SOFT;1;CirrusSearch-slow.log_line_rate CRITICAL: 0.00666666666667
Service Ok [2014-07-18 11:26:11] SERVICE ALERT: fluorine;Slow CirrusSearch query rate;OK;SOFT;2;CirrusSearch-slow.log_line_rate OKAY: 0.0
--
This service is flapping all the time... adjust the monitoring, or actually make it faster?

Refers To:
{T83985}

Event Timeline

rtimport raised the priority of this task from to Medium. Dec 18 2014, 1:58 AM
rtimport added a project: ops-core.
rtimport set Reference to rt7924.

Reference to ticket #7662 added by dzahn

Reference to ticket #7779 added by dzahn

That seems a bit more sensitive than it ought to be...

Status changed from 'new' to 'open' by RT_System

What should we change it to?
https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/logging.pp#L117

On Fri Sep 12 17:16:34 2014, aotto wrote:

What should we change it to?

https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/logging.pp#L117

Hmmm - maybe .01 for warn and .1 for critical. Right now it'll complain if we
have a couple of slow queries a minute.
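
For a sense of scale, here is a rough back-of-envelope (a sketch only; it assumes log_line_rate is expressed in log lines per second, which is how the alert values above read):

```python
# Rough conversion of log_line_rate thresholds into slow queries per minute.
# Assumption: log_line_rate is measured in log lines per second, as suggested
# by the alert values above (0.0033/s ~= one slow query every 5 minutes).

def per_minute(rate_per_second: float) -> float:
    return rate_per_second * 60

for label, threshold in [("current critical trip (example)", 0.00333333333333),
                         ("proposed warn", 0.01),
                         ("proposed critical", 0.1)]:
    print(f"{label}: {threshold}/s ~= {per_minute(threshold):.1f} slow queries/min")

# current critical trip (example): 0.00333333333333/s ~= 0.2 slow queries/min
# proposed warn: 0.01/s ~= 0.6 slow queries/min
# proposed critical: 0.1/s ~= 6.0 slow queries/min
```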

fgiunchedi changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)". Apr 21 2015, 1:23 PM
fgiunchedi changed the edit policy from "WMF-NDA (Project)" to "All Users".
fgiunchedi set Security to None.

Change 205603 had a related patch set uploaded (by Filippo Giunchedi):
logging: update CirrusSearch thresholds

https://gerrit.wikimedia.org/r/205603

Change 205603 merged by Filippo Giunchedi:
logging: update CirrusSearch thresholds

https://gerrit.wikimedia.org/r/205603

change merged, resolving for now. will reopen if it crops up again

Reopening: the warning still comes up in Icinga because we show both SOFT and HARD states. Not sure how best to fix that; ideally only HARD states would be shown.

SOFT states do not send any notifications though. So it would seem ok to me to call it resolved.

If we want to avoid even seeing them in logs/history, there is "log_service_retries=1", but I am not sure we can apply that to just this one service. It uses monitoring::ganglia and that doesn't have a parameter for it.

Yeah, that's true. I think the confusion comes from the fact that notifications are not sent, but the alarm still shows up when looking for critical/warning/unknown alerts in Icinga's default view (that is, both HARD and SOFT states).
It would be better for the alarm not to go into SOFT state at all. The graphite check, for example, can be told to consider only a given window of time, so maybe that's another solution too.
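
For illustration, a minimal sketch of that windowed idea against Graphite's render API. Everything here is hypothetical: the host, the metric path and the thresholds are placeholders, and this is not the actual check_graphite plugin from operations-puppet. The point is that averaging over a window means a single noisy datapoint does not flip the check into a SOFT CRITICAL state.

```python
#!/usr/bin/env python3
"""Illustrative windowed check against Graphite's render API (not the real plugin)."""
import json
import sys
import urllib.request

GRAPHITE = "https://graphite.example.org"                       # placeholder host
METRIC = "logster.fluorine.CirrusSearch-slow.log_line_rate"     # hypothetical metric path
WINDOW = "-10min"
WARN, CRIT = 0.01, 0.1

url = f"{GRAPHITE}/render?target={METRIC}&from={WINDOW}&format=json"
with urllib.request.urlopen(url) as resp:
    series = json.load(resp)

# Graphite returns [{"target": ..., "datapoints": [[value, timestamp], ...]}]
values = []
if series:
    values = [v for v, _ts in series[0]["datapoints"] if v is not None]
avg = sum(values) / len(values) if values else 0.0

if avg >= CRIT:
    print(f"CRITICAL: {avg:.4f} slow queries/s averaged over {WINDOW}")
    sys.exit(2)
elif avg >= WARN:
    print(f"WARNING: {avg:.4f} slow queries/s averaged over {WINDOW}")
    sys.exit(1)
print(f"OK: {avg:.4f} slow queries/s averaged over {WINDOW}")
sys.exit(0)
```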

faidon renamed this task from "adjust CirrusSearch monitoring" to "Fix CirrusSearch monitoring". Jun 10 2015, 12:40 PM
faidon reassigned this task from fgiunchedi to Manybubbles.
faidon raised the priority of this task from Medium to High.
faidon removed a project: Patch-For-Review.
faidon subscribed.

@Manybubbles, we have been getting "Slow CirrusSearch query rate" alerts for many months now (I've pinged you on IRC before about that). These are beyond the point where they are actionable — everyone ignores them. Let's fix the alert or remove it, please :)

Manybubbles moved this task from Needs triage to Search on the Discovery-ARCHIVED board.

I've pulled this into the Discovery team's backlog so it can be ranked against other stuff we need to do. I've moved it pretty close to the top.

Stakeholders: Operations/Cirrus Operators
Benefits: This alert is currently useless - it's just ignored. It's supposed to warn us when we get too many slow queries. And maybe it is and we're just ignoring it - or maybe it's crying wolf. We need to understand how it works so we can decide.
Estimate: A couple of days - but this is a rough estimate.

None. It's still on the list, but the team has been concentrating on other things that have yet to finish. If you want a quick fix, I'll +1 disabling this check and leaving this ticket to discuss re-enabling it.

I scheduled a downtime for this service of 1 month with a link to this ticket.

That means it won't send notifications (mail, IRC), but it is not completely disabled, and in one month it will become active again as a reminder.

I note this ticket has been placed "up for grabs" though.

I didn't realize this already had a task. I talked with @EBernhardson about it a bit, and there may be some foreshadowing of issues to come.

ebernhardson: i think it might be a sign of worse things to come. choosing one of the days with more failures (aug 29), ~1% of queries were logged as slow
ebernhardson: while we are alerting at .004% :)
chasemp: that seems significant then
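
To make that mismatch concrete, a quick back-of-envelope (the total query rate below is an illustrative assumption, not a number from this ticket):

```python
# Back-of-envelope for the ".004%" figure above -- purely illustrative.
# Assumption: a total CirrusSearch query rate of ~250 queries/s; the real
# number isn't in this ticket, it's only here to show how a tiny absolute
# threshold translates into a minuscule fraction of traffic.

warn_threshold = 0.01          # slow-log lines per second (warning level)
assumed_total_qps = 250.0      # hypothetical overall query rate

fraction = warn_threshold / assumed_total_qps
print(f"warning threshold ~= {fraction:.6%} of queries")        # ~0.004000%

observed_slow_fraction = 0.01  # "~1% of queries were logged as slow" (Aug 29)
print(f"observed slow fraction was ~{observed_slow_fraction:.0%},"
      f" i.e. ~{observed_slow_fraction / fraction:.0f}x the alert threshold")
```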

I scheduled a downtime for this service of 1 month with a link to this ticket.

Since that's over, it's back as a WARNING for now:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=fluorine&service=Slow+CirrusSearch+query+rate

Poked Erik a bit, who will chime in; sounds like it will sit for a while and is not a fire.

I've gone back to review the size of the logs over the last month; it looks to have settled down from some of the worst cases back on Aug 29. Figuring out what constitutes a problem and what is just normal is very hard, as we intentionally support regex queries that take up to 20s to execute (whether that's a good idea might be for another ticket).

I'd like to pass the monitoring portion of this on as one of the first tasks for the Discovery team's new ops engineer. We haven't made a hire yet, but the process is moving along. I think a task like this will be a good way to get them involved, looking around the Elasticsearch cluster and understanding what is going on with our systems.

Change 251948 had a related patch set uploaded (by Faidon Liambotis):
Kill CirrusSearch-slow-queries alert

https://gerrit.wikimedia.org/r/251948

Change 251948 merged by Faidon Liambotis:
Kill CirrusSearch-slow-queries alert

https://gerrit.wikimedia.org/r/251948

Deskana lowered the priority of this task from High to Low. Nov 24 2015, 6:18 PM
Deskana subscribed.

Lowering priority to reflect the reality of the team's prioritisation.

lmata claimed this task.
lmata edited projects, added Observability-Alerting; removed observability.
lmata moved this task from Inbox to Radar on the Observability-Alerting board.
lmata added subscribers: RKemper, lmata.

Closing this task on the assumption that the flapping has subsided, based on a cursory look at Icinga. Please reopen if this is still an issue. cc/ @Gehel @RKemper