
Fix CirrusSearch monitoring
Closed, Resolved, Public

Description

July 18, 2014 16:00
------------------------------------------------------------------------------
Service Critical [2014-07-18 16:36:10] SERVICE ALERT: fluorine;Slow CirrusSearch query rate;CRITICAL;SOFT;1;CirrusSearch-slow.log_line_rate CRITICAL: 0.09
Service Ok [2014-07-18 13:36:11] SERVICE ALERT: fluorine;Slow CirrusSearch query rate;OK;SOFT;2;CirrusSearch-slow.log_line_rate OKAY: 0.0
Service Critical [2014-07-18 13:31:11] SERVICE ALERT: fluorine;Slow CirrusSearch query rate;CRITICAL;SOFT;1;CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333
July 18, 2014 11:00
------------------------------------------------------------------------------
Service Ok [2014-07-18 11:51:11] SERVICE ALERT: fluorine;Slow CirrusSearch query rate;OK;SOFT;2;CirrusSearch-slow.log_line_rate OKAY: 0.0
Service Critical [2014-07-18 11:46:11] SERVICE ALERT: fluorine;Slow CirrusSearch query rate;CRITICAL;SOFT;1;CirrusSearch-slow.log_line_rate CRITICAL: 0.00666666666667
Service Ok [2014-07-18 11:26:11] SERVICE ALERT: fluorine;Slow CirrusSearch query rate;OK;SOFT;2;CirrusSearch-slow.log_line_rate OKAY: 0.0
--
This service is flapping all the time... adjust the monitoring, or actually make it faster?

Refers To:
{T83985}

Event Timeline

rtimport raised the priority of this task from to Medium. Dec 18 2014, 1:58 AM
rtimport added a project: ops-core.
rtimport set Reference to rt7924.

Reference to ticket #7662 added by dzahn

Reference to ticket #7779 added by dzahn

That seems a bit more sensitive than it ought to be...

Status changed from 'new' to 'open' by RT_System

What should we change it to?
https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/logging.pp#L117

On Fri Sep 12 17:16:34 2014, aotto wrote:

What should we change it to?

https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/logging.pp#L117

Hmmm - maybe .01 for warn and .1 for critical. Right now it'll complain if we
have a couple of slow queries a minute.
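
For a sense of scale, here is a rough back-of-envelope (a sketch only; it assumes log_line_rate is expressed in log lines per second, which is how the alert values above read):

```python
# Rough conversion of log_line_rate thresholds into slow queries per minute.
# Assumption: log_line_rate is measured in log lines per second, as suggested
# by the alert values above (0.0033/s ~= one slow query every 5 minutes).

def per_minute(rate_per_second: float) -> float:
    return rate_per_second * 60

for label, threshold in [("current critical trip (example)", 0.00333333333333),
                         ("proposed warn", 0.01),
                         ("proposed critical", 0.1)]:
    print(f"{label}: {threshold}/s ~= {per_minute(threshold):.1f} slow queries/min")

# current critical trip (example): 0.00333333333333/s ~= 0.2 slow queries/min
# proposed warn: 0.01/s ~= 0.6 slow queries/min
# proposed critical: 0.1/s ~= 6.0 slow queries/min
```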

fgiunchedi changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)". Apr 21 2015, 1:23 PM
fgiunchedi changed the edit policy from "WMF-NDA (Project)" to "All Users".
fgiunchedi set Security to None.

Change 205603 had a related patch set uploaded (by Filippo Giunchedi):
logging: update CirrusSearch thresholds

https://gerrit.wikimedia.org/r/205603

Change 205603 merged by Filippo Giunchedi:
logging: update CirrusSearch thresholds

https://gerrit.wikimedia.org/r/205603

change merged, resolving for now. will reopen if it crops up again

Reopening: the warning still comes up in Icinga because we show both SOFT and HARD states. Not sure how best to fix that; ideally only HARD states would be shown.

SOFT states do not send any notifications though. So it would seem ok to me to call it resolved.

If we want to avoid even seeing them in logs/history, there is "log_service_retries=1", but I am not sure we can apply that to just this one service. It uses monitoring::ganglia and that doesn't have a parameter for it.

Yeah, that's true. I think the confusion comes from the fact that notifications are not sent, but the alarm still shows up when looking for critical/warning/unknown alerts in Icinga's default view (that is, both HARD and SOFT states).
It would be better for the alarm not to go into SOFT state at all. The graphite check, for example, can be told to consider only a given window of time, so maybe that's another solution too.
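
For illustration, a minimal sketch of that windowed idea against Graphite's render API. Everything here is hypothetical: the host, the metric path and the thresholds are placeholders, and this is not the actual check_graphite plugin from operations-puppet. The point is that averaging over a window means a single noisy datapoint does not flip the check into a SOFT CRITICAL state.

```python
#!/usr/bin/env python3
"""Illustrative windowed check against Graphite's render API (not the real plugin)."""
import json
import sys
import urllib.request

GRAPHITE = "https://graphite.example.org"                       # placeholder host
METRIC = "logster.fluorine.CirrusSearch-slow.log_line_rate"     # hypothetical metric path
WINDOW = "-10min"
WARN, CRIT = 0.01, 0.1

url = f"{GRAPHITE}/render?target={METRIC}&from={WINDOW}&format=json"
with urllib.request.urlopen(url) as resp:
    series = json.load(resp)

# Graphite returns [{"target": ..., "datapoints": [[value, timestamp], ...]}]
values = []
if series:
    values = [v for v, _ts in series[0]["datapoints"] if v is not None]
avg = sum(values) / len(values) if values else 0.0

if avg >= CRIT:
    print(f"CRITICAL: {avg:.4f} slow queries/s averaged over {WINDOW}")
    sys.exit(2)
elif avg >= WARN:
    print(f"WARNING: {avg:.4f} slow queries/s averaged over {WINDOW}")
    sys.exit(1)
print(f"OK: {avg:.4f} slow queries/s averaged over {WINDOW}")
sys.exit(0)
```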

faidon renamed this task from "adjust CirrusSearch monitoring" to "Fix CirrusSearch monitoring". Jun 10 2015, 12:40 PM
faidon reassigned this task from fgiunchedi to Manybubbles.
faidon raised the priority of this task from Medium to High.
faidon removed a project: Patch-For-Review.
faidon subscribed.

@Manybubbles, we have been getting "Slow CirrusSearch query rate" alerts for many months now (I've pinged you on IRC before about that). These are beyond the point where they are actionable — everyone ignores them. Let's fix the alert or remove it, please :)

Manybubbles moved this task from Needs triage to Search on the Discovery-ARCHIVED board.

I've pulled this into the Discovery team's backlog so it can be ranked against other stuff we need to do. I've moved it pretty close to the top.

Stakeholders: Operations/Cirrus Operators
Benefits: This alert is currently useless - it's just ignored. It's supposed to warn us when we get too many slow queries. And maybe it is and we're just ignoring it - or maybe it's crying wolf. We need to understand how it works so we can decide.
Estimate: A couple of days - but this is a rough estimate.

None. It's still on the list, but the team has been concentrating on other things that have yet to finish. If you want a quick fix, I'll +1 disabling this check and leaving this ticket to discuss re-enabling it.

I scheduled a downtime for this service of 1 month with a link to this ticket.

That means it won't send notifications (mail, IRC), but it is not completely disabled, and in one month it will become active again as a reminder.

I note this ticket has been placed "up for grabs" though.

I didn't realize this already had a task. I talked with @EBernhardson about it a bit, and there may be some foreshadowing of issues to come.

ebernhardson: i think it might be a sign of worse things to come. choosing one of the days with more failures (aug 29), ~1% of queries were logged as slow
ebernhardson: while we are alerting at .004% :)
chasemp: that seems significant then
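
To make that mismatch concrete, a quick back-of-envelope (the total query rate below is an illustrative assumption, not a number from this ticket):

```python
# Back-of-envelope for the ".004%" figure above -- purely illustrative.
# Assumption: a total CirrusSearch query rate of ~250 queries/s; the real
# number isn't in this ticket, it's only here to show how a tiny absolute
# threshold translates into a minuscule fraction of traffic.

warn_threshold = 0.01          # slow-log lines per second (warning level)
assumed_total_qps = 250.0      # hypothetical overall query rate

fraction = warn_threshold / assumed_total_qps
print(f"warning threshold ~= {fraction:.6%} of queries")        # ~0.004000%

observed_slow_fraction = 0.01  # "~1% of queries were logged as slow" (Aug 29)
print(f"observed slow fraction was ~{observed_slow_fraction:.0%},"
      f" i.e. ~{observed_slow_fraction / fraction:.0f}x the alert threshold")
```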

I scheduled a downtime for this service of 1 month with a link to this ticket.

Since that's over, it's back as a WARNING for now:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=fluorine&service=Slow+CirrusSearch+query+rate

Poked Erik a bit, who will chime in; sounds like it will sit for a while and is not a fire.

I've gone back to review the size of the logs over the last month; it looks to have settled down from some of the worst cases back on Aug 29. Figuring out what constitutes a problem and what is just normal is very hard, as we intentionally support regex queries that take up to 20s to execute (whether that's a good idea might be for another ticket).

I'd like to pass the monitoring portion of this on as one of the first tasks for the Discovery team's new ops engineer. We haven't made a hire yet, but the process is moving along. I think a task like this will be a good way to get them involved, looking around the Elasticsearch cluster and understanding what is going on with our systems.

Change 251948 had a related patch set uploaded (by Faidon Liambotis):
Kill CirrusSearch-slow-queries alert

https://gerrit.wikimedia.org/r/251948

Change 251948 merged by Faidon Liambotis:
Kill CirrusSearch-slow-queries alert

https://gerrit.wikimedia.org/r/251948

Deskana lowered the priority of this task from High to Low. Nov 24 2015, 6:18 PM
Deskana subscribed.

Lowering priority to reflect the reality of the team's prioritisation.

lmata claimed this task.
lmata edited projects, added Observability-Alerting; removed observability.
lmata moved this task from Inbox to Radar on the Observability-Alerting board.
lmata added subscribers: RKemper, lmata.

Closing this task on the assumption that the flapping has subsided, based on a cursory look at Icinga. Please reopen if this is still an issue. cc/ @Gehel @RKemper