
Check Icinga alert on CirrusSearch response time
Closed, ResolvedPublic

Description

During the recent Elasticsearch outage, response times climbed to unreasonable levels, but no alert was raised. The alert is probably implemented as a Graphite check on a metric that has since been renamed. More details are on the parent task.

Event Timeline

Gehel created this task. May 10 2016, 9:10 AM
Dzahn added a comment. May 10 2016, 2:25 PM

My 2 cents from past experience: Icinga checks via Graphite always sound good in theory, but often end up having issues like this one or similar. The setup can become a bit too complex, leaving a lot of room for failure.

The current alert is also against prefix search, which has very low volume now. We should probably switch to comp suggest or the aggregate all-queries metric. After looking at the graphs of yesterday's error, the p75 or p95 may be better indicators of a meltdown.
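To illustrate why a higher percentile can surface a meltdown that the median hides, here is a minimal sketch. The latency numbers are made up for illustration and are not taken from the outage graphs; `percentile` is a hypothetical helper using the nearest-rank method, not whatever Graphite or the MediaWiki stats pipeline actually computes.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, clamped to at least 1.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 70% of requests are fast, 30% hit a struggling node.
latencies_ms = [40] * 70 + [2000] * 30

p50 = percentile(latencies_ms, 50)
p75 = percentile(latencies_ms, 75)
p95 = percentile(latencies_ms, 95)
print(p50, p75, p95)
```

With these made-up numbers the median stays at the healthy value while p75 and p95 both land in the slow tail, which is the effect described above.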

To combat renamed or missing metrics, we could add an alert on the same metric for 0/null values, but I'm not sure whether that just makes things more complicated for little gain.
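A rough sketch of what such a guard could look like, assuming the check already has the datapoints from Graphite's JSON render output (a list of `[value, timestamp]` pairs where the value may be null). The function name `evaluate` and the thresholds are hypothetical; this is not the existing check_graphite plugin.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(datapoints, warn, crit):
    """Return (status, message) for a series of [value, timestamp] pairs.

    A series that is empty or contains only nulls is reported as UNKNOWN
    instead of OK, so a renamed or missing metric does not stay silently green.
    """
    values = [value for value, _timestamp in datapoints if value is not None]
    if not values:
        return UNKNOWN, "no data: metric may have been renamed or is missing"
    latest = values[-1]
    if latest >= crit:
        return CRITICAL, f"latest value {latest} >= {crit}"
    if latest >= warn:
        return WARNING, f"latest value {latest} >= {warn}"
    return OK, f"latest value {latest}"


# Hypothetical series: a metric that stopped reporting, and a healthy one.
renamed = [[None, 1462870800], [None, 1462870860]]
healthy = [[120.0, 1462870800], [None, 1462870860]]
print(evaluate(renamed, warn=500, crit=1000))
print(evaluate(healthy, warn=500, crit=1000))
```

Folding the null case into the same check keeps it to one alert per metric rather than two, which may limit the extra complexity.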

What could we do instead of Graphite? Hitting search endpoints won't help, as one machine having errors might be masked by the rest of the cluster under the low query volume of a check. We could perhaps have Diamond do the alerting directly (although we wouldn't have client-side metrics like p75), but there is currently no state held between checks.

Change 290262 had a related patch set uploaded (by EBernhardson):
Point CirrusSearch alerting at more useful metrics

https://gerrit.wikimedia.org/r/290262

Change 290262 merged by Gehel:
Point CirrusSearch alerting at more useful metrics

https://gerrit.wikimedia.org/r/290262

Gehel added a comment. Edited May 24 2016, 9:28 AM

Change 290262 has been merged, but graphite1001 (which should export that check) has puppet disabled, pending investigation into Graphite losing metrics. The new alerts will be available once this is resolved.

I've reenabled puppet on graphite1001, the check should get exported soon!

Gehel added a comment. May 24 2016, 1:50 PM

I can confirm that the checks are now visible (and green) on graphite1001. Time will tell whether they are actually useful.

debt closed this task as Resolved. Jun 8 2016, 12:33 AM
debt added a subscriber: debt.

Looks like this is resolved - closing.