During a recent Elasticsearch outage, response times climbed to unreasonable levels, but no alert was raised. The alert is probably implemented as a Graphite check on a metric that has since been renamed. More details on the parent task.
Related changes and tasks:
- operations/puppet (production): Point CirrusSearch alerting at more useful metrics
- T134829 (Resolved, debt): Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC
- T134852 (Resolved, EBernhardson): Check Icinga alert on CirrusSearch response time
My 2 cents from past experience: Icinga checks via Graphite always sound good in theory, but they often end up with issues like this one. The setup can become complex enough that there is a lot of room for failure.
The current alert is also against prefix search, which has very low volume now. We should probably switch to the completion suggester metric or to the aggregate over all queries. Looking at the graphs from yesterday's incident, the p75 or p95 response times may be better indicators of a meltdown.
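A minimal sketch of what a percentile-based check could look like, operating on Graphite's JSON render-API output. The metric path and the warning/critical thresholds below are made-up placeholders, not the real CirrusSearch metric names:

```python
import json

# Hypothetical Graphite render-API response for a p95 latency metric
# (metric path and values are illustrative only).
sample = json.loads("""
[{"target": "MediaWiki.CirrusSearch.requestTime.p95",
  "datapoints": [[310.0, 1463000000], [295.0, 1463000060],
                 [null, 1463000120], [340.0, 1463000180]]}]
""")

WARN_MS = 250.0   # assumed warning threshold
CRIT_MS = 500.0   # assumed critical threshold

def latest_value(series):
    """Return the most recent non-null datapoint, or None if the
    whole window is empty (Graphite pads gaps with nulls)."""
    for value, _ts in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None

def check(series):
    """Map the latest value to a Nagios-style status string."""
    value = latest_value(series)
    if value is None:
        return "UNKNOWN: no data"
    if value >= CRIT_MS:
        return "CRITICAL: p95=%.0fms" % value
    if value >= WARN_MS:
        return "WARNING: p95=%.0fms" % value
    return "OK: p95=%.0fms" % value

print(check(sample[0]))  # the 340ms datapoint trips the warning threshold
```

Skipping trailing nulls matters here: Graphite often has not yet received the most recent interval, and treating that gap as zero would either mask a spike or fire a bogus alert.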
To catch renamed or missing metrics, we could add a second alert on the same metric that fires on 0/null values, but I'm not sure whether that just makes things more complicated for little gain.
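A sketch of what that null-metric guard could look like, again over Graphite's JSON render output; the metric path is hypothetical. A renamed metric keeps resolving under its old path but returns only nulls, which is exactly the failure mode that hid the outage:

```python
import json

# Hypothetical render-API output for a metric that stopped reporting
# after a rename (the metric path is illustrative only).
stale = json.loads("""
[{"target": "MediaWiki.CirrusSearch.requestTime.p95",
  "datapoints": [[null, 1463000000], [null, 1463000060],
                 [null, 1463000120]]}]
""")

def is_stale(series):
    """True when every datapoint in the window is null or zero,
    i.e. the metric was likely renamed or is no longer emitted."""
    return all(v is None or v == 0 for v, _ts in series["datapoints"])

print(is_stale(stale[0]))  # all-null window: the guard alert would fire
```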
What could we do instead of Graphite? Hitting the search endpoints directly won't help: errors on one machine can be masked by the rest of the cluster, given the low query volume of a check. We could perhaps have Diamond do alerting directly (although we wouldn't have client-side metrics like the p75), but there is currently no state held between checks.
Change 290262 has been merged, but graphite1001 (which should export that check) has puppet disabled, pending investigation into Graphite losing metrics. The new alerts will be available once this is resolved.