We should set up new Shinken checks in beta that mirror the Icinga error-rate alerts in production. This would hopefully catch more errors before they reach production.
Change 304263 had a related patch set uploaded (by Thcipriani):
Labs: Shinken alert for beta error rate
Actually there's an "Improvements" column in Wikimedia-production-error for exactly this sort of thing :)
That is nice! https://gerrit.wikimedia.org/r/304263 triggered an alarm overnight on the betacluster-alerts list: https://lists.wikimedia.org/pipermail/betacluster-alerts/2016-November/023912.html
Subject: [Betacluster-alerts] PROBLEM alert - Graphite Labs/Mediawiki Error Rate is CRITICAL
Notification Type: PROBLEM
Service: Mediawiki Error Rate
Host: Graphite Labs
Address: graphite-labs.wikimedia.org
State: CRITICAL
Date/Time: Thu 24 Nov 00:07:02 UTC 2016
Additional Info:
Graphite shows an error rate of 0 for the last 24 hours, so the alarm should NOT have triggered.
I guess the check_graphite_threshold command fails on the Shinken host.
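One way to cross-check what Graphite reports, independently of the Shinken command, is to pull the series from Graphite's render API as JSON and compare it against thresholds by hand. The following is only a rough sketch: the helper names, the metric target and the warning/critical values are all hypothetical, and the classification rule (peak value in the window) is an assumption for illustration, not the logic of the actual check command.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_render_url(base_url, target, window="-24h"):
    """Build a Graphite render API URL that returns the series as JSON."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url}/render?{query}"


def classify(datapoints, warning, critical):
    """Classify [value, timestamp] pairs by their peak value.

    Graphite reports missing samples as null (None in Python); those are
    skipped. If nothing but nulls comes back, the state is UNKNOWN.
    """
    values = [value for value, _timestamp in datapoints if value is not None]
    if not values:
        return "UNKNOWN"
    peak = max(values)
    if peak >= critical:
        return "CRITICAL"
    if peak >= warning:
        return "WARNING"
    return "OK"


def check(base_url, target, warning, critical):
    """Fetch the series over HTTP and classify all returned datapoints."""
    with urlopen(build_render_url(base_url, target)) as response:
        series = json.load(response)
    points = [point for entry in series for point in entry["datapoints"]]
    return classify(points, warning, critical)
```

With this rule, a series that is 0 for the whole window classifies as OK, while a series containing only nulls classifies as UNKNOWN. Calling `check(...)` needs the real Graphite base URL and metric target, neither of which is given in this thread.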
Change 325122 had a related patch set uploaded (by Alex Monk):
Follow-up I863367b8, Ic9db0829: These two commits conflicted
Change 325122 merged by Dzahn:
Follow-up I863367b8, Ic9db0829: These two commits conflicted
http://shinken.wmflabs.org/service/graphite-labs/Mediawiki%20Error%20Rate shows it succeeding after I ran puppet on shinken and made it recheck