
Shinken alert for beta error rate
Closed, ResolvedPublic

Description

We should set up new Shinken checks in beta that mirror the Icinga error-rate alerts in production. The hope is that this catches more errors before they reach production.

Details

Related Gerrit Patches:

Event Timeline

Restricted Application added a subscriber: Aklapper. Aug 1 2016, 4:35 PM
Luke081515 added a subscriber: Luke081515.
greg triaged this task as High priority. Aug 5 2016, 9:03 PM
greg moved this task from To Triage to Backlog on the Beta-Cluster-Infrastructure board.

Change 304263 had a related patch set uploaded (by Thcipriani):
Labs: Shinken alert for beta error rate

https://gerrit.wikimedia.org/r/304263

greg moved this task from On-going to Follow-up on the Wikimedia-Incident board.

The log errors project is for actual errors, not infrastructure to deal with them.

demon added a comment. Aug 14 2016, 1:11 AM

Actually there's an "Improvements" column in Wikimedia-production-error for exactly this sort of thing :)

Change 304263 merged by Dzahn:
Labs: Shinken alert for beta error rate

https://gerrit.wikimedia.org/r/304263

That is nice! https://gerrit.wikimedia.org/r/304263 triggered an alert overnight on the betacluster-alerts list: https://lists.wikimedia.org/pipermail/betacluster-alerts/2016-November/023912.html

Subject: [Betacluster-alerts] PROBLEM alert - Graphite Labs/Mediawiki Error Rate is CRITICAL
Notification Type: PROBLEM
Service: Mediawiki Error Rate
Host: Graphite Labs
Address: graphite-labs.wikimedia.org
State: CRITICAL
Date/Time: Thu 24 Nov 00:07:02 UTC 2016
Additional Info:

However, Graphite shows the error rate at 0 for the last 24 hours, so the alarm should NOT have triggered:

https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1479978264.585&from=-1days&target=transformNull(logstash.rate.mediawiki.exception.ERROR.rate%2C0)&target=transformNull(logstash.rate.mediawiki.fatal.ERROR.rate%2C0)

I guess the check_graphite_threshold command fails on the Shinken host.
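
For anyone who wants to repeat the comparison by hand, here is a rough sketch of pulling the same two metrics as JSON and classifying them against thresholds. It is not the actual check_graphite_threshold plugin. The `warning`, `critical`, and `percentage` values are hypothetical placeholders, and the "percentage of datapoints over the limit" rule is an assumption about how such a check typically behaves. Only the metric names and the Graphite host come from the URL above.

```python
from urllib.parse import urlencode

GRAPHITE_BASE = "https://graphite-labs.wikimedia.org/render/"

METRICS = [
    "logstash.rate.mediawiki.exception.ERROR.rate",
    "logstash.rate.mediawiki.fatal.ERROR.rate",
]


def build_render_url(targets, window="-1days"):
    """Build a Graphite render URL that asks for JSON rather than a PNG."""
    params = [("from", window), ("format", "json")]
    # Same transformNull(..., 0) wrapping as the URL quoted in the task.
    params += [("target", f"transformNull({t},0)") for t in targets]
    return GRAPHITE_BASE + "?" + urlencode(params)


def classify(series, warning, critical, percentage=1.0):
    """Classify a Graphite JSON response as OK, WARNING, CRITICAL or UNKNOWN.

    `series` is the decoded JSON list returned by the render API, where each
    entry has a "datapoints" list of [value, timestamp] pairs.
    The thresholds are hypothetical; the real check's values may differ.
    """
    # Treat null datapoints as 0, mirroring transformNull(..., 0).
    values = [
        0.0 if value is None else value
        for entry in series
        for value, _timestamp in entry["datapoints"]
    ]
    if not values:
        # No data at all is a different failure than a high error rate.
        return "UNKNOWN"

    def percent_over(limit):
        return 100.0 * sum(v > limit for v in values) / len(values)

    if percent_over(critical) >= percentage:
        return "CRITICAL"
    if percent_over(warning) >= percentage:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Print the URL to fetch; feed the decoded JSON to classify().
    print(build_render_url(METRICS))
```

With every datapoint at 0, as in the graph above, `classify` returns OK for any positive thresholds. A CRITICAL result in that situation therefore points at the check command or its configuration rather than at the data.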

Change 325122 had a related patch set uploaded (by Alex Monk):
Follow-up I863367b8, Ic9db0829: These two commits conflicted

https://gerrit.wikimedia.org/r/325122

Change 325122 merged by Dzahn:
Follow-up I863367b8, Ic9db0829: These two commits conflicted

https://gerrit.wikimedia.org/r/325122

Krenair closed this task as Resolved. Dec 5 2016, 6:17 PM (edited)
Krenair assigned this task to thcipriani.

http://shinken.wmflabs.org/service/graphite-labs/Mediawiki%20Error%20Rate shows the check succeeding after I ran Puppet on the Shinken host and forced a recheck.

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:11 PM