
tune gearman alarms
Closed, Declined (Public)

Description

A spike of waiting work requests happened in Gearman. The alarm kicked in at 17:39 and the recovery notification was sent at 17:48.

Service Ok [2017-06-16 17:48:04] SERVICE ALERT: contint1001;Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman;OK;HARD;3;OK: Less than 30.00% above the threshold [90.0]
Service Critical [2017-06-16 17:39:04] SERVICE ALERT: contint1001;Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman;CRITICAL;HARD;3;CRITICAL: 42.86% of data above the critical threshold [140.0]
Service Critical [2017-06-16 17:38:04] SERVICE ALERT: contint1001;Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman;CRITICAL;SOFT;2;CRITICAL: 33.33% of data above the critical threshold [140.0]
Service Critical [2017-06-16 17:37:04] SERVICE ALERT: contint1001;Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman;CRITICAL;SOFT;1;CRITICAL: 33.33% of data above the critical threshold [140.0]

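The spike itself started at 17:30 and had fully recovered by 17:43, so the alarm fired roughly nine minutes after the start and cleared five minutes after the recovery. The check needs some tweaking, probably in the time window it queries from Graphite.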
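
To make the window effect concrete, here is a minimal sketch of a "percentage of datapoints above threshold" check. It is not the actual Graphite check code. The function names are hypothetical, the thresholds (90 warning, 140 critical, 30%) are read off the alert output above, and the `>=` comparison is an assumption based on the "Less than 30.00%" wording.

```python
def percent_above(datapoints, threshold):
    """Return the percentage of non-null datapoints strictly above threshold."""
    values = [v for v in datapoints if v is not None]  # Graphite may return nulls
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)


def evaluate(datapoints, warning=90.0, critical=140.0, percent=30.0):
    """Classify one window of datapoints (assumed semantics, see lead-in)."""
    if percent_above(datapoints, critical) >= percent:
        return "CRITICAL"
    if percent_above(datapoints, warning) >= percent:
        return "WARNING"
    return "OK"


# Hypothetical window: 3 of 7 datapoints above 140, i.e. 42.86%.
window = [150, 150, 150, 10, 10, 10, 10]
print(round(percent_above(window, 140.0), 2), evaluate(window))
```

With this kind of check, the state keeps reflecting old datapoints until they age out of the window, so a longer window delays both the alert and the recovery. The three check attempts needed to reach a HARD state add further delay.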

Capture d’écran 2017-06-16 à 20.44.28.png (330×645 px, 33 KB)