Create alerts on EventBus error rate
Closed, Resolved · Public · 3 Estimated Story Points

Description

The rate of 400 errors recently increased significantly for an unknown reason (see T153030). In this particular case the problem is not critical, so it is not an outage (though it still needs to be resolved quickly), but the issue went unnoticed for more than 24 hours, and I only noticed it because of other work I was doing on ChangeProp.

In normal operation the error rate in EventBus is close to zero; we have run for months without a single 400. So we need to set up non-paging Icinga alerts with a low threshold that fire when the rate of 400 or 500 responses exceeds it. I think 1 per second should be a very conservative threshold.
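To illustrate the proposed threshold, here is a minimal Python sketch of the check logic. It is not the check merged for this task. It assumes the service exposes cumulative 4xx and 5xx counters that are sampled periodically. The names `CounterSample`, `error_rate`, and `should_alert` are hypothetical, and the default of 1 error per second comes from the suggestion above.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One reading of cumulative HTTP error counters (hypothetical shape)."""
    timestamp: float  # seconds since epoch
    errors_4xx: int   # cumulative count of 4xx responses
    errors_5xx: int   # cumulative count of 5xx responses


def error_rate(prev: CounterSample, curr: CounterSample) -> float:
    """Return combined 4xx + 5xx errors per second between two samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = (curr.errors_4xx - prev.errors_4xx) + (curr.errors_5xx - prev.errors_5xx)
    # A negative delta means the counters were reset (e.g. a service restart);
    # treat it as zero rather than reporting a negative rate.
    return max(delta, 0) / elapsed


def should_alert(prev: CounterSample, curr: CounterSample,
                 threshold_per_sec: float = 1.0) -> bool:
    """True when the error rate strictly exceeds the threshold."""
    return error_rate(prev, curr) > threshold_per_sec
```

With samples taken 60 seconds apart, 90 new errors give a rate of 1.5 per second and would alert, while 30 new errors give 0.5 per second and would not.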

Event Timeline

Pchelolo created this task. Dec 13 2016, 2:15 AM
Restricted Application added a subscriber: Aklapper. Dec 13 2016, 2:15 AM
Ottomata claimed this task. Dec 13 2016, 3:08 PM

+1, will take this one, I should have time today.

Ottomata triaged this task as Medium priority. Dec 13 2016, 3:09 PM
Ottomata added a project: Analytics.
Ottomata set the point value for this task to 3.
Nuria edited projects, added Analytics-Kanban; removed Analytics. Dec 15 2016, 5:47 PM

Change 328239 had a related patch set uploaded (by Ottomata):
Alert on EventBus service HTTP error rate

https://gerrit.wikimedia.org/r/328239

Change 328239 merged by Ottomata:
Alert on EventBus service HTTP error rate

https://gerrit.wikimedia.org/r/328239

Ottomata moved this task from In Code Review to Done on the Analytics-Kanban board. Jan 3 2017, 8:23 PM
Nuria closed this task as Resolved. Jan 18 2017, 6:20 PM