
Tune Varnishkafka delivery errors to be more sensitive
Closed, Resolved (Public)

Description

In T172681 no varnishkafka alarm fired, and we did not notice the error until somebody happened to look at Grafana.

In T172681 I tried to tune the alarms but failed, so I rolled back to the previous version with a lower critical threshold (5k instead of 20k).

This task should check the following:

  1. Is this enough? Should we change metrics/thresholds?
  2. Check that the new alarm does not create a storm of alerts when a Kafka broker is restarted.

Event Timeline

fdans lowered the priority of this task from Medium to Low. Mar 29 2018, 5:13 PM
fdans moved this task from Wikistats to Operational Excellence on the Analytics board.

Change 443086 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] icing::monitor::analytics: move per host vk alarms to aggregates

https://gerrit.wikimedia.org/r/443086

I reviewed the per-second delivery report error rate over the past year to see which cases we need to take into consideration to establish a good threshold:

Screen Shot 2018-07-02 at 9.23.10 AM.png (914×2 px, 142 KB)

The first group of errors, in the left corner, is T172681, a case that I definitely want to catch. Zooming in:

Screen Shot 2018-07-02 at 9.21.56 AM.png (628×2 px, 238 KB)

Screen Shot 2018-07-02 at 9.21.39 AM.png (612×2 px, 307 KB)

In this case a critical threshold of 5 over the course of 20 to 30 minutes seems to be enough. The other cases should be caught as well, since they were all definitely louder than this one.
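To make the idea concrete, here is a minimal Python sketch of a windowed threshold check. It is not the actual alarm definition from the patch. It assumes one error-rate sample per minute over a 30 minute window, and it assumes the alert fires only when at least half of the samples in the window exceed the threshold. The function name `should_alert`, the parameter `min_fraction`, and the sample values are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def should_alert(
    rates: Sequence[float],
    threshold: float = 5.0,     # critical threshold discussed in this task
    min_fraction: float = 0.5,  # hypothetical: share of samples that must exceed it
) -> bool:
    """Return True when at least min_fraction of the samples exceed threshold."""
    if not rates:
        # No data in the window: do not alert in this sketch.
        return False
    above = sum(1 for rate in rates if rate > threshold)
    return above / len(rates) >= min_fraction


# Illustrative windows of 30 one-minute samples (made-up values).
sustained_errors = [8.0] * 30            # low but persistent error rate
short_burst = [0.0] * 27 + [500.0] * 3   # brief spike, e.g. around a broker restart

print(should_alert(sustained_errors))
print(should_alert(short_burst))
```

Under these assumptions a quiet but persistent error rate trips the check, while a loud spike lasting only a few minutes does not, which is the behaviour wanted to avoid an alert storm when a Kafka broker is restarted.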

Change 443086 merged by Elukey:
[operations/puppet@production] profile::cache::kafka::alerts: move per host vk alarms to aggregates

https://gerrit.wikimedia.org/r/443086