False alarms on varnish-http-requests 70% GET drop in 30 min alert
Closed, Resolved (Public)

Description

We've been getting flapping alerts like the following since we depooled traffic from the eqiad front edges:

PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: 
                   https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.

Because the site is depooled and not receiving its normal relatively large and smooth volume of GET requests, what's left is a fairly small and erratic volume of monitoring/healthcheck requests. Their rate wobbles around enough to occasionally trip the 70% trigger. I'm not exactly sure what the best way to fix it is, but the false alarm is annoying and disconcerting.
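To make the failure mode concrete, here is a minimal sketch of the arithmetic. It assumes the alert compares the current GET rate with the rate 30 minutes earlier; the exact Grafana expression is not shown in this task. The function name and all request rates are hypothetical.

```python
def relative_drop(previous_rate: float, current_rate: float) -> float:
    """Fraction by which the rate fell; 0.0 if it rose or there was no traffic."""
    if previous_rate <= 0:
        return 0.0
    return max(0.0, (previous_rate - current_rate) / previous_rate)


DROP_THRESHOLD = 0.70  # the "70% GET drop" trigger from the alert name

# Hypothetical pooled site: large, smooth traffic barely moves in relative terms.
pooled_drop = relative_drop(40000.0, 38000.0)

# Hypothetical depooled site: only healthchecks remain, so a wobble of a few
# requests per second is a large relative change.
depooled_drop = relative_drop(10.0, 2.5)

print(f"pooled:   {pooled_drop:.0%} drop, alert={pooled_drop > DROP_THRESHOLD}")
print(f"depooled: {depooled_drop:.0%} drop, alert={depooled_drop > DROP_THRESHOLD}")
```

The smaller the baseline, the less absolute change it takes to cross a purely relative threshold, which is why the depooled site flaps.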

Event Timeline

BBlack triaged this task as Medium priority. Aug 9 2018, 4:48 PM
BBlack created this task.

I believe this alert has fired a few times now and most were false positives; it is also not clear what the actionable is. I went ahead and "soft deleted" the alert in Grafana by tweaking its threshold for now. We should revisit the alert itself and the problem it is trying to catch. (cc @Krinkle and @ayounsi as potentially interested, from dashboard history)

The main goal of that alert is to be notified if a site suddenly sees its traffic drop, from a network or other issue, but isn't 100% unreachable (external monitors might not trigger).

The short-term actionable, if no maintenance is currently being done on the site, would be to depool it, since the alert would mean that an outage is ongoing.

This specific dashboard is a way to work around an upstream bug, https://github.com/grafana/grafana/issues/11563, which is now fixed.

Untested, but a possible fix (if people still think that the alert is relevant) might be to add a 2nd (hidden) graph B:
sum(job_method_status:varnish_requests:rate5m{method="GET",job="varnish-text"}) by (site)
And a 2nd alerting condition:
AND max() OF query(B, 5m, now-1m) IS ABOVE 50
(Replace 50 with a value never reached on a depooled site.)
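The intent of the combined condition can be sketched outside Grafana as follows. This is only an illustration of the logic, not the dashboard's actual implementation: the function name, parameter names, and sample values are hypothetical, and the drop fraction is assumed to be computed elsewhere by the existing alert condition.

```python
from typing import Sequence


def should_alert(
    drop_fraction: float,
    site_rates_5m: Sequence[float],
    drop_threshold: float = 0.70,
    min_rate: float = 50.0,
) -> bool:
    """Fire only if the GET rate dropped sharply AND the site had real traffic.

    site_rates_5m stands in for the samples of query B over the 5m window.
    min_rate mirrors the placeholder 50 and must be tuned to a value that a
    depooled site never reaches.
    """
    # Existing condition: the relative drop exceeds the threshold.
    dropped = drop_fraction > drop_threshold
    # Second condition: max() of query B over the window IS ABOVE min_rate.
    carrying_traffic = bool(site_rates_5m) and max(site_rates_5m) > min_rate
    return dropped and carrying_traffic


# Depooled site: a 75% wobble in healthcheck traffic is ignored.
print(should_alert(0.75, [12.0, 4.0, 9.0]))
# Pooled site: a 75% drop from real traffic levels still alerts.
print(should_alert(0.75, [30000.0, 11000.0, 10000.0]))
```

The absolute floor only gates the alert; the relative drop remains what actually triggers it.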

If we can avoid false positives I believe the alert has value, also because AIUI a traffic drop might not necessarily result in visible errors on our end?

@ayounsi and I chatted a bit about this. Having distinct per-site alerts would IMO help, also because it means we can silence individual sites when we know they are depooled. As a side point, in the future a per-site metric like site_pooled (e.g. from gdnsd) would make it easier for alerts to depend on the "pooledness" of a site.

On the condition itself, I think that instead of an absolute value it'd be nice to be able to say "the site is receiving x percent of all traffic".
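A relative gate of that kind could look roughly like the sketch below. It assumes per-site GET rates are available as a simple mapping; the function names, the 5% minimum share, the site names other than eqiad, and all rates are hypothetical.

```python
from typing import Mapping


def traffic_share(rates_by_site: Mapping[str, float], site: str) -> float:
    """Fraction of the total GET rate that the given site is serving."""
    total = sum(rates_by_site.values())
    if total <= 0:
        return 0.0
    return rates_by_site.get(site, 0.0) / total


def is_carrying_traffic(
    rates_by_site: Mapping[str, float], site: str, min_share: float = 0.05
) -> bool:
    """Treat a site as pooled only if it serves at least min_share of all traffic."""
    return traffic_share(rates_by_site, site) >= min_share


# Hypothetical rates: eqiad is depooled and only sees healthchecks.
rates = {"eqiad": 20.0, "site-b": 39980.0, "site-c": 60000.0}

for name in rates:
    print(name, f"{traffic_share(rates, name):.2%}", is_carrying_traffic(rates, name))
```

Unlike a fixed floor, a share-based gate does not need retuning as overall traffic grows, though it would also change when traffic is shifted between sites for reasons unrelated to an outage.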

Change 454613 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Per DC alerting on sudden traffic drop

https://gerrit.wikimedia.org/r/454613

Change 454613 merged by Ayounsi:
[operations/puppet@production] Per DC alerting on sudden traffic drop

https://gerrit.wikimedia.org/r/454613

Mentioned in SAL (#wikimedia-operations) [2018-08-27T18:21:32Z] <XioNoX> merge Per DC alerting on sudden traffic drop (454613) - T201630