Maniphest T201630

False alarms on varnish-http-requests 70% GET drop in 30 min alert
Closed, Resolved (Public)

Assigned To

Authored By

	BBlack
	Aug 9 2018, 4:48 PM

Description

We're getting flaps of these alerts, since we depooled traffic from the eqiad front edges:

PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.

Because the site is depooled and not receiving its normal, relatively large and smooth volume of GET requests, what's left is a fairly small and erratic volume of monitoring/healthcheck requests. Their rate wobbles around enough to occasionally trip the 70% trigger. I'm not exactly sure what the best way to fix it is, but the false alarm is annoying and disconcerting.
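To illustrate why a small request volume trips a relative threshold so easily, here is a minimal sketch. It assumes the alert compares the current GET rate with the rate 30 minutes earlier; the exact Grafana query is not shown in this task, and all rates below are made-up numbers.

```python
def relative_drop(previous, current):
    """Fractional drop from previous to current (0.7 means a 70% drop)."""
    if previous <= 0:
        return 0.0  # no baseline, so no meaningful drop to report
    return (previous - current) / previous


THRESHOLD = 0.70  # the "70% GET drop" trigger

# Hypothetical request rates (req/s), 30 minutes apart. Not measured values.
pooled_drop = relative_drop(previous=10000.0, current=9400.0)
depooled_drop = relative_drop(previous=10.0, current=2.0)

print(f"pooled:   drop={pooled_drop:.0%} alert={pooled_drop >= THRESHOLD}")
print(f"depooled: drop={depooled_drop:.0%} alert={depooled_drop >= THRESHOLD}")
```

A swing of a few requests per second is noise on a pooled site, but on a depooled site it is most of the remaining traffic, so the same absolute wobble crosses the 70% line.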

Details

	Subject	Repo	Branch	Lines +/-
	Per DC alerting on sudden traffic drop	operations/puppet	production	+23 -0

Related Objects

Mentioned In: T291148: VarnishTrafficDrop alert false positives due to DCs depooled

Event Timeline

BBlack triaged this task as Medium priority. Aug 9 2018, 4:48 PM

BBlack created this task.

Restricted Application added a subscriber: Aklapper. Aug 9 2018, 4:48 PM

• ema added a project: Traffic. Aug 9 2018, 4:53 PM

• ema subscribed.

• ema moved this task from Backlog to Caching on the Traffic board. Aug 10 2018, 6:25 AM

I believe this alert has fired a few times now, and most of those were false positives; it is also not clear what the actionable is. I went ahead and "soft deleted" the alert in Grafana by tweaking its threshold for now. We should revisit the alert itself and the problem it is trying to catch. (cc @Krinkle and @ayounsi as potentially interested, from dashboard history)

The main goal of that alert is to be notified if a site suddenly sees its traffic drop because of a network or other issue, but isn't 100% unreachable (in which case external monitors might not trigger).

The short-term actionable, if no maintenance is currently being done on the site, would be to depool it, as the drop would mean that an outage is ongoing.

This specific dashboard is a way to work around this upstream bug, which is now fixed: https://github.com/grafana/grafana/issues/11563

Untested, but a possible fix (if people still think that the alert is relevant) might be to add a 2nd (hidden) graph B:
sum(job_method_status:varnish_requests:rate5m{method="GET",job="varnish-text"}) by (site)
And a 2nd alerting condition:
AND max() OF query(B, 5m, now-1m) IS ABOVE 50 (replace 50 with a value never reached on a depooled site).
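The logic of that combined condition can be sketched outside Grafana as follows. This is only an illustration of the proposal above, not the alert as deployed: `should_alert` is a hypothetical helper, the sample rates are invented, and 50 is the placeholder floor from the comment.

```python
def should_alert(drop_fraction, recent_rates_b,
                 drop_threshold=0.70, min_rate=50.0):
    """Fire only when the drop is large AND the site recently carried real traffic.

    drop_fraction:  relative GET drop over the alert window (0.7 == 70%).
    recent_rates_b: samples of query B (per-site GET rate) over the last minutes.
    """
    big_drop = drop_fraction >= drop_threshold
    # Mirrors "max() OF query(B, 5m, now-1m) IS ABOVE 50".
    was_busy = max(recent_rates_b, default=0.0) > min_rate
    return big_drop and was_busy


# Depooled site: only healthcheck traffic, so the floor suppresses the alert.
depooled = should_alert(0.80, [3.0, 7.5, 2.1])
# Pooled site losing most of its traffic: both conditions hold.
outage = should_alert(0.80, [4200.0, 3900.0, 1100.0])
# Pooled site with a small dip: the drop condition fails.
normal = should_alert(0.10, [4200.0, 4100.0])

print(depooled, outage, normal)
```

The floor has to sit above anything healthchecks alone can generate and well below normal pooled traffic, which is why it is left as a value to be tuned.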

If we can avoid false positives, I believe the alert has value, also because, AIUI, a traffic drop might not necessarily result in visible errors on our end.

@ayounsi and I chatted a bit about this. Having distinct per-site alerts would IMO help, also because it means we can silence individual sites when we know they are depooled. As a side point, in the future a per-site metric like site_pooled (e.g. from gdnsd) will make it easier for alerts to depend on the "pooledness" of a site.

On the condition itself, I think that instead of an absolute value it'd be nice to be able to say "the site is receiving x percent of all traffic".
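A rough sketch of that share-based idea, under stated assumptions: the function names are hypothetical, the 1% cutoff is arbitrary, and the per-site rates are invented for illustration (eqiad stands in for the depooled site).

```python
def site_shares(rates_by_site):
    """Return each site's fraction of the total GET rate across all sites."""
    total = sum(rates_by_site.values())
    if total <= 0:
        return {site: 0.0 for site in rates_by_site}
    return {site: rate / total for site, rate in rates_by_site.items()}


def sites_carrying_traffic(rates_by_site, min_share=0.01):
    """Sites whose share of all traffic is high enough to be worth alerting on."""
    shares = site_shares(rates_by_site)
    return {site for site, share in shares.items() if share >= min_share}


# Hypothetical GET rates (req/s) per site. Not measured values.
rates = {"eqiad": 8.0, "codfw": 30000.0, "esams": 50000.0}
eligible = sites_carrying_traffic(rates)
print(sorted(eligible))
```

Unlike a fixed floor, a share follows overall traffic as it grows or shrinks, so the cutoff should need retuning less often.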

Change 454613 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Per DC alerting on sudden traffic drop

https://gerrit.wikimedia.org/r/454613

gerritbot added a project: Patch-For-Review. Aug 22 2018, 6:19 PM

ayounsi claimed this task. Aug 22 2018, 10:00 PM

Change 454613 merged by Ayounsi:
[operations/puppet@production] Per DC alerting on sudden traffic drop

https://gerrit.wikimedia.org/r/454613

Mentioned in SAL (#wikimedia-operations) [2018-08-27T18:21:32Z] <XioNoX> merge Per DC alerting on sudden traffic drop (454613) - T201630

• chasemp subscribed. Aug 27 2018, 6:23 PM

All checks are green: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=traffic+drop
Dashboard alert removed.

• ema mentioned this in T291148: VarnishTrafficDrop alert false positives due to DCs depooled. Sep 16 2021, 7:09 AM

Maintenance_bot removed a project: Patch-For-Review. Sep 16 2021, 7:10 AM
