Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey
Closed, Resolved (Public)

Description

For the past week or so, the Varnish traffic drop alert has been noisy, specifically for codfw: https://logstash.wikimedia.org/goto/1ec92a9c4ab13292b83f21403e7052d1

This does seem to correlate with some odd minute-to-minute spikiness in codfw's traffic flow (https://w.wiki/ChG), which perhaps should be investigated as well.

One of the things I think we should do is to add an absolute minimum traffic level required to alert, since a simple ratio will always be subject to this kind of noise. Here's a plot of one way we could express that in PromQL: https://w.wiki/ChE
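
As a rough illustration of the idea (not the actual PromQL expression behind the link above), the gating logic could look like the sketch below. The function name, the 50% drop ratio, and the 1000 rps floor are all hypothetical values picked for the example, and applying the floor to the baseline rather than to the current rate is also an assumption.

```python
def should_alert(current_rps: float,
                 baseline_rps: float,
                 max_ratio: float = 0.5,            # hypothetical: alert if traffic falls below 50% of baseline
                 min_baseline_rps: float = 1000.0   # hypothetical absolute floor, in requests per second
                 ) -> bool:
    """Return True only for a large relative drop from a meaningful baseline.

    A pure ratio check fires whenever current/baseline is small, which is
    noisy when the baseline itself is tiny. Requiring the baseline to exceed
    an absolute floor suppresses those low-traffic false positives.
    """
    if baseline_rps < min_baseline_rps:
        # Too little traffic for the ratio to be meaningful; never alert.
        return False
    return current_rps < baseline_rps * max_ratio


if __name__ == "__main__":
    print(should_alert(4000, 10000))  # large drop from a high baseline
    print(should_alert(40, 100))      # same ratio, but baseline is below the floor
```

With these example values, a fall from 10000 rps to 4000 rps would alert, while the same relative fall from 100 rps to 40 rps would not, which is the kind of low-volume noise the absolute floor is meant to filter out.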

(We might also want to make the traffic drop alerts based off of ATS metrics and not Varnish frontend ones?)

Event Timeline

Took a quick look at the expression and the idea LGTM, thanks @CDanis. Also cc @ayounsi as the original implementor of the alert

jbond triaged this task as Medium priority. Nov 26 2019, 11:51 AM

Mentioned in SAL (#wikimedia-operations) [2019-11-27T08:24:07Z] <godog> silence codfw varnish traffic drop until dec 9th - T239039

Change 555550 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] traffic drop: require minimum absolute rps

https://gerrit.wikimedia.org/r/555550

Change 555550 merged by CDanis:
[operations/puppet@production] traffic drop: require minimum absolute rps

https://gerrit.wikimedia.org/r/555550

CDanis claimed this task.

Looking at some data in Grafana Explore, this change would have prevented most of the noisy alerts from the past few months. Calling it resolved for now.