
Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey
Closed, Resolved · Public

Description

For the past week or so, the Varnish traffic drop alert has been noisy, specifically for codfw: https://logstash.wikimedia.org/goto/1ec92a9c4ab13292b83f21403e7052d1

This seems to correlate with some odd minute-to-minute spikiness in codfw's traffic flow (https://w.wiki/ChG), which should perhaps be investigated as well.

One thing I think we should do is add an absolute minimum traffic level required to alert, since a simple ratio will always be subject to this kind of noise. Here's a plot of one way we could express that in PromQL: https://w.wiki/ChE
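To illustrate the idea outside of PromQL, here is a minimal Python sketch of a ratio check gated by an absolute floor. The function name, the thresholds (`max_ratio`, `min_baseline_rps`), and the choice to apply the floor to the baseline rather than to the current rate are all assumptions made for illustration; they are not taken from the linked plot or from the actual alert.

```python
def should_alert(current_rps: float,
                 baseline_rps: float,
                 max_ratio: float = 0.7,             # hypothetical ratio threshold
                 min_baseline_rps: float = 1000.0    # hypothetical absolute floor
                 ) -> bool:
    """Decide whether a traffic drop is worth alerting on.

    Fires only when both conditions hold:
      1. the baseline traffic is at or above an absolute floor, so the
         ratio is not dominated by noise at low request rates, and
      2. the current rate has fallen below max_ratio times the baseline.
    """
    if baseline_rps < min_baseline_rps:
        # Too little traffic for a ratio to be meaningful: stay quiet.
        return False
    return current_rps < max_ratio * baseline_rps


# A 50% drop at low volume is ignored; the same drop at high volume alerts.
low_volume = should_alert(current_rps=50, baseline_rps=100)
high_volume = should_alert(current_rps=5000, baseline_rps=10000)
small_dip = should_alert(current_rps=9000, baseline_rps=10000)
```

With a plain ratio, the first case would fire even though the absolute change is only 50 rps; the floor is what suppresses it.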

(We might also want to base the traffic drop alerts on ATS metrics rather than Varnish frontend ones?)

Details

Related Gerrit Patches:
[operations/puppet@production] traffic drop: require minimum absolute rps

Event Timeline

CDanis created this task. · Mon, Nov 25, 1:07 AM
Restricted Application added a subscriber: Aklapper. · Mon, Nov 25, 1:07 AM
CDanis updated the task description. · Mon, Nov 25, 1:09 AM

Took a quick look at the expression and the idea LGTM, thanks @CDanis. Also cc'ing @ayounsi as the original implementor of the alert.

CDanis moved this task from Backlog to In progress on the observability board. · Mon, Nov 25, 4:07 PM
jbond triaged this task as Medium priority. · Tue, Nov 26, 11:51 AM

Mentioned in SAL (#wikimedia-operations) [2019-11-27T08:24:07Z] <godog> silence codfw varnish traffic drop until dec 9th - T239039

Change 555550 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] traffic drop: require minimum absolute rps

https://gerrit.wikimedia.org/r/555550

Change 555550 merged by CDanis:
[operations/puppet@production] traffic drop: require minimum absolute rps

https://gerrit.wikimedia.org/r/555550

CDanis closed this task as Resolved. · Mon, Dec 9, 7:16 PM
CDanis claimed this task.

Looking at some data in Grafana Explore, this change would have eliminated most cases of noise over the past few months, so I'm calling it resolved for now.