
Reduce/eliminate false positives for VarnishKafkaNoMessages alert
Closed, Resolved · Public

Description

Since creating the VarnishKafkaNoMessages alert we have been dogged by false positives.

Improvements have been made in this commit but there are still occasions when the current logic results in unwelcome alerts.

Such occasions may include:

  • Rolling restarts of varnish servers
  • Pooling and depooling of data centres

The Traffic team is aware of this behaviour and sometimes notifies the Data-Engineering team when this work happens.

However, we should ensure that we tune the alerting rules in order to avoid alert fatigue and the potential to overlook genuine incidents.

Event Timeline

Change 887780 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/alerts@master] fix(varnishkafka): Use rate instead of irate and increase period of VarnishkafkaNoMessages alerts

https://gerrit.wikimedia.org/r/887780

Change 887780 merged by jenkins-bot:

[operations/alerts@master] fix(varnishkafka): Use rate instead of irate and increase period of VarnishkafkaNoMessages alerts

https://gerrit.wikimedia.org/r/887780
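
For context, the gist of this change is to replace irate() with rate() over a longer window in the alert expression. The following is only a sketch of that shape, not the actual rule from operations/alerts: the metric name comes from the Prometheus graphs linked later in this task, while the selector, window and threshold are assumptions.

```
groups:
  - name: varnishkafka
    rules:
      - alert: VarnishkafkaNoMessages
        # irate() only looks at the last two samples, so a single empty scrape
        # interval is enough to fire; rate() over a wider window averages a
        # brief drop (e.g. a rolling restart) away.
        expr: sum by (instance) (rate(rdkafka_producer_topic_partition_msgs[5m])) < 1
```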

A false alert was still reported today: (VarnishkafkaNoMessages) firing: varnishkafka on cp4044 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4044%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages

Here we can see the Prometheus metric leading to the alert:

image.png (426×2 px, 52 KB)

At that time Prometheus does not report any of the instances as down, so this should not be linked to any down services or restarts.
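
(This can be checked with the standard Prometheus up metric; a minimal sketch of that kind of query, with the instance selector as an assumption:)

```
# Returns 1 for targets that were scraped successfully; a value of 0 (or no
# series at all) would indicate a down exporter on cp4044.
up{instance=~"cp4044.*"}
```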

Change 887966 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/alerts@master] fix(varnishkafka): add alert duration of 5m to avoid false positive

https://gerrit.wikimedia.org/r/887966

From this graph we can see that no requests were received on the varnish host, which leads to no events from varnishkafka, as expected:
https://prometheus-ulsfo.wikimedia.org/ops/classic/graph?g0.range_input=1h&g0.end_input=2023-02-09%2010%3A30&g0.expr=rate(rdkafka_producer_topic_partition_msgs%7Binstance%3D~%22cp4044.*%22%7D%5B2m%5D)&g0.tab=0
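
Decoded from the URL above, the query behind that graph is:

```
rate(rdkafka_producer_topic_partition_msgs{instance=~"cp4044.*"}[2m])
```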

From the host I can observe that no process was restarted in the meantime.
We would need to understand what the cause of this drop is (perhaps some kind of depool).
Meanwhile I propose to apply https://gerrit.wikimedia.org/r/887966 to ensure that we remain below the threshold for at least 5 minutes before raising the alert.

The drop is indeed due to a depool:

09:09 	<vgutierrez> 	pool cp4044 with ESI testing enabled 	[production]
...
08:57 	<vgutierrez> 	depool cp4044 - T308799

I've looked back at the alerts we faced on the morning of the 7th, and those were due to a rolling restart.
For that pattern the first change (fix(varnishkafka): Use rate instead of irate and increase period of VarnishkafkaNoMessages alerts) did remove the potential alerts, as it smooths out a drop in events/requests that lasts only a minute.
For this "new" pool/depool pattern, the drop lasts for the whole duration of the depool (say 10 minutes), and the moving average can still be affected not only when we pool again but even when we depool.
Indeed, during the depool the rate of events reported as requests is higher than the rate of events sent over 1 minute (see https://grafana-rw.wikimedia.org/d/sciG_j04z/varnishreqvseventssend?orgId=1&from=1675933637788&to=1675933640581), because metrics are scraped every 1 minute.

The 5-minute duration change will ensure that we are not affected by this temporary difference in rates.
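
To illustrate, the effect of https://gerrit.wikimedia.org/r/887966 is to add a for: clause, so the condition has to keep holding for 5 consecutive minutes before the alert fires. This is a minimal sketch only, with the expression and threshold assumed as in the earlier sketch:

```
- alert: VarnishkafkaNoMessages
  expr: sum by (instance) (rate(rdkafka_producer_topic_partition_msgs[5m])) < 1
  # With "for: 5m" the alert only fires if the rate stays below the threshold
  # across 5 consecutive minutes of evaluations, so a short pool/depool dip
  # no longer pages anyone.
  for: 5m
```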

Change 887966 merged by jenkins-bot:

[operations/alerts@master] fix(varnishkafka): add alert duration of 5m to avoid false positive

https://gerrit.wikimedia.org/r/887966