
Reduce/eliminate false positives for VarnishKafkaNoMessages alert
Closed, Resolved · Public

Description

Since creating the VarnishKafkaNoMessages alert we have been dogged by false positives.

Improvements have been made in this commit but there are still occasions when the current logic results in unwelcome alerts.

Such occasions may include:

  • Rolling restarts of varnish servers
  • Pooling and depooling of data centres

The Traffic team is aware of this behaviour and sometimes notifies the Data-Engineering team when this work happens.

However, we should ensure that we tune the alerting rules in order to avoid alert fatigue and the potential to overlook genuine incidents.

Event Timeline

Change 887780 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/alerts@master] fix(varnishkafka): Use rate instead of irate and increase period of VarnishkafkaNoMessages alerts

https://gerrit.wikimedia.org/r/887780

Change 887780 merged by jenkins-bot:

[operations/alerts@master] fix(varnishkafka): Use rate instead of irate and increase period of VarnishkafkaNoMessages alerts

https://gerrit.wikimedia.org/r/887780
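
For context, the gist of this change is to replace irate() with rate() over a longer window in the alert expression. The following is only a sketch of that shape, not the actual rule from operations/alerts: the metric name comes from the Prometheus graphs linked later in this task, while the selector, window and threshold are assumptions.

```
groups:
  - name: varnishkafka
    rules:
      - alert: VarnishkafkaNoMessages
        # irate() only looks at the last two samples, so a single empty scrape
        # interval is enough to fire; rate() over a wider window averages a
        # brief drop (e.g. a rolling restart) away.
        expr: sum by (instance) (rate(rdkafka_producer_topic_partition_msgs[5m])) < 1
```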

A false alert was still reported today: (VarnishkafkaNoMessages) firing: varnishkafka on cp4044 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4044%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages

Here we can see the Prometheus metric leading to the alert:

image.png (426×2 px, 52 KB)

At that time Prometheus does not report any of the instances as down, so this should not be linked to any down services or restarts.
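
(This can be checked with the standard Prometheus up metric; a minimal sketch of that kind of query, with the instance selector as an assumption:)

```
# Returns 1 for targets that were scraped successfully; a value of 0 (or no
# series at all) would indicate a down exporter on cp4044.
up{instance=~"cp4044.*"}
```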

Change 887966 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/alerts@master] fix(varnishkafka): add alert duration of 5m to avoid false positive

https://gerrit.wikimedia.org/r/887966

From this graph we can see that no requests were received on the varnish host, which leads to no events from varnishkafka, as expected:
https://prometheus-ulsfo.wikimedia.org/ops/classic/graph?g0.range_input=1h&g0.end_input=2023-02-09%2010%3A30&g0.expr=rate(rdkafka_producer_topic_partition_msgs%7Binstance%3D~%22cp4044.*%22%7D%5B2m%5D)&g0.tab=0
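
Decoded from the URL above, the query behind that graph is:

```
rate(rdkafka_producer_topic_partition_msgs{instance=~"cp4044.*"}[2m])
```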

From the host I can observe that no process was restarted in the meantime.
We would need to understand what the cause of this drop is (perhaps some kind of depool).
Meanwhile I propose to apply https://gerrit.wikimedia.org/r/887966 to ensure that we remain below the threshold for at least 5 minutes before raising the alert.

The drop is indeed due to a depool:

09:09 	<vgutierrez> 	pool cp4044 with ESI testing enabled 	[production]
...
08:57 	<vgutierrez> 	depool cp4044 - T308799

I've looked back at the alerts we faced on the morning of the 7th, and those were due to a rolling restart.
For that pattern the first change (fix(varnishkafka): Use rate instead of irate and increase period of VarnishkafkaNoMessages alerts) did remove the potential alerts, as it smooths out a drop in events/requests that lasts only a minute.
For this "new" pool/depool pattern, the drop lasts for the whole duration of the depool (say 10 minutes), and the moving average can still be affected not only when we pool again but even when we depool.
Indeed, during the depool the rate of events reported as requests is higher than the rate of events sent over 1 minute (see https://grafana-rw.wikimedia.org/d/sciG_j04z/varnishreqvseventssend?orgId=1&from=1675933637788&to=1675933640581), because metrics are scraped every 1 minute.

The 5-minute duration change will ensure that we are not affected by this temporary difference in rates.
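
To illustrate, the effect of https://gerrit.wikimedia.org/r/887966 is to add a for: clause, so the condition has to keep holding for 5 consecutive minutes before the alert fires. This is a minimal sketch only, with the expression and threshold assumed as in the earlier sketch:

```
- alert: VarnishkafkaNoMessages
  expr: sum by (instance) (rate(rdkafka_producer_topic_partition_msgs[5m])) < 1
  # With "for: 5m" the alert only fires if the rate stays below the threshold
  # across 5 consecutive minutes of evaluations, so a short pool/depool dip
  # no longer pages anyone.
  for: 5m
```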

Change 887966 merged by jenkins-bot:

[operations/alerts@master] fix(varnishkafka): add alert duration of 5m to avoid false positive

https://gerrit.wikimedia.org/r/887966