
Replicate current low-message alerting from VarnishKafka
Closed, Resolved, Public

Description

Currently, Alertmanager uses the configuration defined in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-data-engineering/varnishkafka.yaml to alert on a low rate of messages sent by VarnishKafka (compared to the number of requests actually received by Varnish).

The same approach must be replicated for HAProxyKafka too.
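As a rough illustration of the approach, a Prometheus alerting rule for haproxykafka could compare the two rates directly. This is only a sketch: the metric names (haproxykafka_delivered_messages_total, haproxy_frontend_http_requests_total), threshold, and labels are assumptions for illustration, not the rule that was actually merged in the linked changes.

```yaml
# Sketch only: metric names, threshold, and labels are assumptions;
# the real rule lives in operations/alerts.
groups:
  - name: haproxykafka
    rules:
      - alert: HaproxykafkaLowMessages
        expr: |
          (
              sum by (hostname) (rate(haproxykafka_delivered_messages_total[5m]))
            /
              sum by (hostname) (rate(haproxy_frontend_http_requests_total[5m]))
          ) < 0.5
        for: 10m
        labels:
          team: data-engineering
          severity: warning
        annotations:
          summary: >-
            Haproxy on {{ $labels.hostname }} is receiving requests,
            but HaproxyKafka is not sending enough messages
```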

Event Timeline

Change #1136383 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/alerts@master] data-engineering: duplicating varnishkafka alerts

https://gerrit.wikimedia.org/r/1136383

Gehel triaged this task as High priority.May 6 2025, 12:07 PM
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.

Change #1136383 merged by Fabfur:

[operations/alerts@master] data-engineering: duplicating varnishkafka alerts

https://gerrit.wikimedia.org/r/1136383

Change #1146516 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/alerts@master] Remove unused varnishkafka configuration

https://gerrit.wikimedia.org/r/1146516

Change #1146516 merged by Fabfur:

[operations/alerts@master] Remove unused varnishkafka configuration

https://gerrit.wikimedia.org/r/1146516

Hi @Fabfur - could you let us know a status update on this one, please?
We're still receiving quite a few alerts, such as HaproxykafkaNoMessages and AlertLintProblem haproxykafka_saturation_errors, and I'm assuming that they're still related to this WIP rather than to actual problems.

Hi @BTullis, this should be related to the fact that we don't (luckily) have enough datapoints for saturation errors, so the alert is triggered. If it becomes a pain because it fires too often, I can disable the alert as a last resort...

Hi @Fabfur - Just as an interesting data point, I noticed that we received a HaproxykafkaNoMessages alert yesterday, which seems to have been associated with a manual depool and repool of cp7001.

I don't know if you think that it's worth looking at tweaking the alert definition in this case, or maybe you already know why it fired. I just thought that I'd let you know about it, in case you have any ideas. Thanks.

Hi @BTullis, sorry for the late answer. I think this fired correctly: since the host was depooled, it produced no haproxykafka messages, so IMHO firing is the right thing to do. In this case we usually both depool and silence the affected host (if the depool lasts longer than a few minutes). IIRC varnishkafka had the same behaviour.

> I think this fired correctly because being depooled the host produced no haproxykafka messages so, IMHO is the right thing to do.

> IIRC varnishkafka had the same behavior.

Understood, thanks. I don't think that this was quite the previous behaviour of varnishkafka, at least in the last iteration of the alert definition.
I think we had got it to the point where we measured the incoming rate of requests to varnish and the outgoing rate of events from varnishkafka, then alerted if there was a significant difference between the two.
This means that the alert wouldn't fire if a host, or a whole data centre, was depooled.
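One way to make that depool-safe property explicit is to gate the ratio on a minimum request rate, so a host that is receiving no traffic can never fire the alert. A sketch of such an expression, again with hypothetical metric names and thresholds:

```yaml
# Sketch only: metric names and thresholds are assumptions for illustration.
# The "and" clause ensures a depooled host (near-zero request rate)
# produces no alert, regardless of the message/request ratio.
expr: |
  (
      sum by (hostname) (rate(haproxykafka_delivered_messages_total[5m]))
    /
      sum by (hostname) (rate(haproxy_frontend_http_requests_total[5m]))
  ) < 0.5
  and
  sum by (hostname) (rate(haproxy_frontend_http_requests_total[5m])) > 10
```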

The description in the current alert still seems to reflect this:

'Haproxy on {{ $labels.hostname }} is receiving requests, but HaproxyKafka is not sending enough messages'

Do you think that this would still be worth investigating, or do you think that we can automate the silences sufficiently to stop the false positives? Or am I making things too difficult for no good reason?