Add alert for varnishkafka low/zero messages per second to alertmanager
Closed, Resolved · Public · 1 Estimated Story Points

Description

As a result of this incident: T300164: Pageview Data loss due to wrong version of package installed on some varnishkafka instances
We have identified a requirement to add an alert for when a particular varnishkafka instance is not successfully sending messages to Kafka.

We currently have alerts configured in https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/manifests/cache/kafka/varnishkafka_delivery_alert.pp using Prometheus metrics.

The metric that we wish to alert on is rdkafka_producer_topic_partition_msgs, specifically its rate over a period such as 5 minutes.
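
For illustration only, a rate-based rule could look roughly like the following; the threshold, labels, and severity here are assumptions rather than the final rule:

- alert: VarnishkafkaNoMessages
  # Fire when a varnishkafka instance has produced essentially no messages
  # to Kafka over the last 5 minutes. Threshold and label names are illustrative.
  expr: sum by (instance, source) (rate(rdkafka_producer_topic_partition_msgs[5m])) < 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: 'varnishkafka on {{ $labels.instance }} is not delivering messages to Kafka'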

Update: The check is now in place, but it has two important shortcomings.

1: It cannot detect when a caching proxy host has been intentionally depooled using confctl
2: It cannot detect when a data centre has been intentionally depooled using GeoDNS

For the first of these I am attempting to add the conftool pooled/depooled status and weight values to prometheus, so that these can be integrated with the check. I am working on this feature in T309189: Add the conftool pooled/depooled status and weight into prometheus for each service

For the second I haven't got a solution at the moment. Maybe get the admin_state values into prometheus as well?

Event Timeline

Prioritizing for after the Catalog implementation.

I don't think this has to do with the data catalog.

I'm going to decline this for now in favor of a sister task to look into possibly using etcd config to make robust high level alerts. If that doesn't pan out, we can open this up again. For now, the concern is that this would be hard to configure for other pipelines and might be hard to manage false positives for.

However, Andrew had a good suggestion for false positives which is that this alert could be coordinated with server downtime or depool time so that it doesn't fire when we wouldn't expect it to fire. Keeping that idea here in case we have to come back to it.

Hey @Milimetric, can you add more info about the concerns for this task? It is a simple alert; we have similar ones, like the Kafka delivery errors alert, so it shouldn't be too different. I agree we may have false positives, but we could start by adding some alerts and see how it goes, rather than declining.

I saw T304651 and I am not 100% sure that it can replace an alert, since anything that runs as a batch job on the Hadoop cluster cannot (in my opinion) replace an alerting system maintained by SRE. If any of the batch jobs needs to be stopped or doesn't work, we may lose monitoring, whereas with an alert this cannot happen.

Change 773801 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Add an alert for zero messages being generated by varnishkafka instances

https://gerrit.wikimedia.org/r/773801

I'm in agreement with @elukey, I think. It seems sensible for us to implement this Prometheus-based check anyway, given that we already have the data.
I've no objection to the belt-and-braces approach of developing a parallel check with canary events or something, but personally I would implement this first.

So I've gone ahead and created an alert in alertmanager: https://gerrit.wikimedia.org/r/773801 to get us started.

Admittedly I think that this alert will fire when cp hosts are depooled. I'll have to see if there's any way that we can integrate a silence with confctl or similar.

I asked a question in #wikimedia-traffic on IRC about this. It seems that the ideal way to do it would be to get the pooled/depooled status of hosts into prometheus, so that we could integrate this status into the alert. Several people thought that this would be useful.
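
To sketch what that integration could look like, assuming a hypothetical metric named conftool_pooled_status existed (nothing like it exists in Prometheus yet, and the label matching is hand-waved here):

# Only evaluate the low-message condition for hosts that conftool considers
# pooled. conftool_pooled_status is a made-up metric name; something along
# these lines is what we would need to export.
  sum by (instance) (rate(rdkafka_producer_topic_partition_msgs[5m])) < 1
and on (instance)
  conftool_pooled_status == 1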

Tagging @fgiunchedi who might be able to advise further. In the meantime I'd be quite keen to get the alert deployed as-is, without the exclusion based on pooled/depooled status.

Ok, sorry for the noise, and thanks for explaining. Makes sense. I would say that if the false positives get to be too noisy, just tweak it so that they're a lot less likely. Because the other approaches should cover the data loss cases, even if they're sporadically down.

Indeed, the pooled/depooled status as a metric is something we want for sure (I don't have the bandwidth to work on it at the moment). And nowadays it should be quite simple to add (happy to assist, though it's out of scope for this task).

HTH!

Great. I'm going to try to work on that. Will add you to any patches.

Just to make sure the expectations are correct: one thing is the desired pooled state in etcd; another is whether the host is effectively pooled in pybal, as pybal does its own health checks and can depool any host that fails them.

Yes, I see. I hadn't picked up on that nuance. I've effectively been looking at implementing the first of those elements: the desired pooled state from etcd.

Come to think of it, we can already infer the realized pooled state from the node_ipvs_backend_weight (or other node_ipvs_*) values that we collect, can't we?
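
Something roughly like this, if I'm remembering the metric's labels correctly (worth double-checking):

# Sketch only: a backend with a non-zero IPVS weight on the load balancers is
# effectively pooled. The exact label set of node_ipvs_backend_weight (e.g.
# local_address, remote_address) needs verifying, and mapping the backend
# address back to a cp hostname is not shown here.
min by (remote_address) (node_ipvs_backend_weight) > 0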

Change 773801 merged by Btullis:

[operations/alerts@master] Add an alert for zero messages being generated by varnishkafka instances

https://gerrit.wikimedia.org/r/773801

The alert fired today, shortly after merging, when @Joe intentionally depooled cp1079 in order to re-image it.

[Screenshot: image.png, showing the alert firing for cp1079]

While this was a false positive result, it was also quite a nice test to make sure that the alert condition works as expected and the dashboard link also works as expected.

I will continue working on the etcd/confd integration with pybal so that we can get the pooled/depooled status into prometheus and use that to remove these false positives.

We had a genuine alert for this today, from cp3050.

[Screenshot: image.png, showing the alert firing for cp3050]

Investigating now.

Mentioned in SAL (#wikimedia-analytics) [2022-04-01T09:05:50Z] <btullis> restarted varnishkafka-eventlogging.service on cp3050 T300246

Change 776225 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add initial config for pooled status

https://gerrit.wikimedia.org/r/776225

BTullis added a subscriber: Milimetric.

Claiming this ticket back again, to reflect what I'm currently working on. Hope that's OK.
I'm happy to transfer it back again once the prometheus check is fixed, so that we can think about a more comprehensive check as previously discussed.

Change 776912 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Remove the statsv source from the VarnishkafkaNoMessages alert

https://gerrit.wikimedia.org/r/776912

Change 776912 merged by Btullis:

[operations/alerts@master] Remove the statsv source from the VarnishkafkaNoMessages alert

https://gerrit.wikimedia.org/r/776912

BTullis moved this task from In Progress to Paused on the Data-Engineering-Kanban board.

Moving this ticket to Paused while I work on T309189.

I might be very naive, but wouldn't it make sense to write a prometheus alert that checks if there are no / few messages in varnishkafka while there is traffic getting to varnish?

So basically your alert would become "rdkafka_producer_topic_partition_msgs is below a certain threshold and the varnish rps is above a certain threshold"

Is there a reason not to go this way?

Specifically I just plotted

sum(irate(rdkafka_producer_topic_partition_msgs{instance="cp3050:9132", source="webrequest"}[5m])) / sum(irate(varnish_requests{instance="cp3050:3903", method!~"PURGE"}[5m]))

and on April 1st, when you reported a genuine alert, the data shows that the ratio plunged to ~0, so it seems like this is a valid way of extracting the signal you want.
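
Generalising that beyond the single hard-coded host would be something like the following (sketch only; the label_replace is there because the two exporters expose different ports in their instance labels, and the exact label handling would need checking):

# Per-host ratio of varnishkafka webrequest messages to varnish requests.
# label_replace strips the exporter port so the two series can be matched on
# a common "host" label; thresholds and label handling are illustrative only.
sum by (host) (
    label_replace(irate(rdkafka_producer_topic_partition_msgs{source="webrequest"}[5m]),
                  "host", "$1", "instance", "(.+):.+")
)
/
sum by (host) (
    label_replace(irate(varnish_requests{method!~"PURGE"}[5m]),
                  "host", "$1", "instance", "(.+):.+")
)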

Thanks @Joe for the insight and clarity. I think you're quite right; I couldn't see the wood for the trees.

I can create a prometheus alert based on your suggestion and decline the over-complicated T309189: Add the conftool pooled/depooled status and weight into prometheus for each service

Change 776225 abandoned by Btullis:

[operations/puppet@production] Add a host's conftool pooled status and weight per service to prometheus

Reason:

No longer necessary

https://gerrit.wikimedia.org/r/776225

BTullis set the point value for this task to 1. Aug 3 2022, 4:14 PM

Setting story points to 1 for the remaining work only.

Change 822367 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Update the VarnishKafkaNoMessages alert

https://gerrit.wikimedia.org/r/822367

I now have an alert that I think I'm happy with.

https://gerrit.wikimedia.org/r/c/operations/alerts/+/822367

This compares the rate at which varnish requests arrive with the rate at which varnishkafka submits messages to kafka, over a sliding 3-minute window.

If the rate of varnishkafka messages is less than 30% of the rate of varnish requests for a given host, then fire a warning.

If that ratio is below 20%, then fire a critical alert.

I've also included a test that checks we do not see any false alerts when both rates drop to zero.
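
Roughly speaking, the shape of the rule is as follows. This is a simplified sketch rather than the exact merged rule, and it glosses over the fact that the two exporters expose different ports in their instance labels:

groups:
  - name: varnishkafka
    rules:
      - alert: VarnishKafkaNoMessages
        # Warning: varnishkafka is delivering fewer than 30% as many messages
        # as varnish is receiving requests, over a sliding 3-minute window.
        # When both rates are zero the ratio is 0/0 = NaN, which never
        # satisfies the comparison, so the alert stays silent; that is the
        # behaviour the accompanying test asserts.
        expr: |
          sum by (instance) (rate(rdkafka_producer_topic_partition_msgs{source="webrequest"}[3m]))
          /
          sum by (instance) (rate(varnish_requests{method!~"PURGE"}[3m]))
          < 0.3
        labels:
          severity: warning
      - alert: VarnishKafkaNoMessages
        # Critical: the same ratio drops below 20%.
        expr: |
          sum by (instance) (rate(rdkafka_producer_topic_partition_msgs{source="webrequest"}[3m]))
          /
          sum by (instance) (rate(varnish_requests{method!~"PURGE"}[3m]))
          < 0.2
        labels:
          severity: critical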

Thanks to @Joe for the inspiration.

We can also look at this time series in thanos here.

Change 822367 merged by jenkins-bot:

[operations/alerts@master] Update the VarnishKafkaNoMessages alert

https://gerrit.wikimedia.org/r/822367