As a result of this incident: T300164: Pageview Data loss due to wrong version of package installed on some varnishkafka instances
We have identified a requirement to add an alert for when a particular varnishkafka instance is not successfully sending messages to Kafka.
We currently have alerts configured in https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/manifests/cache/kafka/varnishkafka_delivery_alert.pp using Prometheus metrics.
The metric that we want to alert on is rdkafka_producer_topic_partition_msgs, specifically its rate over a period such as 5 minutes.
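To make the intent of the check concrete, here is a minimal Python sketch of a rate-over-window calculation on a counter such as rdkafka_producer_topic_partition_msgs. It is not the Prometheus implementation of rate() (there is no extrapolation to the window boundaries) and it is not the check defined in varnishkafka_delivery_alert.pp. The names counter_rate and is_stalled, the sample layout, and the "rate equals zero" condition are assumptions made for illustration only.

```python
from typing import Optional, Sequence, Tuple

# Assumed layout: (unix timestamp in seconds, counter value), oldest first.
Sample = Tuple[float, float]


def counter_rate(samples: Sequence[Sample], window_s: float = 300.0) -> Optional[float]:
    """Approximate per-second increase of a counter over the last window_s seconds.

    Returns None when fewer than two samples fall inside the window,
    so that "no data" is not confused with "no messages sent".
    """
    if not samples:
        return None
    end = samples[-1][0]
    recent = [s for s in samples if s[0] >= end - window_s]
    if len(recent) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(recent, recent[1:]):
        # A drop means the counter was reset (for example after a restart),
        # so the new value is counted as an increase from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = recent[-1][0] - recent[0][0]
    return increase / elapsed if elapsed > 0 else None


def is_stalled(samples: Sequence[Sample], window_s: float = 300.0) -> bool:
    """Hypothetical alert condition: data is present but the counter did not move."""
    rate = counter_rate(samples, window_s)
    return rate is not None and rate == 0.0
```

With this shape, an instance that reports samples but whose counter stays flat for the whole window is flagged, while an instance with no samples at all is left to a separate "missing data" check.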
Update: The check is now in place, but it has two important shortcomings.
1: It cannot detect when a caching proxy host has been intentionally depooled using confctl.
2: It cannot detect when a data centre has been intentionally depooled using GeoDNS.
For the first of these, I am attempting to add the conftool pooled/depooled status and weight values to Prometheus, so that the check can take them into account. I am working on this feature in T309189: Add the conftool pooled/depooled status and weight into prometheus for each service
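As a rough illustration of how the pooled status could be combined with the delivery check, the Python sketch below suppresses the alert for hosts that are not expected to be serving traffic. Everything here is hypothetical: the HostState type, its field names, and the assumption that a weight of 0 means the host receives no traffic are invented for the example and do not describe what T309189 will actually export.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostState:
    """Hypothetical per-host view combining both data sources."""
    msgs_rate: Optional[float]  # messages/s over the window; None means no data
    pooled: bool                # assumed to come from the conftool pooled status
    weight: int                 # assumed to come from the conftool weight value


def should_alert(state: HostState) -> bool:
    """Alert only when a host that should be receiving traffic sends nothing to Kafka."""
    # Assumption: a depooled host, or one with weight 0, is idle on purpose.
    if not state.pooled or state.weight == 0:
        return False
    # Missing data is deliberately not treated as a delivery failure here.
    return state.msgs_rate is not None and state.msgs_rate == 0.0
```

This only addresses the first shortcoming. A data centre depooled through GeoDNS would still look pooled in this model, so it would need an additional input.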
For the second, I haven't got a solution at the moment. One option might be to get the admin_state values into Prometheus as well.