
Modify Kafka max replica lag alert to only alert if increasing
Closed, ResolvedPublic

Description

We currently graph and alert on Kafka max replica lag if it goes above a hardcoded threshold.

We are currently balancing Kafka partitions in Kafka jumbo-eqiad to spread the load to the new brokers. To do this, Kafka temporarily adds the new brokers to a partition's replica list, which causes them to start consuming that partition from the leader from its beginning. When that happens, their replica lag suddenly jumps to the maximum value for that partition (however many messages are currently stored). For large partitions this can be pretty huge, causing false alerts to fire.

Instead of alerting on a hardcoded max lag threshold, we should just alert if the lag is increasing over some time window.

I believe this should be possible using the Prometheus deriv function and alerting on positive values, although using offset seems to work too (and more consistently?).
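
For illustration, a rough sketch of both approaches in PromQL (the metric and label names are taken from the check that appears later in this task; the window sizes are only placeholders):

  # Per-second slope of the max replica lag over the last few minutes; alert when positive:
  deriv(kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad"}[5m]) > 0

  # offset variant: alert when the lag now is higher than it was 30 minutes ago:
  kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad"}
    > kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad"} offset 30m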

Event Timeline

I think this query should do it.

Now we just need to figure out how to make alerts these days...

> I think this query should do it.
>
> Now we just need to figure out how to make alerts these days...

To follow up on the IRC conversation re: alerting rules etc., the usual and supported way is to go via Icinga and check_prometheus for expressions such as these. Alternatively, for Grafana-based alerts we have grafana_alert (via Icinga). HTH!

So even though Alertmanager is upcoming, should we continue to use check_prometheus? (We can't use grafana_alert; the dashboard is templated.)

@fgiunchedi is there a way to somehow smooth this alert? In this deriv query, the positive spikes are brief and normal. I want to alert only if the value stays positive for a longer period of time, like 30 minutes or an hour. Does using min_over_time along with a subquery (as described here), i.e. a min_over_time(deriv(...)) query, make sense?
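
(For reference, a sketch of what that smoothed expression could look like as a Prometheus subquery; the 5m/30m/1m windows are only placeholders:)

  # min_over_time over a 30-minute subquery of the derivative, evaluated every 1m:
  # the result is only positive if the lag slope stayed positive for the whole window.
  min_over_time(
    deriv(kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad"}[5m])[30m:1m]
  ) > 0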

> So even though Alertmanager is upcoming, should we continue to use check_prometheus? (We can't use grafana_alert; the dashboard is templated.)

Yes, check_prometheus and Icinga would still be my recommendation at this point. We have most of the pieces in place for Alertmanager, so you can definitely start using it if you are comfortable being an early adopter. OTOH, if you just want an alert that works today, then check_prometheus is the way to go.

> @fgiunchedi is there a way to somehow smooth this alert? In this deriv query, the positive spikes are brief and normal. I want to alert only if the value stays positive for a longer period of time, like 30 minutes or an hour. Does using min_over_time along with a subquery (as described here), i.e. a min_over_time(deriv(...)) query, make sense?

I think in this case the easiest is likely to keep the query simple (i.e. with deriv only) and ask Icinga to transition the alert from SOFT to HARD state after e.g. 30 minutes (via the retries parameter of monitoring::check_prometheus) if one of the critical/warning thresholds has been breached. HTH!
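
(For illustration, that would mean checking a plain per-broker derivative, e.g. the expression from the manual check run further below, and letting Icinga's retries provide the "sustained for 30 minutes" behaviour:)

  # Per-broker lag slope only; the 30-minute requirement comes from Icinga retrying
  # the SOFT state via the retries parameter, not from the query itself.
  scalar(deriv(kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad",instance="kafka-jumbo1001:7800"}[2m]))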

Change 662005 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Alert if kafka max replica lag is steadily increasing

https://gerrit.wikimedia.org/r/662005

Change 662005 merged by Ottomata:
[operations/puppet@production] Alert if kafka max replica lag is steadily increasing

https://gerrit.wikimedia.org/r/662005

Change 665191 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Fix Kafka Broker Replica Max Lag is increasing alert

https://gerrit.wikimedia.org/r/665191

Change 665191 merged by Ottomata:
[operations/puppet@production] Fix Kafka Broker Replica Max Lag is increasing alert

https://gerrit.wikimedia.org/r/665191

@razzi ok FYI I've got this alert going now:
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=kafka-jumbo1008&service=Kafka+Broker+Replica+Max+Lag+is+increasing

https://grafana-rw.wikimedia.org/d/000000027/kafka?viewPanel=65&orgId=1&from=now-6h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&var-disk_device=All

It should only alert if the lag continuously increases for more than 30 minutes. I wanted to get this alert online before you finish T255973: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers, so we can see it in action. I'll leave this as is for the next couple of webrequest partitions you rebalance, and if we like it, we'll either remove the existing max lag alert or just bump its thresholds WAY high.

Also, FYI @herron and @colewhite since this will apply to the Kafka main and logging clusters too.

@Ottomata I have been seeing brief UNKNOWNs in Icinga like the following for various Kafka clusters (brief, but they keep recurring):

Screenshot from 2021-02-25 08-01-05.png (31 KB)

Weird, and it is very intermittent and seems to happen to all broker checks.

I just ran a bunch of manual check_prometheus_metric.py commands on alert1001, and I did get a single NaN result; re-running a few seconds later was fine:

14:08:23 [@alert1001:/usr/lib/nagios/plugins] $ ./check_prometheus_metric.py --url http://prometheus.svc.eqiad.wmnet/ops  -w 0.0 -c 0.1 -m gt 'scalar(deriv(kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad",instance="kafka-jumbo1001:7800"}[2m]))'
NaN

14:08:48 [@alert1001:/usr/lib/nagios/plugins] $ ./check_prometheus_metric.py --url http://prometheus.svc.eqiad.wmnet/ops  -w 0.0 -c 0.1 -m gt 'scalar(deriv(kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad",instance="kafka-jumbo1001:7800"}[2m]))'
(C)0.1 gt (W)0 gt 0

Very strange.

My hunch here is that the 2m range selector is quite close to the scrape interval (1m), so there might not be enough data from time to time; changing to e.g. 5m is worth a try and might avoid the NaN (we do support --nan-ok in check_prometheus_metric though, if that's needed).
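
(For example, the same expression with a wider range selector, so each evaluation spans several scrapes:)

  # Hypothetical widened version of the check expression; 5m covers ~5 scrapes at a 1m interval.
  scalar(deriv(kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad",instance="kafka-jumbo1001:7800"}[5m]))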

Change 666966 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Expand range of Kafka max replica lag slope alert

https://gerrit.wikimedia.org/r/666966

Change 666966 merged by Elukey:
[operations/puppet@production] Expand range of Kafka max replica lag slope alert

https://gerrit.wikimedia.org/r/666966

Change 667724 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] kafka: Disable alert for absolute max lag value and under-replicated partitions

https://gerrit.wikimedia.org/r/667724

Change 667724 merged by Razzi:
[operations/puppet@production] kafka: Disable alert for absolute max lag value

https://gerrit.wikimedia.org/r/667724