
Modify Kafka max replica lag alert to only alert if increasing
Closed, ResolvedPublic

Description

We currently graph and alert on Kafka max replica lag if it goes above a hardcoded threshold.

We are currently balancing Kafka partitions in Kafka jumbo-eqiad to spread the load to the new brokers. To do this, Kafka temporarily adds the new brokers to a partition's replica list, which causes them to start consuming that partition from the leader from its beginning. When that happens, their replica lag suddenly jumps to the maximum value for that partition (however many messages are currently stored). For large partitions this can be pretty huge, causing false alerts to fire.

Instead of alerting on a hardcoded max lag threshold, we should just alert if the lag is increasing over some time window.

I believe this should be possible using the Prometheus deriv function and alerting on positive values, although using offset seems to work too (and more consistently?).
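
For illustration, a rough sketch of both approaches in PromQL (the metric and label names are taken from the check that appears later in this task; the window sizes are only placeholders):

  # Per-second slope of the max replica lag over the last few minutes; alert when positive:
  deriv(kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad"}[5m]) > 0

  # offset variant: alert when the lag now is higher than it was 30 minutes ago:
  kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad"}
    > kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad"} offset 30m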

Event Timeline

I think this query should do it.

Now we just need to figure out how to make alerts these days...

> I think this query should do it.
>
> Now we just need to figure out how to make alerts these days...

To follow up on the IRC conversation re: alerting rules etc., the usual and supported way is to go via Icinga and check_prometheus for expressions such as these. Alternatively, for Grafana-based alerts we have grafana_alert (via Icinga). HTH!

So even though Alertmanager is upcoming, should we continue to use check_prometheus? (We can't use grafana_alert; the dashboard is templated.)

@fgiunchedi is there a way to somehow smooth this alert? In this deriv query, the positive spikes are brief and normal. I want to alert only if the value stays positive for a longer period of time, like 30 minutes or an hour. Does using min_over_time along with a subquery (as described here), i.e. a min_over_time(deriv(...)) query, make sense?
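
(For reference, a sketch of what that smoothed expression could look like as a Prometheus subquery; the 5m/30m/1m windows are only placeholders:)

  # min_over_time over a 30-minute subquery of the derivative, evaluated every 1m:
  # the result is only positive if the lag slope stayed positive for the whole window.
  min_over_time(
    deriv(kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad"}[5m])[30m:1m]
  ) > 0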

> So even though Alertmanager is upcoming, should we continue to use check_prometheus? (We can't use grafana_alert; the dashboard is templated.)

Yes, check_prometheus and Icinga would still be my recommendation at this point. We have most of the pieces in place for Alertmanager, so you can definitely start using it if you are comfortable being an early adopter. OTOH, if you just want an alert that works today, then check_prometheus is the way to go.

> @fgiunchedi is there a way to somehow smooth this alert? In this deriv query, the positive spikes are brief and normal. I want to alert only if the value stays positive for a longer period of time, like 30 minutes or an hour. Does using min_over_time along with a subquery (as described here), i.e. a min_over_time(deriv(...)) query, make sense?

I think in this case the easiest is likely to keep the query simple (i.e. with deriv only) and ask Icinga to transition the alert from SOFT to HARD state after e.g. 30 minutes (via the retries parameter of monitoring::check_prometheus) if one of the critical/warning thresholds has been breached. HTH!
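
(For illustration, that would mean checking a plain per-broker derivative, e.g. the expression from the manual check run further below, and letting Icinga's retries provide the "sustained for 30 minutes" behaviour:)

  # Per-broker lag slope only; the 30-minute requirement comes from Icinga retrying
  # the SOFT state via the retries parameter, not from the query itself.
  scalar(deriv(kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad",instance="kafka-jumbo1001:7800"}[2m]))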

Change 662005 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Alert if kafka max replica lag is steadily increasing

https://gerrit.wikimedia.org/r/662005

Change 662005 merged by Ottomata:
[operations/puppet@production] Alert if kafka max replica lag is steadily increasing

https://gerrit.wikimedia.org/r/662005

Change 665191 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Fix Kafka Broker Replica Max Lag is increasing alert

https://gerrit.wikimedia.org/r/665191

Change 665191 merged by Ottomata:
[operations/puppet@production] Fix Kafka Broker Replica Max Lag is increasing alert

https://gerrit.wikimedia.org/r/665191

@razzi ok FYI I've got this alert going now:
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=kafka-jumbo1008&service=Kafka+Broker+Replica+Max+Lag+is+increasing

https://grafana-rw.wikimedia.org/d/000000027/kafka?viewPanel=65&orgId=1&from=now-6h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&var-disk_device=All

It should only alert if the lag continuously increases for more than 30 minutes. I wanted to get this alert online before you finish T255973: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers, so we can see it in action. I'll leave this as is for the next couple of webrequest partitions you rebalance, and if we like it, we'll either remove the existing max lag alert or just bump its thresholds WAY high.

Also, FYI @herron and @colewhite since this will apply to the Kafka main and logging clusters too.

@Ottomata I have been seeing brief UNKNOWNs in Icinga like the following for various Kafka clusters (brief, but they keep recurring):

Screenshot from 2021-02-25 08-01-05.png (31 KB)

Weird, and it is very intermittent and seems to happen to all broker checks.

I just ran a bunch of manual check_prometheus_metric.py commands on alert1001, and I did get a single NaN result; re-running a few seconds later was fine:

14:08:23 [@alert1001:/usr/lib/nagios/plugins] $ ./check_prometheus_metric.py --url http://prometheus.svc.eqiad.wmnet/ops  -w 0.0 -c 0.1 -m gt 'scalar(deriv(kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad",instance="kafka-jumbo1001:7800"}[2m]))'
NaN

14:08:48 [@alert1001:/usr/lib/nagios/plugins] $ ./check_prometheus_metric.py --url http://prometheus.svc.eqiad.wmnet/ops  -w 0.0 -c 0.1 -m gt 'scalar(deriv(kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad",instance="kafka-jumbo1001:7800"}[2m]))'
(C)0.1 gt (W)0 gt 0

Very strange.

My hunch here is that the 2m range selector is quite close to the scrape interval (1m), so there might not be enough data from time to time; changing to e.g. 5m is worth a try and might avoid the NaN (we do support --nan-ok in check_prometheus_metric though, if that's needed).
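
(For example, the same expression with a wider range selector, so each evaluation spans several scrapes:)

  # Hypothetical widened version of the check expression; 5m covers ~5 scrapes at a 1m interval.
  scalar(deriv(kafka_server_ReplicaFetcherManager_MaxLag{kafka_cluster="jumbo-eqiad",instance="kafka-jumbo1001:7800"}[5m]))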

Change 666966 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Expand range of Kafka max replica lag slope alert

https://gerrit.wikimedia.org/r/666966

Change 666966 merged by Elukey:
[operations/puppet@production] Expand range of Kafka max replica lag slope alert

https://gerrit.wikimedia.org/r/666966

Change 667724 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] kafka: Disable alert for absolute max lag value and under-replicated partitions

https://gerrit.wikimedia.org/r/667724

Change 667724 merged by Razzi:
[operations/puppet@production] kafka: Disable alert for absolute max lag value

https://gerrit.wikimedia.org/r/667724