We currently graph and alert on Kafka max replica lag if it goes above a hardcoded threshold.
We are currently balancing Kafka partitions in Kafaka jumbo-eqiad to spread the load to new brokers. To do this, Kafka will temporarily add new brokers into a partitions replica list, which will cause them to start consuming from the leader for a partition from its beginning. When it does this, its replica lag suddenly jumps to the max value for that partition (however many messages are currently stored). For large partitions, this can be pretty huge, and cause false alerts to fire.
Instead of alerting on a hardcoded max lag threshold, we should just alert if the lag is increasing over some time window.
I believe this should be possible using the Prometheus deriv function and alerting on positive values, although using offset seems to work too (and more consistently?).