Kafka mirror maker codfw -> eqiad in warning state for low consumer throughput
Open, Needs Triage · Public

Description

Hi everybody,

not sure if this is a monitoring tuning problem or something different, but I noticed two warnings on icinga:

Kafka MirrorMaker main-codfw_to_main-eqiad average message consume rate in last 30m
WARNING	2020-11-18 09:50:27	3d 18h 1m 39s	5/5	19.63 le 100	

Kafka MirrorMaker main-codfw_to_main-eqiad average message produce rate in last 30m
WARNING	2020-11-18 09:50:19	0d 2h 55m 33s	5/5	19.72 le 100

The grafana dashboard shows a sharp dip around 2020-11-14 15:00 UTC, which is a little strange. After some digging, this seems to be related to the codfw.change-prop.transcludes.resource-change topic (see the grafana dashboard).

Initially I thought it was related to T267865, so I did a rolling restart of all mirror makers on kafka100[1-3], but with more caffeine I realized that the outage happened the day after.

I am a little ignorant about how events end up in codfw's change-prop topics, so this could be a red herring or an expected use case; if so, let's tune the alert :)
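
For anyone not used to the check output above: `19.63 le 100` appears to mean that the 30m average rate (19.63) is less than or equal to the warning threshold (100). Below is a minimal sketch of that comparison, not the actual icinga/prometheus check. The function name `check_rate`, the sample values, and the assumption that the rate is in messages per second are all hypothetical.

```python
from statistics import mean


def check_rate(samples, warning=100.0):
    """Classify an average message rate against a 'le' (<=) threshold.

    samples: rate samples collected over the window (e.g. the last 30m);
             hypothetical input, not the real check's data source.
    warning: threshold; an average at or below this value is a WARNING.
    """
    if not samples:
        # No data at all is not the same as a low rate.
        return ("UNKNOWN", None)
    avg = mean(samples)
    state = "WARNING" if avg <= warning else "OK"
    return (state, avg)


if __name__ == "__main__":
    # Made-up samples around the value reported in the alert.
    state, avg = check_rate([19.5, 19.7, 19.69])
    print(state, round(avg, 2))
```

If low throughput on this mirror turns out to be expected, tuning the alert would amount to lowering the `warning` value in the real check.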

Event Timeline

elukey created this task. Nov 18 2020, 9:57 AM
Restricted Application added a subscriber: Aklapper. Nov 18 2020, 9:57 AM
elukey renamed this task from "Kafka mirror maker codfw -> eqiad in warning state for low throughput" to "Kafka mirror maker codfw -> eqiad in warning state for low consumer throughput". Nov 18 2020, 9:58 AM
elukey added a subscriber: Ottomata.

A restart of changeprop in codfw fixed this issue - we need to look into why the changeprop subscriber died or stopped processing.

Ok, so this wasn't a MirrorMaker issue then? changeprop was actually producing fewer messages?

Seems it has gotten stuck, yeah..