Not sure if this is a monitoring tuning problem or something different, but I noticed two warnings in Icinga:
Kafka MirrorMaker main-codfw_to_main-eqiad average message consume rate in last 30m: WARNING, last check 2020-11-18 09:50:27, duration 3d 18h 1m 39s, attempt 5/5, status "19.63 le 100"
Kafka MirrorMaker main-codfw_to_main-eqiad average message produce rate in last 30m: WARNING, last check 2020-11-18 09:50:19, duration 0d 2h 55m 33s, attempt 5/5, status "19.72 le 100"
The Grafana dashboard shows a sharp dip around 2020-11-14 15:00 UTC, which is a little strange. After some digging, this seems to be related to the codfw.change-prop.transcludes.resource-change topic (see the Grafana dashboard).
Initially I thought it was related to T267865, so I did a rolling restart of all MirrorMaker instances on kafka100[1-3], but with more caffeine I realized that the outage happened the day after.
I am a little ignorant about how events end up in codfw's change-prop topics, so this could be a red herring or an expected use case. If so, let's tune the alert :)
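For the tuning discussion, here is a rough sketch of how I read the "19.63 le 100" status text: the check seems to go to WARNING when the 30m average rate is less than or equal to a threshold of 100. This is only my interpretation of the alert output, not the actual check code. The function name, the optional critical threshold, and the example value of 10 passed to it are made up for illustration.

```python
from typing import Optional


def classify_rate(avg_rate: float,
                  warning_le: float = 100.0,
                  critical_le: Optional[float] = None) -> str:
    """Classify an average message rate against 'less than or equal' thresholds.

    Assumption: the alert fires when the rate is at or below the threshold,
    which is how the "19.63 le 100" status text reads. The critical threshold
    is hypothetical; the alert output above does not show one.
    """
    if critical_le is not None and avg_rate <= critical_le:
        return "CRITICAL"
    if avg_rate <= warning_le:
        return "WARNING"
    return "OK"


# The two values reported in the Icinga status text above.
print(classify_rate(19.63))  # consume rate
print(classify_rate(19.72))  # produce rate

# With a lower (hypothetical) warning threshold the same rates would not alert.
print(classify_rate(19.63, warning_le=10.0))
```

If the low rate on this MirrorMaker instance turns out to be expected, lowering the warning threshold (the `warning_le` stand-in here) would be the knob to turn.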