Description

Fundraising kafkatee collection broke when Analytics deployed kafka-jumbo100[789]. We'll need to adjust our configuration to accommodate the change, and investigate how much data was lost and whether it was important enough to backfill. Luckily this coincided with work we were doing in FR while campaigns were down, so it may be a non-issue.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | None | | T244211 Analytics Hardware for Fiscal Year 2019/2020
Resolved | | Ottomata | T252675 Add new kafka brokers kafka-jumbo100[789] to the jumbo-eqiad Kafka cluster
Resolved | | Jgreen | T254257 adjust fundraising firewalls and kafkatee configuration to accommodate new kafka brokers kafka-jumbo100[789]
Restricted Task | | |
Event Timeline
@Jgreen should we also add monitoring to prevent this from happening again?
For example, in Grafana I see some spikes in Kafka consumer lag for frack, but nothing really sustained for hours. What errors did you see on your side? If timeouts, I'd guess kafkatee on your side was trying to contact 1007-1009 acting as consumer group leaders, failing, and possibly getting re-assigned to another broker?
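The distinction above, transient lag spikes versus lag sustained for hours, is exactly what an alert for this would need to encode. A minimal sketch in Python of such a check; the threshold and window values are illustrative assumptions, not actual frack alerting config:

```python
from typing import List, Tuple


def sustained_lag(samples: List[Tuple[float, int]],
                  lag_threshold: int,
                  min_duration_s: float) -> bool:
    """Return True if consumer lag stayed above lag_threshold for at
    least min_duration_s. samples are (unix_timestamp, lag) pairs,
    assumed sorted by timestamp (e.g. scraped from a lag exporter)."""
    run_start = None
    for ts, lag in samples:
        if lag > lag_threshold:
            if run_start is None:
                run_start = ts  # start of a continuous over-threshold run
            if ts - run_start >= min_duration_s:
                return True
        else:
            run_start = None  # lag recovered; reset the run
    return False


# A brief spike should not fire; a long plateau should.
spike = [(0, 10), (60, 50000), (120, 10)]
plateau = [(0, 50000), (1800, 50000), (3600, 50000), (7200, 50000)]
print(sustained_lag(spike, 1000, 3600))    # False
print(sustained_lag(plateau, 1000, 3600))  # True
```

An alert shaped like this would have stayed quiet during the transient spikes mentioned above but fired on a multi-hour outage like this one.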
We didn't have any major campaign activity when this happened, so what we noticed was dropped connection attempts to the three new hosts in the firewall logs. I agree monitoring would be good, but for a predictable change like this, advance notice would be better.
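The fix tracked in T254257 amounts to opening the fundraising firewall to the three new brokers. A rough sketch of what that looks like in plain iptables terms; the `.eqiad.wmnet` hostnames and TLS port 9093 are assumptions, and in practice the frack firewall is managed through configuration management rather than ad-hoc rules like these:

```shell
# Illustrative only: allow outbound Kafka (TLS port 9093 assumed) from a
# kafkatee host to the three new jumbo brokers. The real change belongs
# in the managed firewall config, not in a shell loop.
for broker in kafka-jumbo1007 kafka-jumbo1008 kafka-jumbo1009; do
    iptables -A OUTPUT -p tcp -d "${broker}.eqiad.wmnet" --dport 9093 \
        -j ACCEPT
done
```

The broader point stands regardless of mechanism: broker additions are predictable, so the firewall change can land before the brokers start taking traffic.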