During an outage of the main-kafka cluster, the producer (Event-Platform) and the consumers (ChangeProp and WMF-JobQueue) were switched to the backup data center within minutes by simply depooling the primary data center for eventbus. For EventStreams, however, the same outage caused a service outage: both the eqiad and codfw instances are currently configured to consume events from eqiad, and the switchover required a puppet patch and a puppet run across the cluster. Moreover, switching back to eqiad disrupted clients because the offsets were reset, so clients could not reconnect.
We need to find ways to improve the situation and allow automatic switchover. One option would be to deprecate by-offset reconnection, support only by-timestamp reconnection, and lower expectations about duplicated or lost events during a switchover.
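
To illustrate why by-timestamp reconnection could survive a data center switch where by-offset reconnection does not, here is a minimal, self-contained sketch. It does not use the real EventStreams or Kafka client code: the `Record` type, the two in-memory logs, and the helper names are all hypothetical, and the sketch assumes that the same events exist in both clusters with different offsets and with timestamps that never decrease within a partition. The lookup mirrors the semantics of a Kafka timestamp lookup (earliest offset whose timestamp is greater than or equal to the requested one).

```python
from bisect import bisect_left
from typing import List, NamedTuple, Optional


class Record(NamedTuple):
    offset: int        # cluster-specific position in the partition
    timestamp_ms: int  # event timestamp, assumed identical in both clusters
    payload: str


def first_offset_at_or_after(log: List[Record], timestamp_ms: int) -> Optional[int]:
    """Return the earliest offset whose timestamp is >= timestamp_ms, or None."""
    # Assumes timestamps are non-decreasing within the log.
    timestamps = [r.timestamp_ms for r in log]
    i = bisect_left(timestamps, timestamp_ms)
    return log[i].offset if i < len(log) else None


def read_from(log: List[Record], offset: int) -> List[str]:
    """Return payloads of all records at or after the given offset."""
    return [r.payload for r in log if r.offset >= offset]


# Hypothetical data: the same four events, stored at different offsets.
eqiad_log = [Record(100, 1000, "a"), Record(101, 2000, "b"),
             Record(102, 3000, "c"), Record(103, 4000, "d")]
codfw_log = [Record(7, 1000, "a"), Record(8, 2000, "b"),
             Record(9, 3000, "c"), Record(10, 4000, "d")]

# The client last received event "b" from eqiad, then eqiad went away.
last_seen = eqiad_log[1]

# Resuming in codfw with the eqiad offset: offset 102 does not exist there,
# so the client gets nothing and silently misses "c" and "d".
by_offset = read_from(codfw_log, last_seen.offset + 1)

# Resuming in codfw with the timestamp: the position is recomputed locally.
start = first_offset_at_or_after(codfw_log, last_seen.timestamp_ms)
by_timestamp = [] if start is None else read_from(codfw_log, start)

print(by_offset)
print(by_timestamp)
```

In this sketch the timestamp-based resume redelivers the last event the client had already seen ("b"), which is the kind of duplication the proposal accepts in exchange for positions that remain meaningful in either data center.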