Replacing kafka-main2001 (T363210) with kafka-main2006 caused an eventgate-main outage. The process used was the following:
Replacement plan (for each node in the cluster):
- Add new kafka node IPs to the list of IPs in `kafka_brokers_main` (`hieradata/common.yaml`)
- Run puppet on conf nodes (update zookeeper firewall with the new IPs)
- Ensure that [[ https://grafana-rw.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=thanos&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&viewPanel=29 | Under Replicated Partitions ]] is 0
- Downtime new and old node
- Silence `KafkaUnderReplicatedPartitions` for the cluster
- Stop kafka and kafka-mirror and disable puppet on the old broker:
`sudo cumin kafka-main2001.codfw.wmnet 'disable-puppet "Hardware refresh - T363210"; systemctl stop kafka-mirror.service kafka.service'`
- Assign the kafka-id of the old node to the new node in `hieradata/common.yaml`, assign `kafka::main` role to new node
- Run puppet on the new node
- Roll restart kafka on all other brokers of the cluster (from `sre.kafka.roll-restart-reboot-brokers`) to read updated config:
`/usr/local/bin/kafka-broker-in-sync && systemctl restart kafka && source /etc/profile.d/kafka.sh; kafka preferred-replica-election`
- Wait until [[ https://grafana-rw.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=thanos&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&viewPanel=29 | Under Replicated Partitions ]] is 0 again
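The two "Under Replicated Partitions is 0" checks in the plan above could also be done from a shell instead of watching the dashboard. This is a minimal sketch, assuming that the `kafka` wrapper forwards `topics --describe --under-replicated-partitions` to `kafka-topics` (not verified here), which prints one line per under-replicated partition. `URP_CMD`, `count_urp` and `wait_for_zero_urp` are hypothetical names introduced for illustration only.

```lang=bash
#!/usr/bin/env bash
# Sketch only: poll until no partition is reported as under-replicated.
# ASSUMPTION: the default URP_CMD works through the local `kafka` wrapper.
URP_CMD=${URP_CMD:-"kafka topics --describe --under-replicated-partitions"}

count_urp() {
    local out
    # Fail loudly if the query itself fails, so an error is not mistaken for "0".
    out=$($URP_CMD) || return 1
    # Count non-empty lines; grep -c exits non-zero on zero matches, hence || true.
    grep -c . <<<"$out" || true
}

wait_for_zero_urp() {
    local interval=${1:-30} max_tries=${2:-120} i urp
    for ((i = 1; i <= max_tries; i++)); do
        urp=$(count_urp) || { echo "could not query under-replicated partitions" >&2; return 2; }
        if [[ "$urp" -eq 0 ]]; then
            echo "no under-replicated partitions"
            return 0
        fi
        echo "attempt $i: $urp under-replicated partitions, sleeping ${interval}s"
        sleep "$interval"
    done
    echo "giving up after $max_tries attempts" >&2
    return 1
}
```

Calling `wait_for_zero_urp 30 120` would poll every 30 seconds for up to an hour. The Grafana panel remains the reference; this only saves a manual check.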
We concluded that there are three things that could be improved in the process above:
# Throttle the bandwidth used to resync the replaced broker so that its link is not saturated
# Remove the leadership for all topics from the to-be-replaced broker before stopping kafka there (and restore it after the new broker has fully synced up)
# Don't run `kafka preferred-replica-election` during the roll-restart of the remaining kafka brokers
# Bonus points for not having to restart kafka on the remaining brokers at all
- [[ https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89071740 | ssl.principal.mapping.rules ]] is only available in kafka >=2.4, so it won't work.
- It seems to me like we're using the same cert for authentication and for encrypting traffic, so creating a certificate with the same CN on all brokers and listing just that one CN in `super.users` is not an option either (but I don't understand this well enough).
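For improvement 2, one conceivable approach (an assumption, not something we have tested) is to reorder the replica lists so that the to-be-replaced broker is no longer the first, i.e. preferred, replica of any partition, and to apply the new order with a partition reassignment. The sketch below only shows the reordering step for a single replica list. `demote_broker` is a hypothetical helper and the broker IDs are placeholders.

```lang=bash
#!/usr/bin/env bash
# Sketch only: move one broker id to the end of a comma-separated replica list,
# so that it is no longer the preferred (first) replica.
# Lists that do not contain the broker are returned unchanged.
demote_broker() {
    local replicas=$1 broker=$2 id found=0
    local -a ids=() kept=()
    IFS=',' read -ra ids <<<"$replicas"
    for id in "${ids[@]}"; do
        if [[ "$id" == "$broker" ]]; then
            found=1
        else
            kept+=("$id")
        fi
    done
    if ((found)); then
        kept+=("$broker")
    fi
    local IFS=','
    echo "${kept[*]}"
}

# Placeholder ids: broker 2001 is moved from first to last position.
demote_broker "2001,2002,2003" 2001
```

The resulting lists would still have to be turned into a reassignment plan and applied, and the original order restored once the new broker has caught up.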
Regarding 1.): we already tested [[ https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas | KIP-73 Replication Quotas ]], namely `follower.replication.throttled.replicas` and `follower.replication.throttled.rate`, on kafka-jumbo, without success.
```lang=bash
# Add an ACL allowing us to alter the cluster config
kafka acls --add --allow-host '*' --cluster --operation AlterConfigs --allow-principal User:ANONYMOUS
# Set follower.replication.throttled.rate for broker-id 1015
kafka-configs --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS --entity-type brokers --entity-name 1015 --alter \
--add-config 'follower.replication.throttled.rate=300000000'
# This change was accepted, but not reflected in the output of:
kafka-configs --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS --entity-type brokers --entity-name 1015 --describe
# It did show up in the output of:
kafka configs --describe --entity-type brokers | grep 1015
# Setting both follower.replication.throttled.rate and follower.replication.throttled.replicas also had no visible effect
kafka-configs --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS --entity-type brokers --entity-name 1015 --alter \
--add-config 'follower.replication.throttled.replicas=1015,follower.replication.throttled.rate=2000000'
# Settings could be deleted via
kafka-configs --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS --entity-type brokers --entity-name 1015 --alter \
--delete-config 'follower.replication.throttled.replicas,follower.replication.throttled.rate'
```
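One possible explanation for the missing effect, going by the KIP-73 description: `follower.replication.throttled.rate` is a broker-level setting, while `follower.replication.throttled.replicas` is a topic-level setting (a list of `partition:broker` pairs, or the wildcard `*`), so setting it on a broker entity may simply be ignored. The dry-run sketch below only prints the two calls that would follow from that reading. It assumes that topic configs have to be altered via `--zookeeper` on our Kafka version. `ZK_CONNECT`, `throttle_cmds` and the topic name are placeholders, and none of this has been tested.

```lang=bash
#!/usr/bin/env bash
# Dry-run sketch: print (do not execute) the kafka-configs calls.
# ASSUMPTION: ZK_CONNECT is a placeholder, not our real zookeeper chroot.
ZK_CONNECT=${ZK_CONNECT:-"zookeeper.example:2181/kafka"}

throttle_cmds() {
    local topic=$1 broker=$2 rate=$3
    # Topic level: mark all replicas of the topic as throttled on the follower side.
    echo "kafka-configs --zookeeper $ZK_CONNECT --entity-type topics --entity-name $topic --alter --add-config 'follower.replication.throttled.replicas=*'"
    # Broker level: cap replication traffic for throttled replicas (bytes per second).
    echo "kafka-configs --bootstrap-server \$KAFKA_BOOTSTRAP_SERVERS --entity-type brokers --entity-name $broker --alter --add-config 'follower.replication.throttled.rate=$rate'"
}

# Placeholder topic name, broker id and rate taken from the test above.
throttle_cmds some_topic 1015 300000000
```

If this reading is right, the topic-level call would have to be repeated for every topic with replicas on the broker, and both settings removed again after the resync.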
Further reading:
- https://docs.confluent.io/platform/current/kafka/post-deployment.html#limiting-bandwidth-usage-during-data-migration
- https://cwiki.apache.org/confluence/display/KAFKA/KIP-542%3A+Partition+Reassignment+Throttling