Replacing kafka-main2001 (T363210) with kafka-main2006 caused an eventgate-main outage. The process used was the following:
Replacement plan (for each node in the cluster):
- Add new kafka node IPs to the list of IPs in `kafka_brokers_main` (`hieradata/common.yaml`)
- Run puppet on conf nodes (update zookeeper firewall with the new IPs)
- Ensure that [[ https://grafana-rw.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=thanos&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&viewPanel=29 | Under Replicated Partitions ]] is 0
- Downtime new and old node
- Silence `KafkaUnderReplicatedPartitions` for the cluster
- Stop kafka and kafka-mirror and disable puppet on the old broker:
`sudo cumin kafka-main2001.codfw.wmnet 'disable-puppet "Hardware refresh - T363210"; systemctl stop kafka-mirror.service kafka.service'`
- Assign the kafka-id of the old node to the new node in `hieradata/common.yaml`, assign `kafka::main` role to new node
- Run puppet on the new node
- Run puppet on the deploy host and deploy the external-services update to all k8s clusters
- Roll restart kafka on all other brokers of the cluster (from `sre.kafka.roll-restart-reboot-brokers`) to read updated config:
`/usr/local/bin/kafka-broker-in-sync && systemctl restart kafka && source /etc/profile.d/kafka.sh; kafka preferred-replica-election`
- Wait until [[ https://grafana-rw.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=thanos&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&viewPanel=29 | Under Replicated Partitions ]] is 0 again
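The "Under Replicated Partitions is 0" checks above can also be done from the CLI instead of Grafana; a sketch, assuming the `kafka` wrapper from `/etc/profile.d/kafka.sh` is available on the broker:

```lang=bash
# Empty output means there are no under-replicated partitions,
# i.e. it is safe to proceed with the next broker.
source /etc/profile.d/kafka.sh
kafka topics --describe --under-replicated-partitions
```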
We concluded that there are several things that could be improved in the process above:
# Throttle the bandwidth used to resync the replaced broker, so as not to saturate its link
# Remove leadership for all topics from the to-be-replaced broker before stopping kafka there (and restore it after the broker has fully synced up)
# Don't run `kafka preferred-replica-election` during the roll-restart of the remaining kafka brokers
# Bonus points for not having to restart kafka on the remaining brokers at all
- [[ https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89071740 | ssl.principal.mapping.rules ]] requires kafka >= 2.4, so it won't work here.
- It seems to me like we're using the same certificate for authentication and for encrypting traffic, so creating a certificate with the same CN on all brokers and listing just that one CN in `super.users` is not an option either (but I don't understand this well enough to be certain).
# In order to avoid resync load on the kafka cluster we could evaluate if it is possible to rsync the data from old node to new node before making the new node a kafka broker (but after stopping kafka on the old node). That would saturate the NICs of those two nodes for some time but while they are not in service. Catching up should be fast then when the new node joins the cluster.
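A hypothetical sketch of the pre-seeding idea above (hostname and data directory are assumptions; the actual kafka `log.dirs` path must be checked on the hosts):

```lang=bash
# On the new node, after kafka has been stopped on the old node and
# before kafka has ever started on the new node. Copies the old
# broker's log segments so the new broker only has to catch up on
# data produced since the copy.
rsync -a --delete kafka-main2001.codfw.wmnet:/srv/kafka/data/ /srv/kafka/data/
```

This saturates only the NICs of the two out-of-service nodes, as described above.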
Regarding 1), we already tested [[ https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas | KIP-73 Replication Quotas ]] (namely `follower.replication.throttled.replicas` and `follower.replication.throttled.rate`) on kafka-jumbo, without success.
```lang=bash
# Add an ACL allowing us to alter the cluster config
kafka acls --add --allow-host '*' --cluster --operation AlterConfigs --allow-principal User:ANONYMOUS
# Set follower.replication.throttled.rate for broker-id 1015
kafka-configs --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS --entity-type brokers --entity-name 1015 --alter \
--add-config 'follower.replication.throttled.rate=300000000'
# This change was accepted, but was not reflected in the output of:
#   kafka-configs --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
#     --entity-type brokers --entity-name 1015 --describe
# It was, however, visible in:
#   kafka configs --describe --entity-type brokers | grep 1015
# Setting both follower.replication.throttled.rate and
# follower.replication.throttled.replicas also had no visible effect:
kafka-configs --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS --entity-type brokers --entity-name 1015 --alter \
--add-config 'follower.replication.throttled.replicas=1015,follower.replication.throttled.rate=2000000'
# Settings could be deleted via
kafka-configs --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS --entity-type brokers --entity-name 1015 --alter \
--delete-config 'follower.replication.throttled.replicas,follower.replication.throttled.rate'
```
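The Confluent documentation linked below applies the KIP-73 quotas through the partition reassignment tool rather than by setting broker configs directly; the tool sets both the rate and the throttled-replicas lists for the brokers involved. A sketch (the contents of `reassign.json` are assumed, and the `kafka` wrapper is assumed to pass the connection options):

```lang=bash
# Apply a 300 MB/s replication throttle while executing a reassignment.
kafka reassign-partitions --reassignment-json-file reassign.json \
    --execute --throttle 300000000
# --verify also removes the throttle configs once the reassignment has
# completed, so it must always be run afterwards.
kafka reassign-partitions --reassignment-json-file reassign.json --verify
```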
Further reads:
- https://docs.confluent.io/platform/current/kafka/post-deployment.html#limiting-bandwidth-usage-during-data-migration
- https://cwiki.apache.org/confluence/display/KAFKA/KIP-542%3A+Partition+Reassignment+Throttling