Replacing kafka-main2001 with kafka-main2006 (T363210) caused an eventgate-main outage. The process used was the following:
Replacement plan (for each node in the cluster):
- Add new kafka node IPs to the list of IPs in kafka_brokers_main (hieradata/common.yaml)
- Run puppet on conf nodes (update zookeeper firewall with the new IPs)
- Ensure that Under Replicated Partitions is 0
- Downtime new and old node
- Silence KafkaUnderReplicatedPartitions for the cluster
- Stop kafka and kafka-mirror and disable puppet on the old broker:
sudo cumin kafka-main2001.codfw.wmnet 'disable-puppet "Hardware refresh - T363210"; systemctl stop kafka-mirror.service kafka.service'
- Assign the kafka-id of the old node to the new node in hieradata/common.yaml, assign kafka::main role to new node
- Run puppet on the new node
- Run puppet on the deploy host and deploy the external-services update to all k8s clusters
- Roll restart kafka on all other brokers of the cluster (from sre.kafka.roll-restart-reboot-brokers) to read updated config:
/usr/local/bin/kafka-broker-in-sync && systemctl restart kafka && source /etc/profile.d/kafka.sh; kafka preferred-replica-election
- Wait until Under Replicated Partitions is 0 again
- Remove the old node from various kafka connection strings and deploy that change (like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1064758)
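The "Under Replicated Partitions is 0" checks above can be scripted. A minimal sketch, assuming the kafka wrapper from /etc/profile.d/kafka.sh passes the flag through to kafka-topics (where --under-replicated-partitions prints one line per under-replicated partition, so empty output means the cluster is in sync):

```shell
# Sketch only: run on a broker after sourcing /etc/profile.d/kafka.sh.
urp_output=$(kafka topics --describe --under-replicated-partitions 2>/dev/null || true)
if [ -z "$urp_output" ]; then
    echo "no under-replicated partitions, safe to proceed"
else
    echo "still syncing:"
    echo "$urp_output"
fi
```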
We concluded that several things in the process above could be improved:
- Throttle the bandwidth used to resync the replaced broker so as not to saturate its link
- Remove leadership for all topics from the broker to be replaced before stopping kafka there (and restore it after the broker has fully synced up)
- Don't run kafka preferred-replica-election during the roll-restart of the remaining kafka brokers
- Bonus points for not having to restart kafka on the remaining brokers at all
- ssl.principal.mapping.rules requires Kafka >= 2.4, so it won't work.
- It seems we are using the same certificate for authentication and for encrypting traffic, so creating a certificate with the same CN on all brokers and listing just that one CN in super.users is not an option either (but I don't understand this well enough to be certain).
- To avoid resync load on the kafka cluster, we could evaluate whether it is possible to rsync the data from the old node to the new node before making the new node a kafka broker (but after stopping kafka on the old node). That would saturate the NICs of those two nodes for some time, but while they are not in service; catching up should then be fast once the new node joins the cluster.
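For the leadership-removal idea, one possible approach (a sketch under assumptions, not a tested procedure): the preferred leader of a partition is the first broker in its replica list, so a kafka-reassign-partitions JSON that moves the old broker out of first position everywhere, followed by a preferred-replica election, drops its leadership without moving any data. The topic name and broker ids below are made up:

```shell
# Hypothetical reassignment: broker 2001 stays a replica (so no data movement),
# but is moved out of the first slot and thus is no longer the preferred leader.
cat > /tmp/demote-2001.json <<'EOF'
{
  "version": 1,
  "partitions": [
    {"topic": "eqiad.some-topic", "partition": 0, "replicas": [2002, 2003, 2001]}
  ]
}
EOF
# On a broker one would then run (not executed here):
#   kafka reassign-partitions --reassignment-json-file /tmp/demote-2001.json --execute
#   kafka preferred-replica-election
python3 -m json.tool < /tmp/demote-2001.json > /dev/null && echo "reassignment JSON is valid"
```

A full version would emit one such entry per partition the broker leads; reverting leadership afterwards is just the same reassignment with 2001 restored to its original positions plus another preferred-replica election.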
Regarding 1.), we already tested KIP-73 replication quotas, namely follower.replication.throttled.replicas and follower.replication.throttled.rate, on kafka-jumbo without success:
# Add an ACL allowing us to alter the cluster config
kafka acls --add --allow-host '*' --cluster --operation AlterConfigs --allow-principal User:ANONYMOUS

# Set follower.replication.throttled.rate for broker-id 1015
kafka-configs --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS --entity-type brokers --entity-name 1015 --alter \
  --add-config 'follower.replication.throttled.rate=300000000'

# This change was accepted, but not reflected in
kafka-configs --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS --entity-type brokers --entity-name 1015 --describe

# but it was in
kafka configs --describe --entity-type brokers | grep 1015

# Setting both follower.replication.throttled.rate and follower.replication.throttled.replicas also had no visible effect
kafka-configs --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS --entity-type brokers --entity-name 1015 --alter \
  --add-config 'follower.replication.throttled.replicas=1015,follower.replication.throttled.rate=2000000'

# Settings could be deleted via
kafka-configs --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS --entity-type brokers --entity-name 1015 --alter \
  --delete-config 'follower.replication.throttled.replicas,follower.replication.throttled.rate'
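A possible explanation for the lack of effect, based on the upstream KIP-73 documentation rather than further testing: the *.replication.throttled.rate settings are per-broker configs, but the *.replication.throttled.replicas lists are per-TOPIC configs of the form "partitionId:brokerId,..." (or "*" for all replicas of the topic), and the rate only throttles replicas that appear on such a list. Setting follower.replication.throttled.replicas=1015 on a broker entity would therefore throttle nothing. A sketch of the split (topic name is a placeholder; the commands are only assembled and printed here, not run):

```shell
# Per-BROKER: cap the throttled replication rate in bytes/s. On its own this
# does nothing until some replicas are marked as throttled at the topic level.
broker_config='follower.replication.throttled.rate=300000000'

# Per-TOPIC: mark which replicas the throttle applies to; "*" covers all
# replicas of the topic, otherwise use "partitionId:brokerId,...".
topic_config='follower.replication.throttled.replicas=*'

echo "kafka-configs --entity-type brokers --entity-name 1015 --alter --add-config $broker_config"
echo "kafka-configs --entity-type topics --entity-name eqiad.some-topic --alter --add-config $topic_config"
```

Note also that kafka-reassign-partitions has a --throttle option that sets the topic-level throttled.replicas lists automatically for the partitions being moved, which may be the easier path for a broker replacement.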
Further reading: