Per #serviceops request, all hosts being imaged and set up by DC Ops will have a sub-task for tracking the service ops implementation.
This task is for #service-ops; all questions regarding the status of the hosts should be directed to the parent task T363209. Once that task is resolved, this work can take place.
Replacement plan (for each node in the cluster):
- Add new kafka nodes to the list of IPs in `kafka_brokers_main` (`hieradata/common.yaml`)
- Run puppet on conf nodes (update zookeeper firewall with the new IPs)
- Ensure that [[ https://grafana-rw.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=thanos&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&viewPanel=29 | Under Replicated Partitions ]] is 0
- Downtime the new and old nodes
- Silence `KafkaUnderReplicatedPartitions` for the cluster
- Stop kafka and kafka-mirror and disable puppet on the old node:
`sudo cumin kafka-main2001.codfw.wmnet 'disable-puppet "Hardware refresh - T363210"; systemctl stop kafka-mirror.service kafka.service'`
- Assign the kafka-id of the old node to the new node in `hieradata/common.yaml`, assign `kafka::main` role to new node
- Run puppet on the new node
- Roll-restart kafka on all other brokers of the cluster (from `sre.kafka.roll-restart-reboot-brokers`) so they read the updated config:
`/usr/local/bin/kafka-broker-in-sync && systemctl restart kafka && source /etc/profile.d/kafka.sh; kafka preferred-replica-election`
- Wait until [[ https://grafana-rw.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=thanos&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&viewPanel=29 | Under Replicated Partitions ]] is 0 again
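
Several steps above gate on a check passing (under-replicated partitions at 0, broker in sync) before moving on. A minimal sketch of a polling helper for that pattern follows. The `wait_until` function is hypothetical and not part of the cookbook or of any existing tooling. The commented usage line assumes `/usr/local/bin/kafka-broker-in-sync` exits non-zero while the broker is out of sync, which the `&&` chaining in the restart command suggests but which should be confirmed before relying on it.

```shell
# wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0,
# giving up (exit status 1) after MAX_ATTEMPTS tries.
wait_until() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Possible use on a broker (assumption: the check's exit status reflects sync state):
#   wait_until 30 10 /usr/local/bin/kafka-broker-in-sync && systemctl restart kafka
```

Bounding the number of attempts means a broker that never catches up makes the step fail visibly instead of blocking the roll restart indefinitely.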