https://kafka.apache.org/documentation/#upgrade_1_1_0
This task covers upgrading the Kafka main clusters to 1.x. T193778 covers enabling SSL and inter-broker encryption once the upgrade is complete.
# Prep Work
[x] Convert Kafka main clusters to use `profile::kafka::broker`
[x] Upgrade Kafka main clusters to Debian Stretch and Java 8.
[x] Test upgrade plan in deployment-prep, ensure Kafka clients work there.
[x] On all brokers, set:
```
inter.broker.protocol.version=0.9.0.1
log.message.format.version=0.9.0.1
```
[] Ensure all main MirrorMaker instances are stopped via puppet: https://gerrit.wikimedia.org/r/#/c/431588/
NOTE: The 1.x version of MirrorMaker will not work when consuming from a 0.9 cluster, so all main MirrorMaker instances stay stopped until the upgrade in both DCs is complete.
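Before the package upgrade starts, it may help to confirm on each broker that both protocol settings from the checklist are really pinned. The following is a minimal sketch, not part of the existing tooling: `protocol_pinned` is a hypothetical helper name, and the properties file path in the usage comment is an assumption about where the broker config lives.

```shell
# Hypothetical helper: succeeds only if both protocol settings in the given
# properties file are set to exactly the expected version.
# Usage (path is an assumption):
#   protocol_pinned /etc/kafka/server.properties 0.9.0.1 && echo "pinned"
protocol_pinned() {
    file="$1"
    want="$2"
    for key in inter.broker.protocol.version log.message.format.version; do
        # -x matches the whole line, -F treats the pattern as a fixed string
        grep -qxF "${key}=${want}" "$file" || return 1
    done
}
```

The check is deliberately strict: a commented-out line or a line with trailing whitespace counts as not pinned, which errs on the side of stopping before the upgrade.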
# Production upgrade plan
This upgrade requires 3 rolling restarts of each broker in a Kafka cluster, one per step:
1. Upgrade the Kafka package.
2. Set `inter.broker.protocol.version=1.1.0`.
3. Set `log.message.format.version` to the default (1.1.0) and enable the SSL port.
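The ordering matters: each step is rolled across every broker in the cluster before the next step begins. As an illustration only, the sketch below prints the restart order for a list of brokers; `rolling_plan` is a hypothetical name, and it does not touch any host.

```shell
# Hypothetical helper: prints one line per restart, in the order the
# restarts should happen (all brokers for step 1, then step 2, then step 3).
# Usage: rolling_plan kafka2001 kafka2002 kafka2003
rolling_plan() {
    for phase in "upgrade package" \
                 "set inter.broker.protocol.version=1.1.0" \
                 "set default log.message.format.version"; do
        for broker in "$@"; do
            printf '%s: restart %s\n' "$phase" "$broker"
        done
    done
}
```

For a three-broker cluster this yields nine restarts in total.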
### main-codfw
#### upgrade
Set downtime and stop puppet on all main-codfw brokers.
```
# On einsteinium:
for h in kafka2001 kafka2002 kafka2003; do
sudo icinga-downtime -d 7200 -r "Kafka upgrade T167039" -h $h
done
# On neodymium:
sudo cumin 'kafka200*' "puppet agent --disable '$USER - Kafka upgrade'"
```
1. For each broker: upgrade and restart Kafka, still using `inter.broker.protocol.version=0.9.0.1`.
```
sudo service kafka stop
sudo apt-get remove confluent-kafka-2.11.7
sudo apt-get install confluent-kafka-2.11
# remove unwanted systemd units and directories:
sudo rm -rv /var/log/confluent /var/lib/kafka /var/lib/zookeeper /lib/systemd/system/confluent*.service && sudo systemctl daemon-reload && sudo systemctl reset-failed
sudo service kafka start
# wait until broker is back up and in ISRs, initiate election:
watch "kafka topics --describe --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
kafka preferred-replica-election
# Now proceed with next broker...
```
2. Merge https://gerrit.wikimedia.org/r/#/c/430449/. For each broker, run puppet to set `inter.broker.protocol.version=1.1.0` and restart Kafka.
```
sudo puppet agent --enable && sudo puppet agent -t
sudo service kafka restart
# wait until broker is back up and in ISRs, initiate election:
watch "kafka topics --describe --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
kafka preferred-replica-election
# Now proceed with next broker...
```
3. Merge https://gerrit.wikimedia.org/r/#/c/430450/. For each broker, run puppet to set the default `log.message.format.version` and restart Kafka:
```
sudo puppet agent -t
sudo service kafka restart
# wait until broker is back up and in ISRs, initiate election:
watch "kafka topics --describe --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
kafka preferred-replica-election
# Now proceed with next broker...
```
Broker upgrade is complete! Remove the client-specific `api.version` settings; they are no longer needed for eventbus and statsv.
4. Merge https://gerrit.wikimedia.org/r/#/c/430640/ and restart services:
```
# on eventbus (kafka main) hosts, rolling restart each eventbus service
sudo puppet agent -t
depool && sudo service eventlogging-service-eventbus restart && sleep 3 && pool
# on kafkamon2001
sudo puppet agent -t
sudo service burrow-main-codfw restart
```
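The "wait until broker is back up and in ISRs" step in the restarts above watches a single topic by eye. A stricter check could look at every partition that the restarted broker replicates. The sketch below is one way to do that, assuming the describe output has the usual `Replicas: ...` and `Isr: ...` fields on each partition line; `broker_in_sync` is a hypothetical helper, not an existing command.

```shell
# Hypothetical helper: reads `kafka topics --describe` output on stdin and
# exits 0 only if broker $1 is in the Isr list of every partition that
# lists it under Replicas.
# Usage: kafka topics --describe | broker_in_sync 1001 && kafka preferred-replica-election
broker_in_sync() {
    awk -v id="$1" '
        /Replicas:/ && /Isr:/ {
            replicas = ""; isr = ""
            # Each label is followed by a comma-separated broker ID list.
            for (i = 1; i < NF; i++) {
                if ($i == "Replicas:") replicas = $(i + 1)
                if ($i == "Isr:")      isr = $(i + 1)
            }
            # Wrap lists in commas so 1001 cannot match inside 11001.
            if (("," replicas ",") ~ ("," id ",") && ("," isr ",") !~ ("," id ","))
                missing++
        }
        END { exit (missing > 0) }
    '
}
```

Partitions that the broker does not replicate are ignored, so the check passes as soon as the broker has caught up on its own partitions.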
### main-eqiad
TODO: Same as above, but with different gerrit patches for main-eqiad
## Post upgrade:
After both clusters are fully upgraded, remove the `api.version` setting for statsv:
Revert https://gerrit.wikimedia.org/r/#/c/429432/1/statsv.py and restart statsv.
Start all main MirrorMaker instances on Kafka 1.1.0. On each cluster and broker, revert //GERRIT_PATCH_TBD// and run puppet.