https://kafka.apache.org/documentation/#upgrade_1_1_0
This task is about upgrading the Kafka main clusters to 1.x. T193778 is about enabling SSL and inter-broker encryption after the upgrade is complete.
# Prep Work
[x] Convert Kafka main clusters to use `profile::kafka::broker`
[x] Upgrade Kafka main clusters to Debian Stretch and Java 8.
[x] Test upgrade plan in deployment-prep, ensure Kafka clients work there.
[x] On all brokers, set:
```
inter.broker.protocol.version=0.9.0.1
log.message.format.version=0.9.0.1
```
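Before starting the rolling restarts, it may be worth confirming on each broker that both settings are actually present in the rendered config. The following is a minimal sketch, not part of the agreed plan: `get_prop` is a hypothetical helper, and the path `/etc/kafka/server.properties` is an assumption about where the broker config lives.

```shell
# Hypothetical helper: print the last value assigned to a key in a
# Java-style properties file. $1 = key, $2 = properties file.
get_prop() {
  awk -F= -v k="$1" '
    $1 == k { sub(/^[^=]*=/, ""); v = $0 }   # keep everything after the first "="
    END { print v }
  ' "$2"
}

# Example usage (the path is an assumption, adjust to the real config location):
# get_prop inter.broker.protocol.version /etc/kafka/server.properties
# get_prop log.message.format.version /etc/kafka/server.properties
```

Both calls should print `0.9.0.1` before restart 1 begins.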
# Production upgrade plan
This upgrade requires 3 rolling restarts of each broker in a Kafka cluster, one for each of the following:
1. To upgrade the Kafka package
2. To set `inter.broker.protocol.version=1.1.0`
3. To set `log.message.format.version` to the default (1.1.0) and enable SSL port
In between restarts 2 and 3, we will update client api.version settings to allow for protocol negotiation.
[] Ensure all main MirrorMaker instances are stopped via puppet: https://gerrit.wikimedia.org/r/#/c/431588/. Set downtime on all Kafka MirrorMaker main-* alerts defined on einsteinium: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=Kafka+MirrorMaker
NOTE: The 1.x version of MirrorMaker will not work when consuming from a 0.9 cluster. We will keep all main MirrorMaker instances stopped until the upgrade in both DCs is complete.
### main-codfw
#### upgrade
Set downtime and disable puppet on all main-codfw brokers.
```
# On einsteinium:
for h in kafka2001 kafka2002 kafka2003; do
sudo icinga-downtime -d 7200 -r "Kafka upgrade T167039" -h $h
done
# On neodymium:
sudo cumin 'kafka200*' "puppet agent --disable '$USER - Kafka upgrade'"
```
1. (restart 1) For each broker: upgrade and restart Kafka, still using `inter.broker.protocol.version=0.9.0.1`.
```
sudo service kafka stop
sudo apt-get remove confluent-kafka-2.11.7
sudo apt-get install confluent-kafka-2.11
sudo service kafka start
# wait until broker is back up and in ISRs (substitute this broker's id for 1001 in the grep if it differs), then initiate election:
watch "kafka topics --describe --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
kafka preferred-replica-election
# Now proceed with next broker...
```
2. (restart 2) Merge https://gerrit.wikimedia.org/r/#/c/430449/. For each broker, run puppet to set `inter.broker.protocol.version=1.1.0` and restart Kafka.
```
sudo puppet agent --enable && sudo puppet agent -t
sudo service kafka restart
# wait until broker is back up and in ISRs (substitute this broker's id for 1001 in the grep if it differs), then initiate election:
watch "kafka topics --describe --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
kafka preferred-replica-election
# Now proceed with next broker...
```
3. Remove the hardcoded api version setting from clients. With the brokers on the 1.1.0 protocol, clients can negotiate api versions themselves.
Do each of the following carefully, and ensure that each service is working properly.
Merge https://gerrit.wikimedia.org/r/#/c/430640/ and restart services
```
# on eventbus (kafka main) hosts, rolling restart each eventbus service
sudo puppet agent -t
depool && sudo service eventlogging-service-eventbus restart && sleep 3 && pool
# on kafkamon2001
sudo puppet agent -t
sudo service burrow-main-codfw restart
# on webperf[12]001, run puppet to restart statsv without api_version hardcoded.
sudo puppet agent -t
# On neodymium: restart statsv varnishkafkas
sudo cumin -b 1 -s 5 'C:profile::cache::kafka::statsv' "run-puppet-agent && sudo service varnishkafka-statsv restart"
```
Update client api versions for change-prop:
//TBD: Petr to fill in//
//TBD: make a patch for statsv.py to make the api version setting optional instead of hardcoded, see https://gerrit.wikimedia.org/r/#/c/429432/1/statsv.py. Then we can configure it per DC.//
4. (restart 3) Merge https://gerrit.wikimedia.org/r/#/c/430450/. For each broker, run puppet to set default `log.message.format.version` and restart each broker. This can be done at any time.
```
sudo puppet agent -t
sudo service kafka restart
# wait until broker is back up and in ISRs (substitute this broker's id for 1001 in the grep if it differs), then initiate election:
watch "kafka topics --describe --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
kafka preferred-replica-election
# Now proceed with next broker...
```
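The "wait until broker is back up and in ISRs" step in each restart above is checked by eye with `watch`. As an alternative, here is a rough sketch of a scripted check. `broker_in_all_isrs` is a hypothetical helper, and it assumes the `--describe` output has the usual `Replicas: ... Isr: ...` fields on each partition line; verify that against real output before relying on it.

```shell
# Hypothetical helper: reads `kafka topics --describe` output on stdin.
# Exits 0 if the given broker id appears in the Isr list of every
# partition that lists it under Replicas, and 1 otherwise.
broker_in_all_isrs() {
  awk -v id="$1" '
    /Replicas:/ && /Isr:/ {
      replicas = $0
      sub(/.*Replicas:[ \t]*/, "", replicas)   # drop everything before the list
      sub(/[ \t].*/, "", replicas)             # drop everything after the list
      isr = $0
      sub(/.*Isr:[ \t]*/, "", isr)
      sub(/[ \t].*/, "", isr)
      # Wrap in commas so that e.g. id 200 cannot match 2001.
      if (index("," replicas ",", "," id ",") > 0 &&
          index("," isr ",", "," id ",") == 0)
        missing++
    }
    END { exit (missing > 0) }
  '
}

# Example usage on a broker (2001 is an assumed broker id):
# until kafka topics --describe | broker_in_all_isrs 2001; do sleep 10; done
# kafka preferred-replica-election
```

Unlike the single-topic grep, this looks at every partition the broker replicates, so it is a stricter condition and may take longer to become true.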
### main-eqiad
TODO: Same as above, but with different gerrit patches for main-eqiad
## Post upgrade:
[] Start all main MirrorMaker instances on Kafka 1.1.0. On each cluster and broker, revert //GERRIT_PATCH_TBD// and run puppet.