https://kafka.apache.org/documentation/#upgrade_1_1_0
This task is about upgrading the Kafka main clusters to 1.x. T193778 is about enabling SSL and inter-broker encryption after the upgrade is complete.
Prep Work
- Convert Kafka main clusters to use profile::kafka::broker
- Upgrade Kafka main clusters to Debian Stretch and Java 8.
- Test upgrade plan in deployment-prep, ensure Kafka clients work there.
- On all brokers, set:
    inter.broker.protocol.version=0.9.0.1
    log.message.format.version=0.9.0.1
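Before the first restart it may be worth confirming that both settings really are pinned on each broker. The following is a minimal sketch, not part of the original plan: `check_pinned_versions` is a hypothetical helper, and the path of the broker's properties file is passed in as an argument because this task does not say where it lives.

```shell
# Hypothetical helper: succeed only if both version settings in the given
# properties file are pinned to the expected value.
check_pinned_versions() {
    props_file=$1
    expected=$2
    for key in inter.broker.protocol.version log.message.format.version; do
        # -x matches the whole line, -F treats the dots literally
        if ! grep -qxF "${key}=${expected}" "$props_file"; then
            echo "missing or wrong: ${key} (expected ${expected})" >&2
            return 1
        fi
    done
}

# Example (the file path is an assumption, adjust to the broker's real config):
#   check_pinned_versions /path/to/server.properties 0.9.0.1
```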
Production upgrade plan
This upgrade requires 3 rolling restarts of each broker in a Kafka cluster:
- Restart 1: upgrade the package software.
- Restart 2: set inter.broker.protocol.version=1.1.0.
- Restart 3: set log.message.format.version to the default (1.1.0) and enable the SSL port.
In between restarts 2 and 3, we will update client api.version settings to allow for protocol negotiation.
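As a quick reference, the version settings expected after each of the three restarts can be written down as a small lookup. This is only an illustrative sketch derived from the plan above; `settings_after_restart` is a hypothetical name, and the SSL port change that accompanies restart 3 is not modelled.

```shell
# Hypothetical lookup: print the broker version settings expected to be in
# effect after restart 1, 2 or 3 of the plan.
settings_after_restart() {
    case $1 in
        # New package, but both versions still pinned to the old release
        1) echo "inter.broker.protocol.version=0.9.0.1 log.message.format.version=0.9.0.1" ;;
        # Brokers now talk the new protocol, message format still old
        2) echo "inter.broker.protocol.version=1.1.0 log.message.format.version=0.9.0.1" ;;
        # Message format back to the default for the new release
        3) echo "inter.broker.protocol.version=1.1.0 log.message.format.version=1.1.0" ;;
        *) echo "unknown restart: $1" >&2; return 1 ;;
    esac
}
```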
main-codfw
Stop eqiad -> codfw MirrorMaker instances in codfw via puppet: https://gerrit.wikimedia.org/r/#/c/431588/. Set downtime on related Kafka MirrorMaker main-* alerts defined on einsteinium: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=Kafka+MirrorMaker
Set downtime and stop puppet on all main-codfw brokers.
    # On einsteinium:
    for h in kafka2001 kafka2002 kafka2003; do
        sudo icinga-downtime -d 7200 -r "Kafka upgrade T167039" -h $h
    done
    # On neodymium:
    sudo cumin 'kafka200*' "puppet agent --disable '$USER - Kafka upgrade'"
- (restart 1) For each broker: upgrade and restart Kafka, still using inter.broker.protocol.version=0.9.0.1.
    sudo service kafka stop
    sudo apt-get remove confluent-kafka-2.11.7
    sudo apt-get install confluent-kafka-2.11
    sudo service kafka start
    # wait until broker is back up and in ISRs, initiate election:
    watch "kafka topics --describe --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
    kafka preferred-replica-election
    # Now proceed with next broker...
- (restart 2) Merge https://gerrit.wikimedia.org/r/#/c/430449/. For each broker, run puppet to set inter.broker.protocol.version=1.1.0 and restart Kafka.
    sudo puppet agent --enable && sudo puppet agent -t
    sudo service kafka restart
    # wait until broker is back up and in ISRs, initiate election:
    watch "kafka topics --describe --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
    kafka preferred-replica-election
    # Now proceed with next broker...
- Remove api version setting for clients. Kafka now has the ability to negotiate api versions.
Do each of the following carefully, and ensure that each service is working properly.
Merge https://gerrit.wikimedia.org/r/#/c/430640/ and restart services
    # on eventbus (kafka main) hosts, rolling restart each eventbus service
    sudo puppet agent -t
    depool && sudo service eventlogging-service-eventbus restart && sleep 3 && pool
    # on kafkamon2001
    sudo puppet agent -t
    sudo service burrow-main-codfw restart
    # on webperf2001, run puppet to restart statsv without api_version hardcoded.
    sudo puppet agent -t
Deploy client api versions for change-prop and jobqueue only in codfw:
- Change-Prop: https://gerrit.wikimedia.org/r/#/c/431763/
- Job Queue: https://gerrit.wikimedia.org/r/#/c/431764/
- (restart 3) Merge https://gerrit.wikimedia.org/r/#/c/430450/. For each broker, run puppet to set default log.message.format.version and restart each broker. This can be done at any time.
    sudo puppet agent -t
    sudo service kafka restart
    # wait until broker is back up and in ISRs, initiate election:
    watch "kafka topics --describe --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
    kafka preferred-replica-election
    # Now proceed with next broker...
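The "wait until broker is back up and in ISRs" step after each restart is done by eye with watch and grep. A scripted check could look roughly like the sketch below. `broker_in_all_isrs` is a hypothetical helper, not something this task uses. It reads `kafka topics --describe` output on stdin and assumes each partition line carries a comma-separated `Isr:` field, as the grep pattern above implies.

```shell
# Hypothetical helper: read "kafka topics --describe" output on stdin and
# succeed only if the given broker ID is in the ISR of every partition line.
broker_in_all_isrs() {
    broker_id=$1
    awk -v id="$broker_id" '
        /Isr:/ {
            partitions++
            # Keep only the comma-separated list that follows "Isr:"
            isr = $0
            sub(/.*Isr:[ \t]*/, "", isr)
            sub(/[ \t].*/, "", isr)
            n = split(isr, members, ",")
            found = 0
            for (i = 1; i <= n; i++) if (members[i] == id) found = 1
            if (!found) missing++
        }
        END {
            # Fail on empty input too, so a broken pipe is not read as success
            if (partitions > 0 && missing == 0) exit 0
            exit 1
        }
    '
}

# Example: block until broker 1001 has rejoined every ISR of the topic.
#   until kafka topics --describe --topic eqiad.mediawiki.revision-create \
#         | broker_in_all_isrs 1001; do sleep 5; done
```

Unlike the grep pattern, this matches the broker ID exactly and requires it on every partition line rather than just printing the lines that match.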
main-eqiad
Stop codfw -> eqiad MirrorMaker instances in eqiad via puppet: https://gerrit.wikimedia.org/r/#/c/431588/. Set downtime on related Kafka MirrorMaker main-* alerts defined on einsteinium: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=Kafka+MirrorMaker
Set downtime and stop puppet on all main-eqiad brokers.
    # On einsteinium:
    for h in kafka1001 kafka1002 kafka1003; do
        sudo icinga-downtime -d 7200 -r "Kafka upgrade T167039" -h $h
    done
    # On neodymium:
    sudo cumin 'kafka100*' "puppet agent --disable '$USER - Kafka upgrade'"
- (restart 1) For each broker: upgrade and restart Kafka, still using inter.broker.protocol.version=0.9.0.1.
    sudo service kafka stop
    sudo apt-get remove confluent-kafka-2.11.7
    sudo apt-get install confluent-kafka-2.11
    sudo service kafka start
    # wait until broker is back up and in ISRs, initiate election:
    watch "kafka topics --describe --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
    kafka preferred-replica-election
    # Now proceed with next broker...
- (restart 2) Merge https://gerrit.wikimedia.org/r/#/c/430449/. For each broker, run puppet to set inter.broker.protocol.version=1.1.0 and restart Kafka.
    sudo puppet agent --enable && sudo puppet agent -t
    sudo service kafka restart
    # wait until broker is back up and in ISRs, initiate election:
    watch "kafka topics --describe --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
    kafka preferred-replica-election
    # Now proceed with next broker...
- Remove api version setting for clients. Kafka now has the ability to negotiate api versions.
Do each of the following carefully, and ensure that each service is working properly.
Merge https://gerrit.wikimedia.org/r/#/c/430640/ and restart services
    # on eventbus (kafka main) hosts, rolling restart each eventbus service
    sudo puppet agent -t
    sudo depool
    sudo service eventlogging-service-eventbus restart
    sudo pool
    # on kafkamon1001
    sudo puppet agent -t
    sudo service burrow-main-eqiad restart
    # on webperf1001, run puppet to restart statsv without api_version hardcoded.
    sudo puppet agent -t
    # On neodymium: restart statsv varnishkafkas
    sudo cumin 'C:profile::cache::kafka::statsv' "run-puppet-agent"
Deploy client api versions for change-prop and jobqueue in eqiad:
- Change-Prop: https://gerrit.wikimedia.org/r/#/c/431763/
- Job Queue: https://gerrit.wikimedia.org/r/#/c/431764/
- (restart 3) Merge https://gerrit.wikimedia.org/r/#/c/430450/. For each broker, run puppet to set default log.message.format.version and restart each broker. This can be done at any time.
    sudo puppet agent -t
    sudo service kafka restart
    # wait until broker is back up and in ISRs, initiate election:
    watch "kafka topics --describe --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
    kafka preferred-replica-election
    # Now proceed with next broker...
Post upgrade:
- Start all main MirrorMaker instances on Kafka 1.1.0. Revert https://gerrit.wikimedia.org/r/#/c/431588/ and then run puppet on each cluster and broker.