
Upgrade Kafka on main cluster with security features
Closed, Resolved | Public | 21 Story Points

Description

https://kafka.apache.org/documentation/#upgrade_1_1_0

This task is about upgrading the Kafka main clusters to 1.x. T193778 is about enabling SSL and inter-broker encryption after the upgrade is complete.

Prep Work

  • Convert Kafka main clusters to use profile::kafka::broker
  • Upgrade Kafka main clusters to Debian Stretch and Java 8.
  • Test upgrade plan in deployment-prep, ensure Kafka clients work there.
  • On all brokers, set:
inter.broker.protocol.version=0.9.0.1
log.message.format.version=0.9.0.1
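
Before the first restart, it can help to confirm that both pins actually landed in each broker's rendered config. The following is a minimal sketch and not part of the original plan: the helper names `prop_value` and `check_pinned` are hypothetical, and the config path in the usage line is an assumption.

```shell
#!/bin/sh
# prop_value FILE KEY: print the value of KEY from a Java-style properties file.
# Hypothetical helper; compares the key literally, so dots are not wildcards.
# If the key appears more than once, the last occurrence wins.
prop_value() {
    awk -F= -v k="$2" '
        $1 == k { v = substr($0, length(k) + 2); found = 1 }
        END     { if (found) print v }
    ' "$1"
}

# check_pinned FILE: succeed only if both version settings are pinned to 0.9.0.1.
check_pinned() {
    for key in inter.broker.protocol.version log.message.format.version; do
        if [ "$(prop_value "$1" "$key")" != "0.9.0.1" ]; then
            echo "$key is not pinned to 0.9.0.1 in $1" >&2
            return 1
        fi
    done
}
```

Usage, assuming the broker config is at /etc/kafka/server.properties: `check_pinned /etc/kafka/server.properties && echo "ok to upgrade"`.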

production upgrade plan

This upgrade requires 3 rolling restarts of each broker in a Kafka cluster, one for each of the following changes:

  1. Upgrade the package software
  2. Set inter.broker.protocol.version=1.1.0
  3. Set log.message.format.version to the default (1.1.0) and enable the SSL port

In between restarts 2 and 3, we will update client api.version settings to allow for protocol negotiation.
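
The version settings that should be in effect after each of the three restarts can be written down as a small lookup. This is only an illustrative sketch: `phase_settings` is a hypothetical name, and for restart 3 it prints the effective value (1.1.0) even though the plan achieves it by falling back to the default rather than setting it explicitly.

```shell
#!/bin/sh
# phase_settings N: print the broker version settings expected to be in effect
# after rolling restart N (1, 2 or 3) of the plan described above.
phase_settings() {
    case "$1" in
        1) ibp=0.9.0.1; fmt=0.9.0.1 ;;  # new packages, old protocol and format
        2) ibp=1.1.0;   fmt=0.9.0.1 ;;  # new inter-broker protocol, old format
        3) ibp=1.1.0;   fmt=1.1.0   ;;  # message format back at the default
        *) echo "unknown restart: $1" >&2; return 1 ;;
    esac
    printf 'inter.broker.protocol.version=%s\n' "$ibp"
    printf 'log.message.format.version=%s\n' "$fmt"
}
```

For example, `phase_settings 2` prints the two lines a broker's effective config should match between the second and third restart.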

NOTE: The 1.x version of MirrorMaker will not work when consuming from a 0.9 cluster. We will stop MirrorMaker instances consuming from 0.9 clusters during the upgrade. This means that until both DCs are upgraded, we will not attempt to keep MirrorMaker running. Once the upgrade in both DCs is complete, we will restart MirrorMaker, and it will consume from where it left off.

main-codfw

Stop eqiad -> codfw MirrorMaker instances in codfw via puppet: https://gerrit.wikimedia.org/r/#/c/431588/. Set downtime on related Kafka MirrorMaker main-* alerts defined on einsteinium: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=Kafka+MirrorMaker

Set downtime and stop puppet on all main-codfw brokers.

# On einsteinium:
for h in kafka2001 kafka2002 kafka2003; do
  sudo icinga-downtime -d 7200 -r "Kafka upgrade T167039" -h $h
done

# On neodymium:
sudo cumin 'kafka200*' "puppet agent --disable '$USER - Kafka upgrade'"
  1. (restart 1) For each broker: upgrade and restart Kafka, still using inter.broker.protocol.version=0.9.0.1.
sudo service kafka stop
sudo apt-get remove confluent-kafka-2.11.7
sudo apt-get install confluent-kafka-2.11
sudo service kafka start
# wait until broker is back up and in ISRs (grep for the ID of the broker just restarted; 1001 is only an example), initiate election:
watch "kafka topics --describe  --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
kafka preferred-replica-election

# Now proceed with next broker...
  2. (restart 2) Merge https://gerrit.wikimedia.org/r/#/c/430449/. For each broker, run puppet to set inter.broker.protocol.version=1.1.0 and restart Kafka.
sudo puppet agent --enable && sudo puppet agent -t
sudo service kafka restart
# wait until broker is back up and in ISRs, initiate election:
watch "kafka topics --describe  --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
kafka preferred-replica-election

# Now proceed with next broker...
  3. Remove api version setting for clients. Kafka now has the ability to negotiate api versions.

Do each of the following carefully, and ensure that each service is working properly.

Merge https://gerrit.wikimedia.org/r/#/c/430640/ and restart services

# on eventbus (kafka main) hosts, rolling restart each eventbus service
sudo puppet agent -t
depool && sudo service eventlogging-service-eventbus restart && sleep 3 && pool

# on kafkamon2001
sudo puppet agent -t
sudo service burrow-main-codfw restart


# on webperf2001, run puppet to restart statsv without api_version hardcoded.
sudo puppet agent -t

Deploy client api versions for change-prop and jobqueue only in codfw:

  4. (restart 3) Merge https://gerrit.wikimedia.org/r/#/c/430450/. For each broker, run puppet to set default log.message.format.version and restart each broker. This can be done at any time.
sudo puppet agent -t
sudo service kafka restart
# wait until broker is back up and in ISRs, initiate election:
watch "kafka topics --describe  --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
kafka preferred-replica-election

# Now proceed with next broker...
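
The `watch ... | grep` step used after each restart above only shows lines where the broker ID appears somewhere after `Isr:`, and a human has to judge when every partition is covered. A stricter check could parse the describe output directly. This is a sketch under assumptions: `broker_in_all_isrs` is a hypothetical helper, and it expects the usual per-partition line layout (`Partition: N  Leader: ...  Replicas: a,b,c  Isr: a,b`).

```shell
#!/bin/sh
# broker_in_all_isrs BROKER_ID: read topic describe output on stdin and succeed
# only if BROKER_ID is in the Isr of every partition that lists it as a replica.
broker_in_all_isrs() {
    awk -v id="$1" '
        /Partition:/ {
            replicas = ""; isr = ""
            # Fields come in "Label: value" pairs; pick out the two lists.
            for (i = 1; i < NF; i++) {
                if ($i == "Replicas:") replicas = $(i + 1)
                if ($i == "Isr:")      isr = $(i + 1)
            }
            # Wrap lists in commas so that 1001 cannot match inside 11001.
            if (("," replicas ",") ~ ("," id ",") && ("," isr ",") !~ ("," id ","))
                missing++
        }
        END { exit (missing > 0) }
    '
}
```

Usage with the wrapper command from the plan: `kafka topics --describe --topic eqiad.mediawiki.revision-create | broker_in_all_isrs 1001 && kafka preferred-replica-election`.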

main-eqiad

Stop codfw -> eqiad MirrorMaker instances in eqiad via puppet: https://gerrit.wikimedia.org/r/#/c/431799/. Set downtime on related Kafka MirrorMaker main-* alerts defined on einsteinium: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=Kafka+MirrorMaker

Set downtime and stop puppet on all main-eqiad brokers.

# On einsteinium:
for h in kafka1001 kafka1002 kafka1003; do
  sudo icinga-downtime -d 7200 -r "Kafka upgrade T167039" -h $h
done

# On neodymium:
sudo cumin 'kafka100*' "puppet agent --disable '$USER - Kafka upgrade'"
  1. (restart 1) For each broker: upgrade and restart Kafka, still using inter.broker.protocol.version=0.9.0.1.
sudo service kafka stop
sudo apt-get remove confluent-kafka-2.11.7
sudo apt-get install confluent-kafka-2.11
sudo service kafka start
# wait until broker is back up and in ISRs, initiate election:
watch "kafka topics --describe  --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
kafka preferred-replica-election

# Now proceed with next broker...
  2. (restart 2) Merge https://gerrit.wikimedia.org/r/#/c/431800/. For each broker, run puppet to set inter.broker.protocol.version=1.1.0 and restart Kafka.
sudo puppet agent --enable && sudo puppet agent -t
sudo service kafka restart
# wait until broker is back up and in ISRs, initiate election:
watch "kafka topics --describe  --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
kafka preferred-replica-election

# Now proceed with next broker...
  3. Remove api version setting for clients. Kafka now has the ability to negotiate api versions.

Do each of the following carefully, and ensure that each service is working properly.

Merge https://gerrit.wikimedia.org/r/#/c/431801/ and restart services

# on eventbus (kafka main) hosts, rolling restart each eventbus service
sudo puppet agent -t
sudo depool
sudo service eventlogging-service-eventbus restart
sudo pool

# on kafkamon1001
sudo puppet agent -t
sudo service burrow-main-eqiad restart


# on webperf1001, run puppet to restart statsv without api_version hardcoded.
sudo puppet agent -t

# On neodymium: restart statsv varnishkafkas
sudo cumin 'C:profile::cache::kafka::statsv' "run-puppet-agent"

Deploy client api versions for change-prop and jobqueue in eqiad:

  4. (restart 3) Merge https://gerrit.wikimedia.org/r/#/c/431802/. For each broker, run puppet to set default log.message.format.version and restart each broker. This can be done at any time.
sudo puppet agent -t
sudo service kafka restart
# wait until broker is back up and in ISRs, initiate election:
watch "kafka topics --describe  --topic eqiad.mediawiki.revision-create | grep -E 'Isr:.*1001.*$'"
kafka preferred-replica-election

# Now proceed with next broker...
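
The "wait until broker is back up" step in each restart above is done by eye with `watch`. If it were scripted, a bounded polling loop is one option. The sketch below is an illustration only: `wait_until` is a hypothetical helper, and the timeout and interval values in the usage line are arbitrary assumptions.

```shell
#!/bin/sh
# wait_until TIMEOUT INTERVAL CMD...: re-run CMD every INTERVAL seconds until it
# succeeds; give up and return 1 once about TIMEOUT seconds have been spent sleeping.
# INTERVAL must be a positive integer. Uses plain (global) variables for POSIX sh.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}
```

Usage, reusing the grep from the plan: `wait_until 600 5 sh -c "kafka topics --describe --topic eqiad.mediawiki.revision-create | grep -qE 'Isr:.*1001'" && kafka preferred-replica-election`. Stopping on timeout rather than looping forever keeps a stuck broker from silently blocking the rest of the rolling restart.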

Post upgrade:

Details

Related Gerrit Patches:
[operations/puppet@production] Re-enable main-eqiad -> main-codfw MirrorMaker
[operations/puppet@production] Re-enable main-codfw -> main-eqiad MirrorMaker
[operations/puppet@production] Kafka main-eqiad - log.message.format.version
[operations/puppet@production] Kafka main-eqiad - remove api.version
[operations/puppet@production] Kafka main-eqiad inter_broker_protocol_version: 1.1.0
[operations/puppet@production] Stop main-codfw -> main-eqiad MirrorMaker during Kafka main upgrade
[operations/puppet@production] Kafka main-codfw patch 2
[operations/puppet@production] Force statsv varnishkafka api.version to 0.9.0.1
[operations/puppet@production] Kafka main-codfw - remove api.version
[operations/puppet@production] Kafka main-codfw patch 1: inter_broker_protocol_version: 1.1.0
[operations/puppet@production] Stop main-eqiad -> main-codfw MirrorMaker during Kafka main upgrade
[operations/puppet@production] Set kafka api_version on statsv instance if provided
[analytics/statsv@master] Make Kafka api_version configurable
[operations/puppet@production] Ensure confluent package systemd units are disabled
[operations/puppet@production] no-op: Add $enabled parameter to profile::kafka::mirror
[operations/puppet@production] Kafka main-codfw patch 4
[operations/puppet@production] Kafka main-codfw patch 3
[operations/puppet@production] Set Rack/row info for Kafka main clusters
[operations/puppet@production] No-op Smart vary security_inter_broker_protocol
[operations/puppet@production] No-op Set inter_broker_protocol_version for main in common hiera
[operations/puppet@production] No-op Move Kafka 0.9.0.1 settings to site specific hiera
[operations/puppet@production] No-op Remove Stretch conditionals for Kafka brokers; all are on Stretch
[operations/puppet@production] No-op Move Kafka version specific configs to site based hiera
[operations/puppet@production] No-op organize kafka broker hiera in prep for main upgrade
[operations/puppet@production] Temporarily look up main kafka cluster name for labs testing
[labs/private@master] Add certificates for kafka_test_broker and kafka_main-deployment-prep_broker

Event Timeline


Change 430432 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] No-op Remove Stretch conditionals for Kafka brokers; all are on Stretch

https://gerrit.wikimedia.org/r/430432

Change 430432 merged by Ottomata:
[operations/puppet@production] No-op Remove Stretch conditionals for Kafka brokers; all are on Stretch

https://gerrit.wikimedia.org/r/430432

Change 430435 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] No-op Move Kafka 0.9.0.1 settings to site specific hiera

https://gerrit.wikimedia.org/r/430435

Change 430435 merged by Ottomata:
[operations/puppet@production] No-op Move Kafka 0.9.0.1 settings to site specific hiera

https://gerrit.wikimedia.org/r/430435

Change 430440 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] No-op Set inter_broker_protocol_version for main in common hiera

https://gerrit.wikimedia.org/r/430440

Change 430440 merged by Ottomata:
[operations/puppet@production] No-op Set inter_broker_protocol_version for main in common hiera

https://gerrit.wikimedia.org/r/430440

Change 430446 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] No-op Smart vary security_inter_broker_protocol

https://gerrit.wikimedia.org/r/430446

Change 430446 merged by Ottomata:
[operations/puppet@production] No-op Smart vary security_inter_broker_protocol

https://gerrit.wikimedia.org/r/430446

Change 430449 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Kafka main-codfw patch 1: inter_broker_protocol_version: 1.1.0

https://gerrit.wikimedia.org/r/430449

Change 430450 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Kafka main-codfw patch 2

https://gerrit.wikimedia.org/r/430450

Change 430451 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Kafka main-codfw patch 3

https://gerrit.wikimedia.org/r/430451

Ottomata updated the task description. May 2 2018, 8:09 PM
Ottomata updated the task description.

Change 430497 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Set Rack/row info for Kafka main clusters

https://gerrit.wikimedia.org/r/430497

Change 430497 merged by Ottomata:
[operations/puppet@production] Set Rack/row info for Kafka main clusters

https://gerrit.wikimedia.org/r/430497

Change 430503 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Kafka main-codfw patch 4

https://gerrit.wikimedia.org/r/430503

Ottomata updated the task description. May 2 2018, 9:53 PM

Change 430451 abandoned by Ottomata:
Kafka main-codfw patch 3

Reason:
Not doing SSL as part of main kafka upgrade

https://gerrit.wikimedia.org/r/430451

Change 430503 abandoned by Ottomata:
Kafka main-codfw patch 4

Reason:
Not doing SSL as part of main kafka upgrade

https://gerrit.wikimedia.org/r/430503

Change 430640 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Kafka main-codfw patch 3 - remove api.version

https://gerrit.wikimedia.org/r/430640

Ottomata updated the task description. May 4 2018, 5:18 PM
Ottomata updated the task description. May 7 2018, 1:56 PM
Ottomata updated the task description. May 7 2018, 2:42 PM

Change 431587 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] no-op: Add $enabled parameter to profile::kafka::mirror

https://gerrit.wikimedia.org/r/431587

Change 431588 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Stop all main MirrorMaker during Kafka main upgrade

https://gerrit.wikimedia.org/r/431588

Change 431587 merged by Ottomata:
[operations/puppet@production] no-op: Add $enabled parameter to profile::kafka::mirror

https://gerrit.wikimedia.org/r/431587

Ottomata updated the task description. May 7 2018, 3:38 PM

Change 431599 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Ensure confluent package systemd units are disabled

https://gerrit.wikimedia.org/r/431599

Ottomata updated the task description. May 7 2018, 5:23 PM
Pchelolo updated the task description. May 7 2018, 5:25 PM

Change 431599 merged by Ottomata:
[operations/puppet@production] Ensure confluent package systemd units are disabled

https://gerrit.wikimedia.org/r/431599

Ottomata updated the task description. May 7 2018, 5:32 PM
Ottomata updated the task description. May 7 2018, 5:57 PM
Ottomata updated the task description. May 7 2018, 7:59 PM

Change 431646 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/statsv@master] Make Kafka api_version configurable

https://gerrit.wikimedia.org/r/431646

Change 431651 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Set kafka api_version on statsv instance if provided

https://gerrit.wikimedia.org/r/431651

Change 431646 merged by Ottomata:
[analytics/statsv@master] Make Kafka api_version configurable

https://gerrit.wikimedia.org/r/431646

Mentioned in SAL (#wikimedia-operations) [2018-05-07T20:30:28Z] <otto@tin> Started deploy [statsv/statsv@c186340]: Configure api.version via CLI opt -- prep for Kafka main upgrade T167039

Mentioned in SAL (#wikimedia-operations) [2018-05-07T20:30:33Z] <otto@tin> Finished deploy [statsv/statsv@c186340]: Configure api.version via CLI opt -- prep for Kafka main upgrade T167039 (duration: 00m 05s)

Change 431651 merged by Ottomata:
[operations/puppet@production] Set kafka api_version on statsv instance if provided

https://gerrit.wikimedia.org/r/431651

Ottomata updated the task description. May 7 2018, 8:37 PM
Ottomata updated the task description. May 7 2018, 9:11 PM
Pchelolo updated the task description. May 7 2018, 9:44 PM
Ottomata updated the task description. May 8 2018, 2:21 PM
Pchelolo updated the task description. May 8 2018, 3:04 PM

Mentioned in SAL (#wikimedia-operations) [2018-05-08T15:06:27Z] <ottomata> beginnng Kafka upgrade of main-codfw: T167039

Mentioned in SAL (#wikimedia-analytics) [2018-05-08T15:06:31Z] <ottomata> beginnng Kafka upgrade of main-codfw: T167039

Change 431588 merged by Ottomata:
[operations/puppet@production] Stop main-eqiad -> main-codfw MirrorMaker during Kafka main upgrade

https://gerrit.wikimedia.org/r/431588

Change 430449 merged by Ottomata:
[operations/puppet@production] Kafka main-codfw patch 1: inter_broker_protocol_version: 1.1.0

https://gerrit.wikimedia.org/r/430449

Change 430640 merged by Ottomata:
[operations/puppet@production] Kafka main-codfw - remove api.version

https://gerrit.wikimedia.org/r/430640

Change 431773 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Force statsv varnishkafka api.version to 0.9.0.1

https://gerrit.wikimedia.org/r/431773

Change 431773 merged by Ottomata:
[operations/puppet@production] Force statsv varnishkafka api.version to 0.9.0.1

https://gerrit.wikimedia.org/r/431773

Mentioned in SAL (#wikimedia-operations) [2018-05-08T16:01:00Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@58935d5]: Allow protocol version negotiation. Codfw only. T167039

Mentioned in SAL (#wikimedia-operations) [2018-05-08T16:01:42Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@58935d5]: Allow protocol version negotiation. Codfw only. T167039 (duration: 00m 42s)

Mentioned in SAL (#wikimedia-operations) [2018-05-08T16:03:30Z] <ppchelko@tin> Started deploy [changeprop/deploy@e468d8e]: Allow protocol version negotiation. Codfw only. T167039

Mentioned in SAL (#wikimedia-operations) [2018-05-08T16:04:32Z] <ppchelko@tin> Finished deploy [changeprop/deploy@e468d8e]: Allow protocol version negotiation. Codfw only. T167039 (duration: 01m 03s)

Change 430450 merged by Ottomata:
[operations/puppet@production] Kafka main-codfw patch 2

https://gerrit.wikimedia.org/r/430450

Change 431799 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Stop main-codfw -> main-eqiad MirrorMaker during Kafka main upgrade

https://gerrit.wikimedia.org/r/431799

Change 431800 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Kafka main-eqiad inter_broker_protocol_version: 1.1.0

https://gerrit.wikimedia.org/r/431800

Change 431801 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Kafka main-eqiad - remove api.version

https://gerrit.wikimedia.org/r/431801

Change 431802 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Kafka main-eqiad - log.message.format.version

https://gerrit.wikimedia.org/r/431802

Ottomata updated the task description. May 8 2018, 5:40 PM

Mentioned in SAL (#wikimedia-operations) [2018-05-09T13:59:21Z] <ottomata> beginning upgrade of Kafka main-eqiad cluster from 0.9.0.1 to 1.1.0 - T167039

Mentioned in SAL (#wikimedia-analytics) [2018-05-09T13:59:25Z] <ottomata> beginning upgrade of Kafka main-eqiad cluster from 0.9.0.1 to 1.1.0 - T167039

Change 431799 merged by Ottomata:
[operations/puppet@production] Stop main-codfw -> main-eqiad MirrorMaker during Kafka main upgrade

https://gerrit.wikimedia.org/r/431799

Change 431800 merged by Ottomata:
[operations/puppet@production] Kafka main-eqiad inter_broker_protocol_version: 1.1.0

https://gerrit.wikimedia.org/r/431800

Change 431801 merged by Ottomata:
[operations/puppet@production] Kafka main-eqiad - remove api.version

https://gerrit.wikimedia.org/r/431801

Mentioned in SAL (#wikimedia-operations) [2018-05-09T14:40:27Z] <ppchelko@tin> Started deploy [changeprop/deploy@e468d8e]: Allow protocol version negotiation. T167039

Mentioned in SAL (#wikimedia-operations) [2018-05-09T14:41:19Z] <ppchelko@tin> Finished deploy [changeprop/deploy@e468d8e]: Allow protocol version negotiation. T167039 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2018-05-09T14:45:05Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@58935d5]: Allow protocol version negotiation. T167039

Mentioned in SAL (#wikimedia-operations) [2018-05-09T14:45:37Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@58935d5]: Allow protocol version negotiation. T167039 (duration: 00m 34s)

Change 431802 merged by Ottomata:
[operations/puppet@production] Kafka main-eqiad - log.message.format.version

https://gerrit.wikimedia.org/r/431802

Change 432096 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Re-enable main-codfw -> main-eqiad MirrorMaker

https://gerrit.wikimedia.org/r/432096

Change 432096 merged by Ottomata:
[operations/puppet@production] Re-enable main-codfw -> main-eqiad MirrorMaker

https://gerrit.wikimedia.org/r/432096

Change 432118 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Re-enable main-eqiad -> main-codfw MirrorMaker

https://gerrit.wikimedia.org/r/432118

Change 432118 merged by Ottomata:
[operations/puppet@production] Re-enable main-eqiad -> main-codfw MirrorMaker

https://gerrit.wikimedia.org/r/432118

Ottomata updated the task description. May 9 2018, 5:11 PM
Ottomata moved this task from Ready to Deploy to Done on the Analytics-Kanban board.
Nuria closed this task as Resolved. Jun 25 2018, 11:14 PM