Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345]
Closed, Resolved · Public · 0 Estimated Story Points

Description

  • Install new kafka-main[12]00[12345] hosts T223493
  • Migrate kafka main services from old to new hardware
    • kafka1001 -> kafka-main1001
      • kafka
      • eventbus
      • mirrormaker
    • kafka1002 -> kafka-main1002
      • kafka
      • eventbus
      • mirrormaker
    • kafka1003 -> kafka-main1003
      • kafka
      • eventbus
      • mirrormaker
    • kafka2001 -> kafka-main2001
      • kafka
      • eventbus
      • mirrormaker
    • kafka2002 -> kafka-main2002
      • kafka
      • eventbus
      • mirrormaker
    • kafka2003 -> kafka-main2003
      • kafka
      • eventbus
      • mirrormaker
  • Move kafka[12]00[123] to role::spare::system
    • kafka1001
    • kafka1002
    • kafka1003
    • kafka2001
    • kafka2002
    • kafka2003
  • Increase cluster size to 5 hosts
    • kafka-main1004
    • kafka-main1005
    • kafka-main2004
    • kafka-main2005
  • Redistribute data to utilize all hosts

Details

Repo                           Branch       Lines +/-
operations/puppet              production   +7 -1
operations/homer/public        master       +16 -0
operations/puppet              production   +14 -10
operations/deployment-charts   master       +114 -6
operations/puppet              production   +8 -0
operations/puppet              production   +0 -10
operations/puppet              production   +5 -1
operations/puppet              production   +1 -5
operations/puppet              production   +5 -5
operations/puppet              production   +2 -2
operations/puppet              production   +5 -5
operations/puppet              production   +5 -1
operations/puppet              production   +9 -5
operations/deployment-charts   master       +2 -2
operations/deployment-charts   master       +1 -1
operations/deployment-charts   master       +196 -2
operations/puppet              production   +84 -0
operations/puppet              production   +4 -10
operations/puppet              production   +19 -7
operations/puppet              production   +2 -2
operations/puppet              production   +17 -5
operations/puppet              production   +10 -1
operations/puppet              production   +21 -4

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 528271 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka1001 hardware with kafka-main1001

https://gerrit.wikimedia.org/r/528271

Change 528275 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] calico: add all kafka-main hosts to k8s eventgate policy

https://gerrit.wikimedia.org/r/528275

Change 528275 merged by Alexandros Kosiaris:
[operations/puppet@production] calico: add all kafka-main hosts to k8s eventgate policy

https://gerrit.wikimedia.org/r/528275

Change 528432 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] calico: add all kafka-main hosts to k8s eventgate policy

https://gerrit.wikimedia.org/r/528432

Change 528432 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] calico: add all kafka-main hosts to k8s eventgate policy

https://gerrit.wikimedia.org/r/528432

herron renamed this task from "Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345]" to "Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345]". Aug 9 2019, 8:25 PM
herron updated the task description. (Show Details)

Change 529428 had a related patch set uploaded (by Herron; owner: Herron):
[operations/deployment-charts@master] eventgate-main: replace broker kafka1001 with kafka-main1001

https://gerrit.wikimedia.org/r/529428

Sounds good. How do the steps and ordering look in https://docs.google.com/document/d/1o7bl1WBzSMymsXGzhWLy1GmMOSo_Evof1PHk2aQXAiE ? And what is the process to deploy the new eventgate-main config?

Hey @Ottomata @elukey, what do you think?

I'm game to begin eqiad migrations this week if possible.

Andrew is on holidays, but it looks good to me!

Ok! Will plan to migrate kafka1001 -> kafka-main1001 tomorrow morning Eastern time

One more thing -- @elukey do you know what the process is to deploy a new eventgate-main config?

Completely ignorant about it, I'd loop in @jijiki :)

@herron I believe this is the documentation for it https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#eventgate_chart_change

It references eventgate-analytics, but the process for eventgate-main is 100% the same.

Heya!

Yes, that link from Petr is the right one, just replace any eventgate-analytics with eventgate-main.

The easiest (fewest deployments) thing to do would be to add all of the new broker names to all of the broker lists (including eventgate-main) and deploy, even before they are ready. On startup the Kafka client will just use the first broker it finds that responds. Then once all new brokers are up and the old ones are down, you can remove the old broker names from the lists and deploy again.
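For example, a combined old/new broker list can be sanity-checked with kafkacat; a minimal sketch, assuming the standard plaintext port 9092 and an illustrative subset of hosts (only one broker in the list needs to respond for the metadata fetch to succeed):

# Fetch cluster metadata through a mixed old/new broker list; kafkacat will
# bootstrap from whichever of these brokers answers first.
kafkacat -L \
  -b kafka1001.eqiad.wmnet:9092,kafka-main1001.eqiad.wmnet:9092,kafka-main1002.eqiad.wmnet:9092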

Change 529428 merged by Herron:
[operations/deployment-charts@master] eventgate-main: add new kafka-main brokers to broker list

https://gerrit.wikimedia.org/r/529428

Change 534472 had a related patch set uploaded (by Herron; owner: Herron):
[operations/deployment-charts@master] eventgate-main: add new brokers to staging broker list

https://gerrit.wikimedia.org/r/534472

Change 534472 merged by Herron:
[operations/deployment-charts@master] eventgate-main: add new brokers to staging broker list

https://gerrit.wikimedia.org/r/534472

The eventgate-main config now includes the new brokers in the broker list.

I'll plan to move forward with migrating kafka1001 to kafka-main1001 tomorrow morning (eastern)

Mentioned in SAL (#wikimedia-operations) [2019-09-05T15:23:14Z] <herron> beginning replacement of kafka1001 with kafka-main1001 T225005

Change 528271 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka1001 hardware with kafka-main1001

https://gerrit.wikimedia.org/r/528271

Change 534634 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: move kafka1001 to role::spare::system

https://gerrit.wikimedia.org/r/534634

Change 534634 merged by Herron:
[operations/puppet@production] kafka-main: move kafka1001 to role::spare::system

https://gerrit.wikimedia.org/r/534634

Change 536655 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka1002 hardware with kafka-main1002

https://gerrit.wikimedia.org/r/536655

@herron FYI last week we decommissioned eventlogging-service-eventbus and removed it from puppet in role::kafka::main. So you won't see it provisioned when you run puppet on the new nodes now.

Mentioned in SAL (#wikimedia-operations) [2019-09-16T18:12:58Z] <herron> migrating kafka1002 to kafka-main1002 T225005

Change 536655 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka1002 hardware with kafka-main1002

https://gerrit.wikimedia.org/r/536655

Change 537196 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: move kafka1002 to role spare system

https://gerrit.wikimedia.org/r/537196

Change 537196 merged by Herron:
[operations/puppet@production] kafka-main: move kafka1002 to role spare system

https://gerrit.wikimedia.org/r/537196

Change 537428 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka1003 hardware with kafka-main1003

https://gerrit.wikimedia.org/r/537428

Mentioned in SAL (#wikimedia-operations) [2019-09-17T14:03:27Z] <herron> migrating kafka1003 to kafka-main1003 T225005

Change 537428 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka1003 hardware with kafka-main1003

https://gerrit.wikimedia.org/r/537428

Change 537490 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: move kafka1003 to role spare system

https://gerrit.wikimedia.org/r/537490

Change 537490 merged by Herron:
[operations/puppet@production] kafka-main: move kafka1003 to role spare system

https://gerrit.wikimedia.org/r/537490

Change 534633 abandoned by Herron:
kafka-main: move kafka1001 to role::spare::system

Reason:
dupe of I3c8c23efc2b48534adc6e94c9929bb3a9531c72e

https://gerrit.wikimedia.org/r/534633

@herron in T255973 @razzi is moving partitions to the new Kafka Jumbo brokers, and the procedure seems to be working very well. We still have to move the huge topics like webrequests_text, but so far we haven't spotted anything weird.

I'd really love to see kafka-main expanded with the new nodes; it will be much more resilient with 5 brokers (a few weeks ago a switch failure in codfw caused problems for one broker, and running with only 2 was not great). I think that we could do the following:

  1. Add the new nodes to the eqiad and codfw clusters (in theory this should only take a code review; a rough sketch of that change is included after this list). They will join the cluster without any partitions to manage, but they will work fine (Jumbo ran with a similar config for a while).
  2. Plan with @razzi how to migrate partitions between nodes (this could be a goal for next quarter in theory, or something that we do from time to time over the next months).
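For illustration, the code review in (1) boils down to adding the new brokers to the Kafka cluster definition in puppet hiera. The snippet below is only a sketch of that shape; the key names, broker IDs and surrounding structure are assumptions and must be taken from the real kafka_clusters hieradata:

# Hypothetical hiera fragment; exact keys, IDs and racks are assumptions,
# not the real data.
kafka_clusters:
  main-eqiad:
    brokers:
      kafka-main1001.eqiad.wmnet: { id: 1001 }
      kafka-main1002.eqiad.wmnet: { id: 1002 }
      kafka-main1003.eqiad.wmnet: { id: 1003 }
      kafka-main1004.eqiad.wmnet: { id: 1004 }  # new broker
      kafka-main1005.eqiad.wmnet: { id: 1005 }  # new broker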

Let me know :)

That's really exciting! Yes, I'd love to see this happen as well, and I'm on board with the plan that you outlined. Time will be the main constraint for me right now, but yes, let's get started on the prep work, and if necessary we can plan out the more time-consuming components for next quarter.

Fwiw, we are also planning some Kafka upgrades in the logging clusters in that timeframe, so perhaps these Kafka efforts could be combined, or at least happen essentially in parallel.

@herron getting back to this so we can add an OKR for Q4 :)

We could do the following, let me know what you think about it:

  1. Reimage kafka-main100[4,5] and kafka-main200[4,5] to Buster, and add them to the kafka main clusters as new brokers. They will start getting partitions for new topics only.
  2. Reimage the rest of the kafka brokers to Debian Buster, preserving data, one host at a time. We already did this with Kafka Jumbo.
  3. Use a procedure like the one in T255973 to move partitions to the new hosts. @razzi can help in making the plan :)

@herron do you think that we could do this in Q4? I can help if needed :)

@herron ping :) Should we work on this in Q4? I can allocate some time to help, at least to bring the cluster to 5 nodes. Then we can work on moving the kafka topics/partitions with @razzi's help maybe in Q1 2021/22 ?

Also FYI in T271136 Cas is going to add the IPv6 AAAA records for the codfw cluster, including for the new nodes.

Should we work on this in Q4? I can allocate some time to help, at least to bring the cluster to 5 nodes. Then we can work on moving the kafka topics/partitions with @razzi's help maybe in Q1 2021/22 ?

Sounds great! Yes I'm game to work on this in Q4, let's do it. I should be able to start the [12]00[45] reimages in the next week or two.

Out of curiosity, what approach did you take to preserve data one host at a time while upgrading Kafka Jumbo?

Nice!

I used Stevie's reuse-parts partman script:

kafka-jumbo100[1-9]) echo reuse-parts.cfg partman/custom/reuse-kafka-jumbo.cfg ;; \

The recipe is the following:

d-i	partman/reuse_partitions_recipe	string \
	/dev/sda|1 ext4 format /boot|2 lvmpv ignore none, \
	/dev/sdb|1 lvmpv ignore none, \
	/dev/mapper/vg--flex-root|1 ext4 format /, \
	/dev/mapper/vg--data-srv|1 ext4 keep /srv

d-i partman-basicfilesystems/no_swap boolean false

Very easy, and it can be adapted to the kafka-main nodes after checking lsblk -i -fs on those hosts (to adjust the recipe if needed). There is also a reuse-parts-test.cfg, which does the same but pauses the Debian installer for a manual green light / proceed action, allowing a human review if needed (usually a good idea for the first reimage to make sure that everything is fine).
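For reference, an adapted entry for the kafka-main hosts might look like the sketch below. This is only an assumed shape: the device names and volume-group/LV names are placeholders and have to be confirmed with lsblk -i -fs on an actual kafka-main host before use.

# Hypothetical netboot.cfg entry and reuse recipe for kafka-main; device and
# VG/LV names are placeholders. The "keep /srv" line (preserving Kafka data)
# is the essential part.
kafka-main[12]00[1-5]) echo reuse-parts.cfg partman/custom/reuse-kafka-main.cfg ;; \

d-i	partman/reuse_partitions_recipe	string \
	/dev/sda|1 ext4 format /boot|2 lvmpv ignore none, \
	/dev/mapper/vg--main-root|1 ext4 format /, \
	/dev/mapper/vg--main-srv|1 ext4 keep /srv

d-i partman-basicfilesystems/no_swap boolean false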

@herron ping, we should start working on this :)

Change 682731 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] install_server: kafka-main[12]00[1-5] use default release installer

https://gerrit.wikimedia.org/r/682731

Change 682731 merged by Elukey:

[operations/puppet@production] install_server: kafka-main[12]00[1-5] use default release installer

https://gerrit.wikimedia.org/r/682731

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

kafka-main1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104271709_herron_30719_kafka-main1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['kafka-main1004.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

kafka-main1005.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104271810_herron_10811_kafka-main1005_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['kafka-main1005.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

kafka-main2004.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104271912_herron_22766_kafka-main2004_codfw_wmnet.log.

Change 683044 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-main: deploy kafka::main role to kafka-main[12]00[45]

https://gerrit.wikimedia.org/r/683044

Completed auto-reimage of hosts:

['kafka-main2004.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

kafka-main2005.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104271947_herron_29521_kafka-main2005_codfw_wmnet.log.

Completed auto-reimage of hosts:

['kafka-main2005.codfw.wmnet']

and were ALL successful.

Change 683232 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add new kafka-main IPs to the kafka_brokers_main firewall rules

https://gerrit.wikimedia.org/r/683232

Change 683232 merged by Elukey:

[operations/puppet@production] Add new kafka-main IPs to the kafka_brokers_main firewall rules

https://gerrit.wikimedia.org/r/683232

Change 683706 had a related patch set uploaded (by Herron; author: Herron):

[operations/deployment-charts@master] add kafka-main[12]00[45] to existing kafka-main egress rules

https://gerrit.wikimedia.org/r/683706

Change 683706 merged by jenkins-bot:

[operations/deployment-charts@master] add kafka-main[12]00[45] to existing kafka-main egress rules and broker lists

https://gerrit.wikimedia.org/r/683706

Change 683044 merged by Herron:

[operations/puppet@production] kafka-main: deploy kafka::main role to kafka-main[12]00[45]

https://gerrit.wikimedia.org/r/683044

Change 695192 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/homer/public@master] Add kafka-main[12]00[45] to analytics-in{4,6} filters

https://gerrit.wikimedia.org/r/695192

Change 695192 merged by Elukey:

[operations/homer/public@master] Add kafka-main[12]00[45] to analytics-in{4,6} filters

https://gerrit.wikimedia.org/r/695192

Thanks to Keith's work we now have two 5-node clusters! \o/

The last step before closing this task is to redistribute topic partitions to take advantage of the new hosts in both clusters. This is a long step; more info about what Razzi did for Jumbo is in T255973:

  • use topicmappr to generate a list of partition moves to apply - see T255973#6621849, T255973#6627751 and T255973#6741950 (see the command at the end of those comments for more info about topicmappr)
  • for every file generated, execute something like kafka-reassign-partitions --zookeeper conf1004.eqiad.wmnet,conf1005.eqiad.wmnet,conf1006.eqiad.wmnet/kafka/jumbo-eqiad --reassignment-json-file eqiad.resource_change.part1.json --execute --throttle 10000000 (see T255973#6741950).
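Once a given reassignment has completed, the same command with --verify confirms the move and removes the replication throttle that --execute applied; a sketch, reusing the Jumbo example above:

# Check reassignment status and clear the throttle once all partitions report
# completion.
kafka-reassign-partitions --zookeeper conf1004.eqiad.wmnet,conf1005.eqiad.wmnet,conf1006.eqiad.wmnet/kafka/jumbo-eqiad \
  --reassignment-json-file eqiad.resource_change.part1.json --verify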

@razzi does it make sense overall? Please correct me if I am wrong :)

@herron we could work together to come up with a list of topics to migrate and generate the config files, then it should be a matter of executing them serially, one by one, until we are done (I can take care of some in my morning and let you do others during your daytime, which should cut the overall time down a lot). What do you think?

@razzi @herron do you think that we can setup a quick meeting to discuss the next steps and how to proceed?

Yes please invite me to a meeting @elukey! Thanks for keeping things moving on this one!

We met today and this is the plan forward:

  1. use topicmappr to create a list of json files containing the new desired state for kafka-main (i.e. which partitions should go where to rebalance the cluster); a sketch of the command is shown after this list. By default topicmappr rebalances on the raw number of partitions, ensuring that all brokers end up with the same partition count. This worked well on Jumbo, but it may not work as well for main (say, for example, that one node gets multiple high-traffic partitions from different topics and the others don't). It is a risk, but we'll also get feedback from Grafana as we proceed with partition moves, so it should be fine to go ahead and correct course if needed.
  2. review the plan, alert other SREs and have a wikitech page with commands to execute if something breaks.
  3. slowly move every partition indicated in the json files until we reach the end :D
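A rough sketch of step 1, based on how topicmappr was used for Jumbo in T255973. The topic name and broker IDs below are placeholders, and the flag names should be double-checked against topicmappr rebuild --help before running anything:

# Generate a reassignment map for one topic across all five brokers
# (placeholder topic and broker IDs); topicmappr writes a <topic>.json file
# that can then be fed to kafka-reassign-partitions.
topicmappr rebuild \
  --zk-addr conf1004.eqiad.wmnet:2181 \
  --topics 'eqiad.mediawiki.job.example' \
  --brokers 1001,1002,1003,1004,1005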

Thanos graphs for topics with more than 0 msg/s:

This use case is really interesting since the active datacenter + mirror maker play a big role. For example, codfw is the current active DC for mediawiki, and on both main-eqiad and main-codfw the codfw.* jobqueue topics are trending in the top spots. When we switch back to eqiad, I imagine the same will happen but with the eqiad.* topics. Since the number of topics to migrate doesn't seem gigantic (plus all topics have 1 to 3 partitions before replication) we could do something like this:

  1. Come up with a list of topics with less than 10 msg/s and create a plan for both the eqiad and codfw variants.
  2. Start with the main-codfw cluster, and see how traffic changes when mediawiki is in eqiad.
  3. If everything goes well, proceed with the topics with more than 10 msg/s (same criteria).

Before proceeding with the other cluster (main-eqiad) we could wait for the switchback to see if what I wrote above about eqiad.* vs codfw.* topics holds. Thoughts?

Plan looks good to me!

I'd also suggest spinning off a subtask or spreadsheet to keep tabs on the topic list/state as we progress through them. A task might be best since we could !log to it and have a concise history afterwards.

Looks great! I'd put some extra emphasis on the "save current state" step, just in case.

@herron +1 for the new task, opening one

Change 520465 abandoned by Razzi:

[operations/puppet@production] kafka-main: add kafka-main200[45] to the codfw cluster

Reason:

This has already been applied in another patch

https://gerrit.wikimedia.org/r/520465