Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345]
Open, Normal · Public · 0 Story Points

Description

  • Install new kafka-main[12]00[12345] hosts T223493
  • Migrate kafka main services from old to new hardware
    • kafka1001 -> kafka-main1001
      • kafka
      • eventbus
      • mirrormaker
    • kafka1002 -> kafka-main1002
      • kafka
      • eventbus
      • mirrormaker
    • kafka1003 -> kafka-main1003
      • kafka
      • eventbus
      • mirrormaker
    • kafka2001 -> kafka-main2001
      • kafka
      • eventbus
      • mirrormaker
    • kafka2002 -> kafka-main2002
      • kafka
      • eventbus
      • mirrormaker
    • kafka2003 -> kafka-main2003
      • kafka
      • eventbus
      • mirrormaker
  • Move kafka[12]00[123] to role::spare::system
    • kafka1001
    • kafka1002
    • kafka1003
    • kafka2001
    • kafka2002
    • kafka2003
  • Turn down kafka[12]00[123] hardware (create decom tasks)
  • Increase cluster size to 5 hosts
    • kafka-main1004
    • kafka-main1005
    • kafka-main2004
    • kafka-main2005
  • Redistribute data to utilize all hosts
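As a rough illustration of the plan above, the bracket patterns in the task title can be expanded into concrete hostnames and split into replacement pairs and expansion hosts. This is only a sketch: the `expand` helper and the variable names are hypothetical and not part of any tooling referenced in this task.

```python
import itertools
import re


def expand(pattern):
    """Expand bracket character classes, e.g. 'kafka[12]00[123]' -> 6 hostnames."""
    # re.split with a capturing group alternates: literal, class, literal, class, ...
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = [list(p) if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


old_hosts = expand("kafka[12]00[123]")
new_hosts = expand("kafka-main[12]00[12345]")

# Each old host is replaced by the new host with the same numeric suffix.
migration = {old: old.replace("kafka", "kafka-main") for old in old_hosts}

# New hosts without a predecessor are the ones that grow the cluster to 5 per site.
expansion = sorted(set(new_hosts) - set(migration.values()))

for old, new in migration.items():
    print(f"{old} -> {new}")
print("expansion hosts:", ", ".join(expansion))
```

The pairs printed match the migration list in the description, and the remaining four hosts match the "Increase cluster size" list.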

Event Timeline

herron triaged this task as Normal priority. Jun 4 2019, 5:01 PM
herron created this task.
Restricted Application added a subscriber: Aklapper. Jun 4 2019, 5:01 PM
herron added a comment. Jun 4 2019, 5:04 PM

Fwiw kafka2001 is the current controller so thinking we should start with kafka2003 -> kafka-main2003
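A minimal sketch of that ordering idea, assuming the intent is to leave the current controller for last so that it changes hands only once per site. The comment itself only says to start with kafka2003, so the rationale is an assumption, and `migration_order` is a hypothetical helper.

```python
def migration_order(brokers, controller):
    """Return brokers in descending name order, with the controller moved to the end."""
    if controller not in brokers:
        raise ValueError(f"{controller} is not in the broker list")
    others = sorted((b for b in brokers if b != controller), reverse=True)
    return others + [controller]


order = migration_order(["kafka2001", "kafka2002", "kafka2003"], controller="kafka2001")
print(order)
```

For the codfw brokers this gives kafka2003, then kafka2002, then kafka2001.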

Change 514361 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka2003 hardware with kafka-main2003

https://gerrit.wikimedia.org/r/514361

Restricted Application added a project: Analytics. Jun 5 2019, 10:37 AM
herron added a comment. Jun 5 2019, 3:35 PM

Here's a first shot at per-host replacement steps for kafka2003 -> kafka-main2003:

https://docs.google.com/document/d/1o7bl1WBzSMymsXGzhWLy1GmMOSo_Evof1PHk2aQXAiE/edit?usp=sharing

Please LMK your thoughts, what you think should be added/improved/re-ordered/etc.

fdans moved this task from Incoming to Radar on the Analytics board. Jun 6 2019, 4:49 PM

Mentioned in SAL (#wikimedia-operations) [2019-06-25T14:43:07Z] <herron> beginning replacement of kafka2003 with kafka-main2003 T225005

Change 514361 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka2003 hardware with kafka-main2003

https://gerrit.wikimedia.org/r/514361

Mentioned in SAL (#wikimedia-operations) [2019-06-25T17:01:26Z] <herron> finished migration of kafka2003 to kafka-main2003 — enabling alert notifications for kafka-main2003, and leaving kafka2003 disabled T225005

herron updated the task description. Jun 25 2019, 5:01 PM

Change 519084 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka2003 move to role::spare::system

https://gerrit.wikimedia.org/r/519084

Change 519130 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka2002 hardware with kafka-main2002

https://gerrit.wikimedia.org/r/519130

herron updated the task description. Jun 25 2019, 9:03 PM

Change 519084 merged by Herron:
[operations/puppet@production] kafka2003 move to role::spare::system

https://gerrit.wikimedia.org/r/519084

Mentioned in SAL (#wikimedia-operations) [2019-06-26T14:16:24Z] <herron> beginning replacement of kafka2002 with kafka-main2002 T225005

Change 519130 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka2002 hardware with kafka-main2002

https://gerrit.wikimedia.org/r/519130

Mentioned in SAL (#wikimedia-operations) [2019-06-26T17:52:09Z] <herron> finished migration of kafka2002 to kafka-main2002 — enabling alert notifications for kafka-main2002, and leaving kafka2002 disabled T225005

Change 519271 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka2002 move to role::spare::system

https://gerrit.wikimedia.org/r/519271

Change 519271 merged by Herron:
[operations/puppet@production] kafka2002 move to role::spare::system

https://gerrit.wikimedia.org/r/519271

herron updated the task description. Jun 26 2019, 6:48 PM

Change 519273 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka2001 hardware with kafka-main2001

https://gerrit.wikimedia.org/r/519273

Mentioned in SAL (#wikimedia-operations) [2019-06-27T14:33:16Z] <akosiaris> push newer calico outgoing policy rules. T225005

Mentioned in SAL (#wikimedia-operations) [2019-06-27T14:43:20Z] <herron> beginning replacement of kafka2001 with kafka-main2001 T225005

Change 519273 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka2001 hardware with kafka-main2001

https://gerrit.wikimedia.org/r/519273

Mentioned in SAL (#wikimedia-operations) [2019-06-27T18:13:43Z] <herron> kafka2001 -> kafka-main2001 migration complete. re-enabling alerting on kafka-main2001, and moving kafka2001 to role::spare::system T225005

Change 519483 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka2001 move to role::spare::system

https://gerrit.wikimedia.org/r/519483

Change 519483 merged by Herron:
[operations/puppet@production] kafka2001 move to role::spare::system

https://gerrit.wikimedia.org/r/519483

Hm @herron, today we experienced T226808: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap, which I think is caused by the fact that the eventstreams service has service::node auto_refresh => false. I forgot about this. eventstreams should be depooled, puppet run, and restarted for each new server. Same goes for change-prop, and possibly change-prop-job-queue. Sorry for not catching this when I reviewed the migration plan.

There seem to be other eventstreams problems that may be unrelated to (but possibly triggered by?) this change; we are still investigating. CC @elukey
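To make the missing step concrete, here is a sketch of a per-swap checklist for the services named above. It only builds a list of reminders and runs no commands. The depool, puppet run, and restart steps come from the comment; the final repool step is an assumption, and `checklist` is a hypothetical helper.

```python
# Services named in the comment as needing a manual refresh after a broker swap.
SERVICES = ["eventstreams", "change-prop", "change-prop-job-queue"]

# First three steps are from the comment; "repool" is assumed, not stated there.
STEPS = ["depool", "run puppet", "restart", "repool"]


def checklist(swap, services=SERVICES):
    """Return one reminder line per (service, step) for a given broker swap."""
    return [f"[{swap}] {svc}: {step}" for svc in services for step in STEPS]


items = checklist("kafka2001 -> kafka-main2001")
for line in items:
    print(line)
```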

herron updated the task description. Jul 3 2019, 3:42 PM

Change 520465 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: add kafka-main200[45] to the codfw cluster

https://gerrit.wikimedia.org/r/520465

Change 528271 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka1001 hardware with kafka-main1001

https://gerrit.wikimedia.org/r/528271

Change 528275 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] calico: add all kafka-main hosts to k8s eventgate policy

https://gerrit.wikimedia.org/r/528275

Change 528275 merged by Alexandros Kosiaris:
[operations/puppet@production] calico: add all kafka-main hosts to k8s eventgate policy

https://gerrit.wikimedia.org/r/528275

Change 528432 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] calico: add all kafka-main hosts to k8s eventgate policy

https://gerrit.wikimedia.org/r/528432

Change 528432 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] calico: add all kafka-main hosts to k8s eventgate policy

https://gerrit.wikimedia.org/r/528432

herron renamed this task from Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] to Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345]. Aug 9 2019, 8:25 PM
herron updated the task description.

Change 529428 had a related patch set uploaded (by Herron; owner: Herron):
[operations/deployment-charts@master] eventgate-main: replace broker kafka1001 with kafka-main1001

https://gerrit.wikimedia.org/r/529428

herron added a comment. Aug 9 2019, 8:40 PM

Sounds good. How do the steps and ordering look in https://docs.google.com/document/d/1o7bl1WBzSMymsXGzhWLy1GmMOSo_Evof1PHk2aQXAiE ? And what is the process to deploy the new eventgate-main config?

Hey @Ottomata @elukey, what do you think?

I'm game to begin eqiad migrations this week if possible.

Andrew is on holidays, but it looks good to me!

Ok! Will plan to migrate kafka1001 -> kafka-main1001 tomorrow morning Eastern time

One more thing -- @elukey do you know what the process is to deploy a new eventgate-main config?

elukey added a subscriber: jijiki. Aug 14 2019, 4:26 PM

Completely ignorant about it, I'd loop in @jijiki :)

Pchelolo added a comment. Edited Aug 14 2019, 4:30 PM

@herron I believe this is the documentation for it https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#eventgate_chart_change

It references eventgate-analytics, but the process for eventgate-main is 100% the same.

Heya!

Yes, that link from Petr is the right one, just replace any eventgate-analytics with eventgate-main.

The easiest (fewest deployments) thing to do would be to add all of the new broker names to all of the broker lists (including eventgate-main) and deploy, even before they are ready. On startup the Kafka client will just use the first broker it finds that responds. Then once all new brokers are up and old ones are down, you can remove the old broker names from the lists and deploy again.
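The two-deploy approach described above can be sketched as two list operations. This is illustrative only: the function names are hypothetical, the hostnames are the eqiad ones from this task, and ports and the real chart values format are left out.

```python
def first_deploy(old, new):
    """Broker list for the first deploy: old and new brokers, order kept, no duplicates."""
    return list(dict.fromkeys(old + new))


def second_deploy(current, old):
    """Broker list for the second deploy: everything except the retired old brokers."""
    retired = set(old)
    return [b for b in current if b not in retired]


old_brokers = ["kafka1001", "kafka1002", "kafka1003"]
new_brokers = [f"kafka-main100{i}" for i in range(1, 6)]

during_migration = first_deploy(old_brokers, new_brokers)
after_migration = second_deploy(during_migration, old_brokers)

print(",".join(during_migration))
print(",".join(after_migration))
```

Because the client only needs one listed broker to respond at startup, the combined list stays valid through every individual host swap, and the config is deployed twice in total rather than once per host.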

Change 529428 merged by Herron:
[operations/deployment-charts@master] eventgate-main: add new kafka-main brokers to broker list

https://gerrit.wikimedia.org/r/529428

Change 534472 had a related patch set uploaded (by Herron; owner: Herron):
[operations/deployment-charts@master] eventgate-main: add new brokers to staging broker list

https://gerrit.wikimedia.org/r/534472

Change 534472 merged by Herron:
[operations/deployment-charts@master] eventgate-main: add new brokers to staging broker list

https://gerrit.wikimedia.org/r/534472

herron added a comment. Sep 4 2019, 5:28 PM

The eventgate-main config now includes the new brokers in the broker list.

I'll plan to move forward with migrating kafka1001 to kafka-main1001 tomorrow morning (eastern)

Mentioned in SAL (#wikimedia-operations) [2019-09-05T15:23:14Z] <herron> beginning replacement of kafka1001 with kafka-main1001 T225005

Change 528271 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka1001 hardware with kafka-main1001

https://gerrit.wikimedia.org/r/528271

Change 534634 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: move kafka1001 to role::spare::system

https://gerrit.wikimedia.org/r/534634

Change 534634 merged by Herron:
[operations/puppet@production] kafka-main: move kafka1001 to role::spare::system

https://gerrit.wikimedia.org/r/534634

herron updated the task description. Sep 5 2019, 5:33 PM

Change 536655 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka1002 hardware with kafka-main1002

https://gerrit.wikimedia.org/r/536655

@herron FYI last week we decommissioned eventlogging-service-eventbus and removed it from puppet in role::kafka::main. So you won't see it provisioned when you run puppet on the new nodes now.

@Ottomata excellent thx for the heads up!

Mentioned in SAL (#wikimedia-operations) [2019-09-16T18:12:58Z] <herron> migrating kafka1002 to kafka-main1002 T225005

Change 536655 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka1002 hardware with kafka-main1002

https://gerrit.wikimedia.org/r/536655

Change 537196 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: move kafka1002 to role spare system

https://gerrit.wikimedia.org/r/537196

Change 537196 merged by Herron:
[operations/puppet@production] kafka-main: move kafka1002 to role spare system

https://gerrit.wikimedia.org/r/537196

Change 537428 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka1003 hardware with kafka-main1003

https://gerrit.wikimedia.org/r/537428

Mentioned in SAL (#wikimedia-operations) [2019-09-17T14:03:27Z] <herron> migrating kafka1003 to kafka-main1003 T225005

Change 537428 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka1003 hardware with kafka-main1003

https://gerrit.wikimedia.org/r/537428

Change 537490 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: move kafka1003 to role spare system

https://gerrit.wikimedia.org/r/537490

Change 537490 merged by Herron:
[operations/puppet@production] kafka-main: move kafka1003 to role spare system

https://gerrit.wikimedia.org/r/537490

herron updated the task description. Tue, Sep 17, 5:33 PM

Change 534633 abandoned by Herron:
kafka-main: move kafka1001 to role::spare::system

Reason:
dupe of I3c8c23efc2b48534adc6e94c9929bb3a9531c72e

https://gerrit.wikimedia.org/r/534633