Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345]
Open, Normal · Public · 0 Story Points

Description

  • Install new kafka-main[12]00[12345] hosts T223493
  • Migrate kafka main services from old to new hardware
    • kafka1001 -> kafka-main1001
      • kafka
      • eventbus
      • mirrormaker
    • kafka1002 -> kafka-main1002
      • kafka
      • eventbus
      • mirrormaker
    • kafka1003 -> kafka-main1003
      • kafka
      • eventbus
      • mirrormaker
    • kafka2001 -> kafka-main2001
      • kafka
      • eventbus
      • mirrormaker
    • kafka2002 -> kafka-main2002
      • kafka
      • eventbus
      • mirrormaker
    • kafka2003 -> kafka-main2003
      • kafka
      • eventbus
      • mirrormaker
  • Move kafka[12]00[123] to role::spare::system
    • kafka1001
    • kafka1002
    • kafka1003
    • kafka2001
    • kafka2002
    • kafka2003
  • Turn down kafka[12]00[123] hardware (create decom tasks)
  • Increase cluster size to 5 hosts
    • kafka-main1004
    • kafka-main1005
    • kafka-main2004
    • kafka-main2005
  • Redistribute data to utilize all hosts
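
The host patterns used in this task (for example kafka[12]00[123]) can be expanded mechanically. The following is a minimal sketch, not part of the original task: the helper name expand_hosts and the variable names are made up for illustration, and it assumes each bracket group lists single-character alternatives.

```python
import itertools
import re


def expand_hosts(pattern):
    """Expand a pattern such as 'kafka[12]00[123]' into concrete host names.

    Hypothetical helper: each [...] group is treated as a set of
    single-character alternatives.
    """
    # 'kafka[12]00[123]' -> ['kafka', '12', '00', '123', '']
    parts = re.split(r"\[([^\]]+)\]", pattern)
    # Even indexes are literal text, odd indexes are alternative sets.
    choices = [
        [part] if i % 2 == 0 else list(part)
        for i, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


old_hosts = expand_hosts("kafka[12]00[123]")
new_hosts = expand_hosts("kafka-main[12]00[12345]")

# Old host -> replacement host, following the "kafkaNNNN -> kafka-mainNNNN"
# pairs listed in the description.
replacements = {old: old.replace("kafka", "kafka-main", 1) for old in old_hosts}

# Hosts that are not a replacement for any old host, i.e. the ones added
# in the "Increase cluster size to 5 hosts" step.
expansion_hosts = sorted(set(new_hosts) - set(replacements.values()))
```

Here replacements reproduces the six old-to-new pairs from the migration list, and expansion_hosts contains the four hosts listed under the cluster size increase.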

Event Timeline

herron triaged this task as Normal priority. · Jun 4 2019, 5:01 PM
herron created this task.
Restricted Application added a subscriber: Aklapper. · Jun 4 2019, 5:01 PM
herron added a comment. · Jun 4 2019, 5:04 PM

FWIW, kafka2001 is the current controller, so I'm thinking we should start with kafka2003 -> kafka-main2003.
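
One way to read the comment above is "migrate the non-controller brokers first and leave the current controller for last". The sketch below illustrates that reading only. The function name migration_order is hypothetical, and the descending sort is an assumption chosen so that kafka2003 comes first, which matches the order later recorded in this timeline (kafka2003, kafka2002, kafka2001).

```python
def migration_order(brokers, controller):
    """Return brokers in migration order, with the active controller last.

    Hypothetical helper: non-controller brokers are sorted in descending
    name order (an assumption), and the controller is appended at the end.
    """
    if controller not in brokers:
        raise ValueError(f"{controller} is not in the broker list")
    others = sorted((b for b in brokers if b != controller), reverse=True)
    return others + [controller]


codfw_order = migration_order(
    ["kafka2001", "kafka2002", "kafka2003"], controller="kafka2001"
)
```

In practice the controller would need to be looked up at migration time, since it can change after any broker restart.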

Change 514361 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka2003 hardware with kafka-main2003

https://gerrit.wikimedia.org/r/514361

Restricted Application added a project: Analytics. · Jun 5 2019, 10:37 AM
herron added a comment. · Jun 5 2019, 3:35 PM

Here's a first shot at per-host replacement steps for kafka2003 -> kafka-main2003:

https://docs.google.com/document/d/1o7bl1WBzSMymsXGzhWLy1GmMOSo_Evof1PHk2aQXAiE/edit?usp=sharing

Please LMK your thoughts, what you think should be added/improved/re-ordered/etc.

fdans moved this task from Incoming to Radar on the Analytics board. · Jun 6 2019, 4:49 PM

Mentioned in SAL (#wikimedia-operations) [2019-06-25T14:43:07Z] <herron> beginning replacement of kafka2003 with kafka-main2003 T225005

Change 514361 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka2003 hardware with kafka-main2003

https://gerrit.wikimedia.org/r/514361

Mentioned in SAL (#wikimedia-operations) [2019-06-25T17:01:26Z] <herron> finished migration of kafka2003 to kafka-main2003 — enabling alert notifications for kafka-main2003, and leaving kafka2003 disabled T225005

herron updated the task description. · Jun 25 2019, 5:01 PM

Change 519084 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka2003 move to role::spare::system

https://gerrit.wikimedia.org/r/519084

Change 519130 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka2002 hardware with kafka-main2002

https://gerrit.wikimedia.org/r/519130

herron updated the task description. · Jun 25 2019, 9:03 PM

Change 519084 merged by Herron:
[operations/puppet@production] kafka2003 move to role::spare::system

https://gerrit.wikimedia.org/r/519084

Mentioned in SAL (#wikimedia-operations) [2019-06-26T14:16:24Z] <herron> beginning replacement of kafka2002 with kafka-main2002 T225005

Change 519130 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka2002 hardware with kafka-main2002

https://gerrit.wikimedia.org/r/519130

Mentioned in SAL (#wikimedia-operations) [2019-06-26T17:52:09Z] <herron> finished migration of kafka2002 to kafka-main2002 — enabling alert notifications for kafka-main2002, and leaving kafka2002 disabled T225005

Change 519271 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka2002 move to role::spare::system

https://gerrit.wikimedia.org/r/519271

Change 519271 merged by Herron:
[operations/puppet@production] kafka2002 move to role::spare::system

https://gerrit.wikimedia.org/r/519271

herron updated the task description. · Jun 26 2019, 6:48 PM

Change 519273 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka2001 hardware with kafka-main2001

https://gerrit.wikimedia.org/r/519273

Mentioned in SAL (#wikimedia-operations) [2019-06-27T14:33:16Z] <akosiaris> push newer calico outgoing policy rules. T225005

Mentioned in SAL (#wikimedia-operations) [2019-06-27T14:43:20Z] <herron> beginning replacement of kafka2001 with kafka-main2001 T225005

Change 519273 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka2001 hardware with kafka-main2001

https://gerrit.wikimedia.org/r/519273

Mentioned in SAL (#wikimedia-operations) [2019-06-27T18:13:43Z] <herron> kafka2001 -> kafka-main2001 migration complete. re-enabling alerting on kafka-main2001, and moving kafka2001 to role::spare::system T225005

Change 519483 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka2001 move to role::spare::system

https://gerrit.wikimedia.org/r/519483

Change 519483 merged by Herron:
[operations/puppet@production] kafka2001 move to role::spare::system

https://gerrit.wikimedia.org/r/519483

Hm @herron, today we experienced T226808: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap. I think it was caused by the eventstreams service having service::node auto_refresh => false, which I forgot about. For each new server, eventstreams should be depooled, have puppet run, and then be restarted. The same goes for change-prop, and possibly change-prop-job-queue. Sorry for not catching this when I reviewed the migration plan.

There seem to be other eventstreams problems that may be unrelated to this change (but possibly triggered by it?); we are still investigating. CC @elukey
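
The follow-up work described in the comment above can be written down as a checklist generator. This is an illustrative sketch only: it reads "each new server" as each newly swapped-in broker, the step labels are descriptions rather than real commands, the final "repool" step is an assumption not stated in the comment, and the service host names are placeholders.

```python
# Descriptive step labels, not real commands. "repool" is an assumed
# final step; the comment only mentions depool, puppet run, and restart.
STEPS = ["depool", "run puppet", "restart", "repool"]


def follow_up_checklist(new_broker, service_hosts):
    """List the manual follow-up steps after swapping in new_broker.

    service_hosts maps a service name to the hosts running it. The host
    names used below are placeholders, not real inventory.
    """
    lines = []
    for service, hosts in service_hosts.items():
        for host in hosts:
            for step in STEPS:
                lines.append(f"[{new_broker}] {service}@{host}: {step}")
    return lines


checklist = follow_up_checklist(
    "kafka-main2001",
    {
        "eventstreams": ["es-host-a"],
        "change-prop": ["cp-host-a", "cp-host-b"],
    },
)
```

Working through one host at a time keeps the rest of the service pool serving traffic while each instance picks up the new broker list.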

herron updated the task description. · Jul 3 2019, 3:42 PM

Change 520465 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: add kafka-main200[45] to the codfw cluster

https://gerrit.wikimedia.org/r/520465

Change 528271 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka1001 hardware with kafka-main1001

https://gerrit.wikimedia.org/r/528271

Change 528275 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] calico: add all kafka-main hosts to k8s eventgate policy

https://gerrit.wikimedia.org/r/528275

Change 528275 merged by Alexandros Kosiaris:
[operations/puppet@production] calico: add all kafka-main hosts to k8s eventgate policy

https://gerrit.wikimedia.org/r/528275

Change 528432 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] calico: add all kafka-main hosts to k8s eventgate policy

https://gerrit.wikimedia.org/r/528432

Change 528432 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] calico: add all kafka-main hosts to k8s eventgate policy

https://gerrit.wikimedia.org/r/528432

herron renamed this task from Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] to Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345]. · Fri, Aug 9, 8:25 PM
herron updated the task description.

Change 529428 had a related patch set uploaded (by Herron; owner: Herron):
[operations/deployment-charts@master] eventgate-main: replace broker kafka1001 with kafka-main1001

https://gerrit.wikimedia.org/r/529428

herron added a comment. · Fri, Aug 9, 8:40 PM

Sounds good. How do the steps and ordering look in https://docs.google.com/document/d/1o7bl1WBzSMymsXGzhWLy1GmMOSo_Evof1PHk2aQXAiE ? And what is the process to deploy the new eventgate-main config?

Hey @Ottomata @elukey, what do you think?

I'm game to begin eqiad migrations this week if possible.

Andrew is on holidays, but it looks good to me!

Ok! Will plan to migrate kafka1001 -> kafka-main1001 tomorrow morning Eastern time

One more thing -- @elukey do you know what the process is to deploy a new eventgate-main config?

elukey added a subscriber: jijiki. · Wed, Aug 14, 4:26 PM

Completely ignorant about it, I'd loop in @jijiki :)

Pchelolo added a comment. · Edited · Wed, Aug 14, 4:30 PM

@herron I believe this is the documentation for it: https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#eventgate_chart_change

It references eventgate-analytics, but the process for eventgate-main is 100% the same.