- Install new kafka-main[12]00[12345] hosts T223493
- Migrate kafka main services from old to new hardware
  - kafka1001 -> kafka-main1001
    - kafka
    - eventbus
    - mirrormaker
  - kafka1002 -> kafka-main1002
    - kafka
    - eventbus
    - mirrormaker
  - kafka1003 -> kafka-main1003
    - kafka
    - eventbus
    - mirrormaker
  - kafka2001 -> kafka-main2001
    - kafka
    - eventbus
    - mirrormaker
  - kafka2002 -> kafka-main2002
    - kafka
    - eventbus
    - mirrormaker
  - kafka2003 -> kafka-main2003
    - kafka
    - eventbus
    - mirrormaker
- Move kafka[12]00[123] to role::spare::system
  - kafka1001
  - kafka1002
  - kafka1003
  - kafka2001
  - kafka2002
  - kafka2003
- Increase cluster size to 5 hosts
  - kafka-main1004
  - kafka-main1005
  - kafka-main2004
  - kafka-main2005
- Redistribute data to utilize all hosts
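The final item, redistributing data once each cluster has 5 brokers, could be approached by generating a partition reassignment plan. Below is a minimal sketch, not the procedure used in this task: the topic name, partition count, replication factor, and broker IDs are all hypothetical, and `round_robin_assignment` is an illustrative helper. The JSON layout follows the format accepted by Kafka's `kafka-reassign-partitions` tool through `--reassignment-json-file`.

```python
import json


def round_robin_assignment(topic, num_partitions, broker_ids, replication_factor):
    """Place each partition's replicas on consecutive brokers, round-robin."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor exceeds number of brokers")
    partitions = []
    for partition in range(num_partitions):
        # Shift the starting broker by one per partition so that both
        # leaders (first replica) and followers spread across all brokers.
        replicas = [
            broker_ids[(partition + offset) % len(broker_ids)]
            for offset in range(replication_factor)
        ]
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": replicas}
        )
    return {"version": 1, "partitions": partitions}


# Hypothetical inputs: one topic with 5 partitions on a 5-broker cluster.
plan = round_robin_assignment("example.topic", 5, [1, 2, 3, 4, 5], 3)
print(json.dumps(plan, indent=2))
```

With 5 partitions, 5 brokers, and 3 replicas, every broker ends up holding 3 replicas and leading exactly one partition, which is the even spread the checklist item aims for.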
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | herron | T220387 Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task)
Resolved | | herron | T217359 Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019.
Open | | None | T225005 Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345]
Event Timeline
Fwiw kafka2001 is the current controller so thinking we should start with kafka2003 -> kafka-main2003
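The ordering suggested above (leave the current controller for last) can be sketched as a small helper. This is only an illustration: `migration_order` is a hypothetical function, and sorting the remaining hosts in descending order is an assumption chosen to match the suggestion to start with kafka2003.

```python
def migration_order(hosts, controller):
    """Return the hosts to migrate, with the current controller last."""
    if controller not in hosts:
        raise ValueError(f"{controller} is not in the host list")
    # Non-controller brokers first (highest name first, as an assumption),
    # so the controller is only replaced at the very end.
    others = sorted((h for h in hosts if h != controller), reverse=True)
    return others + [controller]


order = migration_order(
    ["kafka2001", "kafka2002", "kafka2003"], controller="kafka2001"
)
print(order)  # ['kafka2003', 'kafka2002', 'kafka2001']
```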
Change 514361 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka2003 hardware with kafka-main2003
Here's a first shot at per-host replacement steps for kafka2003 -> kafka-main2003:
https://docs.google.com/document/d/1o7bl1WBzSMymsXGzhWLy1GmMOSo_Evof1PHk2aQXAiE/edit?usp=sharing
Please LMK your thoughts, what you think should be added/improved/re-ordered/etc.
Mentioned in SAL (#wikimedia-operations) [2019-06-25T14:43:07Z] <herron> beginning replacement of kafka2003 with kafka-main2003 T225005
Change 514361 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka2003 hardware with kafka-main2003
Mentioned in SAL (#wikimedia-operations) [2019-06-25T17:01:26Z] <herron> finished migration of kafka2003 to kafka-main2003 — enabling alert notifications for kafka-main2003, and leaving kafka2003 disabled T225005
Change 519084 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka2003 move to role::spare::system
Change 519130 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka2002 hardware with kafka-main2002
Change 519084 merged by Herron:
[operations/puppet@production] kafka2003 move to role::spare::system
Mentioned in SAL (#wikimedia-operations) [2019-06-26T14:16:24Z] <herron> beginning replacement of kafka2002 with kafka-main2002 T225005
Change 519130 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka2002 hardware with kafka-main2002
Mentioned in SAL (#wikimedia-operations) [2019-06-26T17:52:09Z] <herron> finished migration of kafka2002 to kafka-main2002 — enabling alert notifications for kafka-main2002, and leaving kafka2002 disabled T225005
Change 519271 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka2002 move to role::spare::system
Change 519271 merged by Herron:
[operations/puppet@production] kafka2002 move to role::spare::system
Change 519273 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka2001 hardware with kafka-main2001
Mentioned in SAL (#wikimedia-operations) [2019-06-27T14:33:16Z] <akosiaris> push newer calico outgoing policy rules. T225005
Mentioned in SAL (#wikimedia-operations) [2019-06-27T14:43:20Z] <herron> beginning replacement of kafka2001 with kafka-main2001 T225005
Change 519273 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka2001 hardware with kafka-main2001
Mentioned in SAL (#wikimedia-operations) [2019-06-27T18:13:43Z] <herron> kafka2001 -> kafka-main2001 migration complete. re-enabling alerting on kafka-main2001, and moving kafka2001 to role::spare::system T225005
Change 519483 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka2001 move to role::spare::system
Change 519483 merged by Herron:
[operations/puppet@production] kafka2001 move to role::spare::system
Hm @herron, today we experienced T226808: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap. I think the cause is that the eventstreams service has service::node auto_refresh => false, which I forgot about. For each new server, eventstreams should be depooled, have puppet run, and then be restarted. The same goes for change-prop, and possibly change-prop-job-queue. Sorry for not catching this when I reviewed the migration plan.
There seem to be other eventstreams problems that may be unrelated to this change (but possibly triggered by it?); we are still investigating. CC @elukey
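The per-service procedure described in the comment above (depool, run puppet, restart) can be written down as a checklist generator. This is a rough sketch only: `refresh_plan` and the host names are hypothetical, the steps are plain descriptions rather than real commands, and the final repool step is an assumption that the comment does not spell out.

```python
def refresh_plan(services, hosts):
    """List the manual steps for services that do not restart on config change."""
    steps = []
    for service in services:
        for host in hosts:
            steps.append(f"{host}: depool {service}")
            steps.append(f"{host}: run puppet")
            steps.append(f"{host}: restart {service}")
            # Assumption: the host is repooled once the restart succeeds.
            steps.append(f"{host}: repool {service}")
    return steps


# Hypothetical service hosts; the service names come from the comment.
steps = refresh_plan(
    ["eventstreams", "change-prop"], ["service-host1", "service-host2"]
)
for step in steps:
    print(step)
```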
Change 520465 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: add kafka-main200[45] to the codfw cluster
Change 528271 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka1001 hardware with kafka-main1001
Change 528275 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] calico: add all kafka-main hosts to k8s eventgate policy
Change 528275 merged by Alexandros Kosiaris:
[operations/puppet@production] calico: add all kafka-main hosts to k8s eventgate policy
Change 528432 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] calico: add all kafka-main hosts to k8s eventgate policy
Change 528432 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] calico: add all kafka-main hosts to k8s eventgate policy
In addition to the steps in https://phabricator.wikimedia.org/T225005#5292211, we need to update the list of Kafka brokers for eventgate-main in eqiad.
Change 529428 had a related patch set uploaded (by Herron; owner: Herron):
[operations/deployment-charts@master] eventgate-main: replace broker kafka1001 with kafka-main1001
Sounds good. How do the steps and ordering look in https://docs.google.com/document/d/1o7bl1WBzSMymsXGzhWLy1GmMOSo_Evof1PHk2aQXAiE ? And what is the process to deploy the new eventgate-main config?
One more thing -- @elukey do you know what the process is to deploy a new eventgate-main config?
@herron I believe this is the documentation for it https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#eventgate_chart_change
It references eventgate-analytics, but the process for eventgate-main is 100% the same.
Heya!
Yes, that link from Petr is the right one, just replace any eventgate-analytics with eventgate-main.
The easiest thing to do (fewest deployments) would be to add all of the new broker names to all of the broker lists (including eventgate-main) and deploy, even before they are ready. On startup the Kafka client will just use the first broker it finds that responds. Then once all new brokers are up and old ones are down, you can remove the old broker names from the lists and deploy again.
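The two-deployment approach described above can be modelled roughly as follows. This is a simplified sketch of the behaviour as stated in the comment, not the Kafka client's actual bootstrap code: `merged_broker_list` and `first_responding` are hypothetical helpers, and reachability is simulated with a plain set instead of real network checks.

```python
def merged_broker_list(old, new):
    """First deployment: old and new broker names together, without duplicates."""
    merged = []
    for broker in list(old) + list(new):
        if broker not in merged:
            merged.append(broker)
    return merged


def first_responding(brokers, is_reachable):
    """Return the first broker in the list that responds, or None."""
    for broker in brokers:
        if is_reachable(broker):
            return broker
    return None


old = ["kafka1001", "kafka1002", "kafka1003"]
new = ["kafka-main1001", "kafka-main1002", "kafka-main1003"]
transitional = merged_broker_list(old, new)

# Mid-migration (simulated): kafka1001 has been replaced by kafka-main1001.
up_mid = {"kafka1002", "kafka1003", "kafka-main1001"}
print(first_responding(transitional, up_mid.__contains__))

# After migration (simulated): only the new brokers respond.
up_after = set(new)
print(first_responding(transitional, up_after.__contains__))

# Second deployment: drop the old names from the list.
final = [b for b in transitional if b not in old]
print(final)
```

Because the transitional list always contains at least one reachable broker, clients configured with it can start at any point during the migration, which is why only two deployments are needed.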
Change 529428 merged by Herron:
[operations/deployment-charts@master] eventgate-main: add new kafka-main brokers to broker list
Change 534472 had a related patch set uploaded (by Herron; owner: Herron):
[operations/deployment-charts@master] eventgate-main: add new brokers to staging broker list
Change 534472 merged by Herron:
[operations/deployment-charts@master] eventgate-main: add new brokers to staging broker list
The eventgate-main config now includes the new brokers in the broker list.
I plan to move forward with migrating kafka1001 to kafka-main1001 tomorrow morning (Eastern).
Mentioned in SAL (#wikimedia-operations) [2019-09-05T15:23:14Z] <herron> beginning replacement of kafka1001 with kafka-main1001 T225005
Change 528271 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka1001 hardware with kafka-main1001
Change 534634 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: move kafka1001 to role::spare::system
Change 534634 merged by Herron:
[operations/puppet@production] kafka-main: move kafka1001 to role::spare::system
Change 536655 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka1002 hardware with kafka-main1002
@herron FYI last week we decommissioned eventlogging-service-eventbus and removed it from puppet in role::kafka::main. So you won't see it provisioned when you run puppet on the new nodes now.
Mentioned in SAL (#wikimedia-operations) [2019-09-16T18:12:58Z] <herron> migrating kafka1002 to kafka-main1002 T225005
Change 536655 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka1002 hardware with kafka-main1002
Change 537196 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: move kafka1002 to role spare system
Change 537196 merged by Herron:
[operations/puppet@production] kafka-main: move kafka1002 to role spare system
Change 537428 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: replace kafka1003 hardware with kafka-main1003
Mentioned in SAL (#wikimedia-operations) [2019-09-17T14:03:27Z] <herron> migrating kafka1003 to kafka-main1003 T225005
Change 537428 merged by Herron:
[operations/puppet@production] kafka-main: replace kafka1003 hardware with kafka-main1003
Change 537490 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-main: move kafka1003 to role spare system
Change 537490 merged by Herron:
[operations/puppet@production] kafka-main: move kafka1003 to role spare system
Change 534633 abandoned by Herron:
kafka-main: move kafka1001 to role::spare::system
Reason:
dupe of I3c8c23efc2b48534adc6e94c9929bb3a9531c72e
Could you please merge/amend/remove the missing cumin alias? https://gerrit.wikimedia.org/r/c/operations/puppet/+/545094 Thanks!