
Add new kafka brokers kafka-jumbo100[789] to the jumbo-eqiad Kafka cluster
Closed, Resolved (Public)

Description

Set up Kafka on the new Jumbo Brokers.

Event Timeline

One thing: kafka-jumbo1007-9 are on buster, so we'll need to adjust puppet to deploy openjdk-8 instead of 11. After that, I think it should be a simple matter of applying 1. and 2. :)

Second thing: please double-check the hosts as well. I think the partitions are OK, but for example /srv is not at its full size yet:

elukey@kafka-jumbo1007:~$ sudo lvs
  LV   VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root vg0 -wi-ao---- 357.07g
  srv  vg1 -wi-ao----  17.46t

elukey@kafka-jumbo1007:~$ sudo pvs
  PV         VG  Fmt  Attr PSize   PFree
  /dev/sda2  vg0 lvm2 a--  446.34g 89.27g
  /dev/sdb1  vg1 lvm2 a--  <21.83t <4.37t

Given how much space is used on the other brokers, I don't think it will be an issue (we can expand the LVM volume and the ext4 filesystem afterwards), but let me know your thoughts!
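For reference, growing /srv into the remaining free extents later could look roughly like this. It is a minimal sketch, not the procedure used on these hosts: it assumes the logical volume is reachable as /dev/vg1/srv (inferred from the lvs output above) and that the filesystem is ext4. The grow_lv_cmd helper is hypothetical and only prints the command so it can be reviewed first.

```shell
# Print (do not run) the command that grows a logical volume into all free
# space left in its volume group and resizes the filesystem in the same step.
# Assumption: the LV is addressable as /dev/<vg>/<lv>, e.g. /dev/vg1/srv.
grow_lv_cmd() {
    vg="$1"
    lv="$2"
    # -l +100%FREE : allocate every remaining free extent in the VG
    # -r           : resize the filesystem together with the LV
    echo "sudo lvextend -r -l +100%FREE /dev/${vg}/${lv}"
}

grow_lv_cmd vg1 srv
```

Since ext4 can be grown while mounted, this should not require stopping the broker, but that is worth verifying on one host before doing the rest.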

Change 596232 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Allow kafka brokers in the to talk to eacn other's prometheus jmx exporter

https://gerrit.wikimedia.org/r/596232

Change 596232 merged by Ottomata:
[operations/puppet@production] Allow kafka brokers in the to talk to eacn other's prometheus jmx exporter

https://gerrit.wikimedia.org/r/596232

Before doing any migration, I'm going to delete some old/temp/unused/empty topics:

-l
USER_REVISION_CREATES_PER_DOMAIN_PER_SESSION_OTTO1
USER_REVISION_CREATE_OTTO1
WINDOWED_EDITS_OTTO1
__consumer_offsets
__transaction_state
apiaction
atskafka_test_webrequest_text
connect-test
does_not_exist
edisa.mediawiki.job.xxx
edisa.mediawiki.jobrefreshLinks
eqiad.swift.ottotest17.upload-complete
eventLogging-valid-mixed
eventLogging_valid_mixed
ksql__commands
ksql_query_CTAS_USER_REVISION_CREATES_PER_DOMAIN_PER_SESSION_OTTO1-KSQL_Agg_Query_1515085821960-changelog
ksql_query_CTAS_USER_REVISION_CREATES_PER_DOMAIN_PER_SESSION_OTTO1-KSQL_Agg_Query_1515085821960-repartition
ksql_query_CTAS_USER_REVISION_CREATES_PER_DOMAIN_PER_SESSION_OTTO1-KSQL_Agg_Query_1515085880366-changelog
ksql_query_CTAS_USER_REVISION_CREATES_PER_DOMAIN_PER_SESSION_OTTO1-KSQL_Agg_Query_1515085880366-repartition
ksql_query_CTAS_USER_REVISION_CREATES_PER_DOMAIN_PER_SESSION_OTTO1-KSQL_Agg_Query_1515086320017-changelog
ksql_query_CTAS_USER_REVISION_CREATES_PER_DOMAIN_PER_SESSION_OTTO1-KSQL_Agg_Query_1515086320017-repartition
ksql_query_CTAS_WINDOWED_EDITS_OTTO1-KSQL_Agg_Query_1515087323836-changelog
ksql_query_CTAS_WINDOWED_EDITS_OTTO1-KSQL_Agg_Query_1515087323836-repartition
ksql_query_CTAS_WINDOWED_EDITS_OTTO1-KSQL_Agg_Query_1517342517779-changelog
ksql_query_CTAS_WINDOWED_EDITS_OTTO1-KSQL_Agg_Query_1517342517779-repartition
ksql_transient_191992346661370295_1517342539309-KSTREAM-REDUCE-STATE-STORE-0000000003-changelog
ksql_transient_1938206695746484381_1515087269002-KSQL_Agg_Query_1515087269000-changelog
ksql_transient_1938206695746484381_1515087269002-KSQL_Agg_Query_1515087269000-repartition
ksql_transient_2553420146253236669_1515086157500-KSTREAM-REDUCE-STATE-STORE-0000000003-changelog
ksql_transient_3413554931857454656_1515085910015-KSTREAM-REDUCE-STATE-STORE-0000000003-changelog
ksql_transient_5179313399353440185_1515086454743-KSTREAM-REDUCE-STATE-STORE-0000000003-changelog
ksql_transient_6131579099634200457_1517342471565-KSQL_Agg_Query_1517342471532-changelog
ksql_transient_6131579099634200457_1517342471565-KSQL_Agg_Query_1517342471532-repartition
ksql_transient_7684896129343784672_1517333884318-KSQL_Agg_Query_1517333884302-changelog
ksql_transient_7684896129343784672_1517333884318-KSQL_Agg_Query_1517333884302-repartition
ksql_transient_7774396194685631972_1515086325820-KSTREAM-REDUCE-STATE-STORE-0000000003-changelog
ksql_transient_7953828792787172621_1517334071509-KSQL_Agg_Query_1517334071487-changelog
ksql_transient_7953828792787172621_1517334071509-KSQL_Agg_Query_1517334071487-repartition
ksql_transient_8218116125491884087_1515087342711-KSTREAM-REDUCE-STATE-STORE-0000000003-changelog
ksql_transient_8351551296540040971_1515087208641-KSQL_Agg_Query_1515087208636-changelog
ksql_transient_8351551296540040971_1515087208641-KSQL_Agg_Query_1515087208636-repartition
mediawiki.page-links-chan
mediawiki.revision-create
mediawiki_ApiAction
mediawiki_CirrusSearchRequestSet
otto4
otto5
otto_test5
revision-create
temp_NavigationTiming
temp_NavigationTiming_replay
test
test_otto0
test_otto1
test_otto2
test_otto3
test_otto4
test_otto5
virtualpageview
webrequest
wmf_netflow
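A cleanup like this could be scripted roughly as follows. This is only a sketch under assumptions: the ZooKeeper connect string is a placeholder, the print_topic_deletes helper is hypothetical, and the kafka-topics invocation uses the older --zookeeper form (newer Kafka releases take --bootstrap-server instead). It prints the delete commands rather than running them, and it skips Kafka-internal topics (names starting with two underscores) as a safety measure.

```shell
# Read topic names on stdin, one per line, and print a delete command for each.
# Nothing is executed; review the output before running any of it.
print_topic_deletes() {
    zk="$1"   # hypothetical ZooKeeper connect string
    while IFS= read -r topic; do
        case "$topic" in
            ''|__*) continue ;;   # skip blank lines and Kafka-internal topics
        esac
        echo "kafka-topics --zookeeper ${zk} --delete --topic ${topic}"
    done
}

# Example input; in practice the reviewed list of topic names would be piped in.
printf '%s\n' test_otto0 __consumer_offsets otto4 |
    print_topic_deletes zk.example.org:2181/kafka
```

Printing first and executing second makes it easy to diff the planned deletions against the reviewed list before anything is removed.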

Change 597061 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add kafka-jumbo1007 to jumbo-eqiad brokers

https://gerrit.wikimedia.org/r/597061

Change 597061 merged by Ottomata:
[operations/puppet@production] Add kafka-jumbo1007 to jumbo-eqiad brokers

https://gerrit.wikimedia.org/r/597061

Change 597076 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] apt: add thirdparty/confluent to buster-wikimedia

https://gerrit.wikimedia.org/r/597076

Change 597076 merged by Ottomata:
[operations/puppet@production] apt: add thirdparty/confluent to buster-wikimedia

https://gerrit.wikimedia.org/r/597076

Change 597097 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use apt::package_from_component in confluent::kafka::common

https://gerrit.wikimedia.org/r/597097

Change 597097 merged by Ottomata:
[operations/puppet@production] Use apt::package_from_component in confluent::kafka::common

https://gerrit.wikimedia.org/r/597097

Change 597134 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add kafka-jumbo100[89] into the jumbo-eqiad kafka cluster

https://gerrit.wikimedia.org/r/597134

Change 597134 merged by Ottomata:
[operations/puppet@production] Add kafka-jumbo100[89] into the jumbo-eqiad kafka cluster

https://gerrit.wikimedia.org/r/597134

Change 602087 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Add kafka-jumbo100[7-9] to network policy for eventgate-analytics and eventgate-analytics-external

https://gerrit.wikimedia.org/r/602087

Change 602087 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] Add kafka-jumbo100[7-9] to network policy for eventgate-analytics and eventgate-analytics-external

https://gerrit.wikimedia.org/r/602087

@elukey @akosiaris @ayounsi I think the Analytics VLAN ACLs need to be adjusted to allow connections to these new hosts. This is currently causing some ingestion issues with newly created EventLogging topics. Would appreciate some help ASAP.

Specifically, Analytics VLAN should allow connections to kafka-jumbo100[1-9] on ports 9092 and 9093. I think 100[7-9] need to be added to this rule.
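A quick way to confirm the rule from a host inside the Analytics VLAN is to probe every broker on both ports. The sketch below assumes the brokers resolve under the eqiad.wmnet domain (not stated in this task) and that nc is installed; list_endpoints and check_endpoints are hypothetical helper names.

```shell
# Print "host port" pairs for every broker and Kafka port named in the request.
# Assumption: hosts resolve under the eqiad.wmnet domain.
list_endpoints() {
    for n in 1 2 3 4 5 6 7 8 9; do
        for port in 9092 9093; do
            echo "kafka-jumbo100${n}.eqiad.wmnet ${port}"
        done
    done
}

# Probe each endpoint with a 3-second TCP connect timeout and report the result.
check_endpoints() {
    list_endpoints | while read -r host port; do
        if nc -z -w 3 "$host" "$port" </dev/null 2>/dev/null; then
            echo "OK   ${host}:${port}"
        else
            echo "FAIL ${host}:${port}"
        fi
    done
}

# Only list the endpoints here; call check_endpoints from a host in the VLAN.
list_endpoints
```

Any FAIL line for 100[7-9] alongside OK lines for 100[1-6] would point at the filter rather than at the brokers themselves.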

Thank you!

Change 604810 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/homer/public@master] Add kafka-jumbo100[7-9] to analytics-in4 and analytics-in6 filters

https://gerrit.wikimedia.org/r/604810

Change 605098 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/homer/public@master] Replace kafka[12]00[123] with kafka-main* in analyitcs-in4/6 filters

https://gerrit.wikimedia.org/r/605098

Change 604810 merged by Elukey:
[operations/homer/public@master] Add kafka-jumbo100[7-9] to analytics-in4 and analytics-in6 filters

https://gerrit.wikimedia.org/r/604810

Change 605098 merged by Elukey:
[operations/homer/public@master] Replace kafka[12]00[123] with kafka-main* in analyitcs-in4/6 filters

https://gerrit.wikimedia.org/r/605098

Mentioned in SAL (#wikimedia-operations) [2020-06-12T08:48:37Z] <elukey> update cr1/cr2 analyitics filters for T252767 and T252675

Change 605225 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Add AAAA/PTR records for kafka-jumbo100[7-9]

https://gerrit.wikimedia.org/r/605225

Change 605225 merged by Elukey:
[operations/dns@master] Add AAAA/PTR records for kafka-jumbo100[7-9]

https://gerrit.wikimedia.org/r/605225

I had a chat with Andrew; this task can now be considered done. We'll move/shuffle partitions for optimal balance across brokers in another task.

elukey set Final Story Points to 8.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.

And, to triple-confirm: have all the new hosts been added to the Analytics VLAN?