
Migrate colocated kafka-logging brokers to dedicated kafka-logging hosts
Closed, Resolved, Public

Description

Kafka-logging hardware is now online; time to migrate the brokers colocated on the logstash nodes to dedicated hardware.

  • logstash1010 -> kafka-logging1001
  • logstash1011 -> kafka-logging1002
  • logstash1012 -> kafka-logging1003
  • logstash2001 -> kafka-logging2001
  • logstash2002 -> kafka-logging2002
  • logstash2003 -> kafka-logging2003

Event Timeline

herron triaged this task as Medium priority. Apr 5 2021, 7:40 PM
herron created this task.

Change 677009 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: migrate broker logstash1010 to kafka-logging1001

https://gerrit.wikimedia.org/r/677009

Mentioned in SAL (#wikimedia-operations) [2021-04-13T15:26:15Z] <herron> migrating kafka-logging broker logstash1010 to kafka-logging1001 T279342

Change 677009 merged by Herron:

[operations/puppet@production] kafka-logging: migrate broker logstash1010 to kafka-logging1001

https://gerrit.wikimedia.org/r/677009

Change 679411 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: migrate broker logstash1011 to kafka-logging1002

https://gerrit.wikimedia.org/r/679411

Mentioned in SAL (#wikimedia-operations) [2021-04-14T19:42:04Z] <herron> migrating kafka-logging broker logstash1011 to kafka-logging1002 T279342

Change 679411 merged by Herron:

[operations/puppet@production] kafka-logging: migrate broker logstash1011 to kafka-logging1002

https://gerrit.wikimedia.org/r/679411

Change 679740 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/homer/public@master] Add kafka-logging100{2,3} IPs to the kafka term of analytics filters

https://gerrit.wikimedia.org/r/679740

@herron the move to the new brokers is causing some trouble for analytics and wmcs, since the new IPs are not listed in the VLAN firewall filters. For example:

https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=20&orgId=1&refresh=5m&from=now-24h&to=now&var-server=cloudvirt1012&var-datasource=eqiad%20prometheus%2Fops&var-cluster=wmcs

I added 1001 yesterday, and filed https://gerrit.wikimedia.org/r/679740 today to add 1002/1003, but we should have done it before starting :D

Pinging also @aborrero @dcaro, they will probably make a similar change to homer_public.

Also, for some reason the AAAA records are not present in our Auth DNS servers, but I see them listed in Netbox.

Change 679740 merged by Elukey:

[operations/homer/public@master] Add kafka-logging100{2,3} IPs to the kafka term of analytics filters

https://gerrit.wikimedia.org/r/679740

Hey @elukey thanks for taking care of that! Was not aware that needed to happen ahead of migrating brokers, TIL. Have added that to the migration process as an early step. I'll upload a similar patch to cover the codfw broker migrations.

With regard to AAAA records, this is intentional for now. There is T271138, which includes enabling IPv6 on the logging cluster. Planning to tackle that independently of the broker migrations.
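
The "firewall filters first" step discussed above could be checked before each broker move with something like the following. This is only a minimal sketch: the function name, the broker addresses (taken from the RFC 5737 documentation ranges) and the filter contents are all hypothetical, and the real filter terms live in the homer repository, not in a Python list.

```python
import ipaddress


def brokers_missing_from_filter(broker_ips, allowed_prefixes):
    """Return the names of brokers whose IP is not covered by any allowed prefix."""
    networks = [ipaddress.ip_network(prefix) for prefix in allowed_prefixes]
    missing = []
    for name, ip in broker_ips.items():
        addr = ipaddress.ip_address(ip)
        # A v4 address is never "in" a v6 network (and vice versa), so mixed lists are safe.
        if not any(addr in net for net in networks):
            missing.append(name)
    return sorted(missing)


# Hypothetical addresses (RFC 5737 documentation ranges), NOT the real broker IPs.
new_brokers = {
    "kafka-logging1001": "192.0.2.11",
    "kafka-logging1002": "192.0.2.12",
    "kafka-logging1003": "192.0.2.13",
}
# Hypothetical content of the "kafka" filter term: only the first broker was added.
kafka_term = ["192.0.2.11/32", "198.51.100.0/24"]

missing = brokers_missing_from_filter(new_brokers, kafka_term)
print("brokers not yet allowed by the filter:", missing)
```

An empty list would mean every new broker is already covered, so the migration can start without cutting off clients behind the filtered VLANs.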

Change 679862 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/homer/public@master] cr/firewall: add kafka-logging servers to labs-in filters

https://gerrit.wikimedia.org/r/679862

Change 679875 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: migrate broker logstash1012 to kafka-logging1003

https://gerrit.wikimedia.org/r/679875

Mentioned in SAL (#wikimedia-operations) [2021-04-15T20:03:34Z] <herron> migrating kafka-logging broker logstash1012 to kafka-logging1003 T279342

Change 679875 merged by Herron:

[operations/puppet@production] kafka-logging: migrate broker logstash1012 to kafka-logging1003

https://gerrit.wikimedia.org/r/679875

Change 679862 merged by jenkins-bot:

[operations/homer/public@master] cr/firewall: add kafka-logging servers to labs-in filters

https://gerrit.wikimedia.org/r/679862

Mentioned in SAL (#wikimedia-operations) [2021-04-16T10:44:48Z] <arturo> merging homer change to cr-eqiad (T279342)

Thanks @elukey and @ayounsi, all is apparently done on the WMCS side. Ping me if something else is needed.

Change 680380 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-logging-external - update networkpolicy with new kafka brokers

https://gerrit.wikimedia.org/r/680380

Change 680380 merged by Ottomata:

[operations/deployment-charts@master] eventgate-logging-external - update networkpolicy with new kafka brokers

https://gerrit.wikimedia.org/r/680380

Change 683012 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: migrate logstash2001 broker to kafka-logging2001

https://gerrit.wikimedia.org/r/683012

Change 683013 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: migrate logstash2002 broker to kafka-logging2002

https://gerrit.wikimedia.org/r/683013

Change 683014 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: migrate logstash2003 broker to kafka-logging2003

https://gerrit.wikimedia.org/r/683014

Change 683047 had a related patch set uploaded (by Herron; author: Herron):

[operations/deployment-charts@master] eventgate-logging-external: add new codfw kafka-logging hosts to network policy

https://gerrit.wikimedia.org/r/683047

Change 683050 had a related patch set uploaded (by Herron; author: Herron):

[operations/homer/public@master] add kafka-logging200[123] to kafka term

https://gerrit.wikimedia.org/r/683050

Reimaging the eqiad kafka-logging hosts and configuring them with a raid50 layout. This will give us 5T usable per host (as opposed to the current 3T with raid10).

  • kafka-logging1001
  • kafka-logging1002
  • kafka-logging1003
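
For reference, the usable-capacity arithmetic behind a raid10 vs raid50 comparison can be sketched as below. The disk count and disk size are assumptions picked only because they reproduce the 3T and 5T figures quoted above; the task does not state the actual hardware layout.

```python
def raid10_usable(disks, disk_tb):
    """RAID 10 mirrors every disk, so half of the raw capacity is usable."""
    if disks < 4 or disks % 2:
        raise ValueError("RAID 10 needs an even number of disks, at least 4")
    return disks * disk_tb / 2


def raid50_usable(disks, disk_tb, spans=2):
    """RAID 50 stripes across RAID 5 spans; each span gives up one disk to parity."""
    if spans < 2 or disks % spans or disks // spans < 3:
        raise ValueError("RAID 50 needs >= 2 equal spans of >= 3 disks each")
    return (disks - spans) * disk_tb


# ASSUMPTION: 12 disks of 0.5 TB each, two RAID 5 spans. Not confirmed by the task.
disks, disk_tb = 12, 0.5
print("raid10 usable TB:", raid10_usable(disks, disk_tb))
print("raid50 usable TB:", raid50_usable(disks, disk_tb, spans=2))
```

The extra space comes at a cost in redundancy: RAID 50 survives only one failed disk per span, while RAID 10 survives one failed disk per mirror pair.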

Change 685090 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash101[012]: prep for reimaging

https://gerrit.wikimedia.org/r/685090

Change 685090 merged by Herron:

[operations/puppet@production] logstash101[012]: prep for reimaging

https://gerrit.wikimedia.org/r/685090

Change 683050 merged by jenkins-bot:

[operations/homer/public@master] add kafka-logging200[123] to kafka term

https://gerrit.wikimedia.org/r/683050

Change 683047 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-logging-external: add codfw kafka-logging hosts

https://gerrit.wikimedia.org/r/683047

The kafka-logging100[1-3] hosts have /srv partitions slowly filling up, mostly due to two topics:

elukey@kafka-logging1001:/srv/kafka/data$ sudo du -hs -- * | sort -h | tail -n 20
24G	rsyslog-info-4
24G	rsyslog-info-5
37G	udp_localhost-info-2
37G	udp_localhost-info-5
38G	udp_localhost-info-0
38G	udp_localhost-info-1
38G	udp_localhost-info-3
38G	udp_localhost-info-4
279G	rsyslog-notice-0
279G	rsyslog-notice-1
279G	rsyslog-notice-2
279G	rsyslog-notice-3
279G	rsyslog-notice-4
279G	rsyslog-notice-5
388G	udp_localhost-warning-1
388G	udp_localhost-warning-3
388G	udp_localhost-warning-4
388G	udp_localhost-warning-5
389G	udp_localhost-warning-0
389G	udp_localhost-warning-2

I briefly checked udp_localhost-warning and there is definitely some spammy logging from mediawiki going on, but maybe it is needed for debugging purposes. Maybe we could lower the retention for some topics?
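
To see how much each topic takes in total, the per-partition sizes above can be summed per topic. A minimal sketch, assuming every size is reported in G and that the partition directories are named `<topic>-<partition>`; the helper name is made up for illustration. Since the listing is `tail -n 20`, totals for topics with partitions cut off (e.g. rsyslog-info) are partial, and `du -h` rounds, so the sums are approximate.

```python
from collections import defaultdict

# Per-partition sizes copied from the du output above (du separates columns with a tab;
# split() below accepts any whitespace).
DU_OUTPUT = """\
24G rsyslog-info-4
24G rsyslog-info-5
37G udp_localhost-info-2
37G udp_localhost-info-5
38G udp_localhost-info-0
38G udp_localhost-info-1
38G udp_localhost-info-3
38G udp_localhost-info-4
279G rsyslog-notice-0
279G rsyslog-notice-1
279G rsyslog-notice-2
279G rsyslog-notice-3
279G rsyslog-notice-4
279G rsyslog-notice-5
388G udp_localhost-warning-1
388G udp_localhost-warning-3
388G udp_localhost-warning-4
388G udp_localhost-warning-5
389G udp_localhost-warning-0
389G udp_localhost-warning-2
"""


def topic_totals_g(du_text):
    """Sum `du -h` sizes (G only) per topic, stripping the trailing partition number."""
    totals = defaultdict(float)
    for line in du_text.splitlines():
        if not line.strip():
            continue
        size, name = line.split(None, 1)
        if not size.endswith("G"):
            raise ValueError(f"unsupported size unit in {line!r}")
        # Topic names may contain dashes, so split on the LAST dash only.
        topic, _, _partition = name.strip().rpartition("-")
        totals[topic] += float(size[:-1])
    return dict(totals)


totals = topic_totals_g(DU_OUTPUT)
for topic, size in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{size:7.0f}G  {topic}")
```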

Change 699418 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/cookbooks@master] sre/kafka/* update kafka cluster choices

https://gerrit.wikimedia.org/r/699418

Change 699418 merged by Ottomata:

[operations/cookbooks@master] sre/kafka/* update kafka cluster choices

https://gerrit.wikimedia.org/r/699418

Change 683012 merged by Herron:

[operations/puppet@production] kafka-logging: migrate logstash2001 broker to kafka-logging2001

https://gerrit.wikimedia.org/r/683012

Change 701440 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] add kafka-logging200[123] to kafka_brokers_logging

https://gerrit.wikimedia.org/r/701440

Change 701440 merged by Herron:

[operations/puppet@production] add kafka-logging200[123] to kafka_brokers_logging

https://gerrit.wikimedia.org/r/701440

Change 683013 merged by Herron:

[operations/puppet@production] kafka-logging: migrate logstash2002 broker to kafka-logging2002

https://gerrit.wikimedia.org/r/683013

Change 683014 merged by Herron:

[operations/puppet@production] kafka-logging: migrate logstash2003 broker to kafka-logging2003

https://gerrit.wikimedia.org/r/683014

herron claimed this task.
herron updated the task description.

Change 777375 had a related patch set uploaded (by Herron; author: Herron):

[operations/cookbooks@master] sre.kafka.reboot-workers: add logging-codfw targets

https://gerrit.wikimedia.org/r/777375

Change 777375 merged by jenkins-bot:

[operations/cookbooks@master] sre.kafka.reboot-workers: add logging-codfw targets

https://gerrit.wikimedia.org/r/777375