
Expand kafka-logging using hosts kafka-logging[12]00[45]
Closed, Resolved (Public)

Description

With T313960 and T313959 completed we have kafka-logging[12]00[45] ready to be added to the kafka-logging clusters.

At a high level this will involve:

codfw:

  • Deploy puppetization to new hosts
  • Validate the kafka-logging config on Bullseye (a quick smoke test is sketched after this list)
  • Review/update broker lists and ACLs throughout the infrastructure
  • Rebalance to take full advantage of the new hosts (topicmappr produced a no-op relocation plan)
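
For the Bullseye validation, a minimal smoke test on a newly joined broker could look like the sketch below. This assumes the site kafka CLI wrapper (the same one used in the reassignment log entries further down) and the kafka systemd unit name; it is a sketch, not a prescribed procedure.

  # confirm the broker service is up on the new host (unit name assumed)
  systemctl status kafka.service
  # list any under-replicated partitions cluster-wide; empty output is the healthy case
  kafka topics --describe --under-replicated-partitions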

eqiad:

  • Replace broker id 1004 using kafka-logging1004
  • Replace broker id 1005 using kafka-logging1005
  • Reimage and deploy kafka-logging100[12] with broker ids 100[12]
  • Review/update broker lists and ACLs throughout the infrastructure
  • Rebalance to take full advantage of new hosts
eqiad reassignment status (each topic checked for completion as sketched below):
  mediawiki.httpd.accesslog: completed successfully
  mediawiki.httpd.accesslog-sampled: completed successfully
  rsyslog-info: completed successfully
  rsyslog-notice: completed successfully
  rsyslog-warning: completed successfully
  udp_localhost-err: completed successfully
  udp_localhost-info: completed successfully
  udp_localhost-warning: completed successfully
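
A minimal sketch of how each per-topic reassignment above can be checked for completion, assuming the same kafka CLI wrapper and per-topic JSON plan files used in the log entries further down (the filename is illustrative):

  # --verify reports whether the reassignment finished and clears the replication throttle
  kafka reassign-partitions --reassignment-json-file mediawiki.httpd.accesslog.json --verify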

As a closely related step we'll also be upgrading the pre-existing kafka-logging[12]00[123] hosts to Bullseye (T326420).

Note: in T225005 we performed a similar expansion on kafka-main, which we can refer to for inspiration.

Event Timeline

herron triaged this task as Medium priority. Jan 6 2023, 3:43 PM
herron created this task.

Change 877257 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: add kafka-logging200[45] to codfw cluster

https://gerrit.wikimedia.org/r/877257

Looking at the eqiad kafka-logging cluster, this is a good opportunity to re-align broker IDs as well.

Currently the config looks like this, with hostname numbers and broker IDs not matching (a remnant from when these brokers were co-located with the logstash backends):

logging-eqiad:
  zookeeper_cluster_name: main-eqiad
  brokers:
    kafka-logging1001.eqiad.wmnet:
      id: 1004
      rack: B
    kafka-logging1002.eqiad.wmnet:
      id: 1005
      rack: C
    kafka-logging1003.eqiad.wmnet:
      id: 1006
      rack: D

To address this I'm thinking of approaching the eqiad kafka-logging expansion like this:

  • Replace broker ids 100[45] with the new hosts kafka-logging100[45]
  • Reimage kafka-logging100[12] and add them to the cluster as new broker IDs 100[12]
  • Look further into renumbering kafka-logging1003 as id 1003, potentially as part of the rebalancing stage (the broker IDs currently registered in the cluster can be confirmed as sketched below)
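
As a sanity check while swapping broker IDs, the IDs the cluster actually sees can be listed from any client with the stock Kafka tooling. A minimal sketch; the bootstrap host and plaintext port 9092 are assumptions, not values taken from this cluster's config:

  # each live broker is printed with its id and rack from the cluster metadata
  kafka-broker-api-versions.sh --bootstrap-server kafka-logging1003.eqiad.wmnet:9092 \
    | grep '(id: '
  # expected shape of the matching lines:
  #   kafka-logging1003.eqiad.wmnet:9092 (id: 1006 rack: D) -> ( ...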

Change 881652 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging200[45]: disable notifications

https://gerrit.wikimedia.org/r/881652

Change 881652 merged by Herron:

[operations/puppet@production] kafka-logging200[45]: disable notifications

https://gerrit.wikimedia.org/r/881652

Change 877257 merged by Herron:

[operations/puppet@production] kafka-logging: add kafka-logging200[45] to codfw cluster

https://gerrit.wikimedia.org/r/877257

Change 900336 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: stop kafka services on kafka-logging1001

https://gerrit.wikimedia.org/r/900336

Change 900337 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: bring up kafka-logging1004 with node id 1004

https://gerrit.wikimedia.org/r/900337

Change 900336 merged by Herron:

[operations/puppet@production] kafka-logging: stop kafka services on kafka-logging1001

https://gerrit.wikimedia.org/r/900336

Change 900337 merged by Herron:

[operations/puppet@production] kafka-logging: bring up kafka-logging1004 with node id 1004

https://gerrit.wikimedia.org/r/900337

Change 907504 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: stop kafka service on kafka-logging1002

https://gerrit.wikimedia.org/r/907504

Change 907505 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: bring up kafka-logging1005 with node id 1005

https://gerrit.wikimedia.org/r/907505

Change 907504 merged by Herron:

[operations/puppet@production] kafka-logging: stop kafka service on kafka-logging1002

https://gerrit.wikimedia.org/r/907504

Change 907505 merged by Herron:

[operations/puppet@production] kafka-logging: bring up kafka-logging1005 with node id 1005

https://gerrit.wikimedia.org/r/907505

@herron Hi! Let's pause this task for a moment to coordinate; I think we have some issues in T334510 due to the new configs :(

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1001 for host kafka-logging1001.eqiad.wmnet with OS bullseye

Change 911872 had a related patch set uploaded (by Herron; author: Herron):

[operations/deployment-charts@master] services: add kafka-logging100[12] to network rules and broker list

https://gerrit.wikimedia.org/r/911872

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1001 for host kafka-logging1001.eqiad.wmnet with OS bullseye completed:

  • kafka-logging1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202304251401_herron_3759326_kafka-logging1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1001 for host kafka-logging1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1001 for host kafka-logging1002.eqiad.wmnet with OS bullseye completed:

  • kafka-logging1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202304251435_herron_3770126_kafka-logging1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 911883 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: add kafka-logging100[12] with node ids 100[12]

https://gerrit.wikimedia.org/r/911883

Change 911883 merged by Herron:

[operations/puppet@production] kafka-logging: add kafka-logging100[12] with node ids 100[12]

https://gerrit.wikimedia.org/r/911883

Change 911888 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: assign kafka::logging role to kafka-logging100[12]

https://gerrit.wikimedia.org/r/911888

Change 911888 merged by Herron:

[operations/puppet@production] kafka-logging: assign kafka::logging role to kafka-logging100[12]

https://gerrit.wikimedia.org/r/911888

Change 911872 merged by Elukey:

[operations/deployment-charts@master] services: add kafka-logging100[12] to network rules and broker list

https://gerrit.wikimedia.org/r/911872

herron moved this task from Working on to Backlog on the User-herron board.

Using topicmappr, I've generated the plan below to rebalance kafka-logging eqiad:

Broker 1006 relocations planned:
  [259.35GB] mediawiki.httpd.accesslog p1 -> 1002
  [259.18GB] mediawiki.httpd.accesslog p5 -> 1001
  [259.03GB] mediawiki.httpd.accesslog p2 -> 1002
  [259.00GB] mediawiki.httpd.accesslog p0 -> 1001
  [144.00GB] udp_localhost-warning p4 -> 1002
  [58.22GB] udp_localhost-info p3 -> 1001
  [38.06GB] rsyslog-notice p1 -> 1001

Broker 1005 relocations planned:
  [259.35GB] mediawiki.httpd.accesslog p1 -> 1001
  [259.10GB] mediawiki.httpd.accesslog p4 -> 1002
  [259.03GB] mediawiki.httpd.accesslog p2 -> 1001
  [258.60GB] mediawiki.httpd.accesslog p3 -> 1002
  [144.00GB] udp_localhost-warning p4 -> 1001
  [58.22GB] udp_localhost-info p3 -> 1002
  [17.45GB] udp_localhost-err p3 -> 1001
  [10.62GB] rsyslog-info p1 -> 1001
  [3.44GB] mediawiki.httpd.accesslog-sampled p5 -> 1002
  [2.71GB] rsyslog-warning p5 -> 1001

Broker 1004 relocations planned:
  [259.18GB] mediawiki.httpd.accesslog p5 -> 1002
  [259.10GB] mediawiki.httpd.accesslog p4 -> 1001
  [259.00GB] mediawiki.httpd.accesslog p0 -> 1002
  [258.60GB] mediawiki.httpd.accesslog p3 -> 1001
  [143.98GB] udp_localhost-warning p0 -> 1002
  [57.96GB] udp_localhost-info p5 -> 1001
  [10.71GB] rsyslog-info p2 -> 1001
  [3.43GB] mediawiki.httpd.accesslog-sampled p4 -> 1001
  [3.42GB] mediawiki.httpd.accesslog-sampled p0 -> 1001

Broker distribution:
  degree [min/max/avg]: 2/4/3.20 -> 2/4/3.20
  -
  Broker 2001 - leader: 55, follower: 112, total: 167
  Broker 2002 - leader: 55, follower: 109, total: 164
  Broker 2003 - leader: 55, follower: 110, total: 165
  Broker 2004 - leader: 2, follower: 7, total: 9
  Broker 2005 - leader: 2, follower: 0, total: 2

Storage free change estimations:
  range: 1725.68GB -> 1725.68GB
  range spread: 76.92% -> 76.92%
  std. deviation: 845.14GB -> 845.14GB
  min-max: 2243.60GB, 3969.27GB -> 2243.60GB, 3969.27GB
  -
  Broker 2001: 2245.10 -> 2245.10 (+0.00GB, 0.00%)
  Broker 2002: 2243.60 -> 2243.60 (+0.00GB, 0.00%)
  Broker 2003: 2243.73 -> 2243.73 (+0.00GB, 0.00%)
  Broker 2004: 3969.27 -> 3969.27 (+0.00GB, 0.00%)
  Broker 2005: 3969.27 -> 3969.27 (+0.00GB, 0.00%)

I'll plan to begin applying this one topic at a time over the coming days (a sketch of a single per-topic step follows).
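
Each per-topic step consumes a small reassignment JSON file in the standard Kafka format and applies it with a replication throttle. A minimal sketch, assuming the kafka CLI wrapper used in the log entries below; the partition/replica assignments shown are illustrative, the real ones come from the topicmappr plan above:

  # write the per-topic plan: one entry per partition with its target broker id list
  cat > mediawiki.httpd.accesslog.json <<'EOF'
  {
    "version": 1,
    "partitions": [
      {"topic": "mediawiki.httpd.accesslog", "partition": 0, "replicas": [1001, 1002, 1004]},
      {"topic": "mediawiki.httpd.accesslog", "partition": 1, "replicas": [1002, 1004, 1005]}
    ]
  }
  EOF

  # apply one topic at a time, throttling inter-broker replication to ~50 MB/s
  kafka reassign-partitions --reassignment-json-file mediawiki.httpd.accesslog.json \
    --execute --throttle 50000000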

Mentioned in SAL (#wikimedia-operations) [2024-03-18T17:56:51Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file mediawiki.httpd.accesslog.json --execute --throttle 50000000 T326419

Mentioned in SAL (#wikimedia-operations) [2024-03-19T13:50:51Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file mediawiki.httpd.accesslog-sampled.json --execute --throttle 50000000 T326419

Mentioned in SAL (#wikimedia-operations) [2024-03-19T13:55:43Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file rsyslog-info.json --execute --throttle 50000000 T326419

Mentioned in SAL (#wikimedia-operations) [2024-03-19T15:12:46Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file rsyslog-notice.json --execute --throttle 50000000 T326419

lmata changed the task status from Open to In Progress. Mar 19 2024, 3:29 PM
lmata moved this task from Up next to In progress on the SRE Observability (FY2023/2024-Q3) board.

Mentioned in SAL (#wikimedia-operations) [2024-03-19T15:35:48Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file rsyslog-warning.json --execute --throttle 50000000 T326419

Mentioned in SAL (#wikimedia-operations) [2024-03-19T15:39:48Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file udp_localhost-err.json --execute --throttle 50000000 T326419

Mentioned in SAL (#wikimedia-operations) [2024-03-19T15:54:17Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file udp_localhost-info.json --execute --throttle 50000000 T326419

Mentioned in SAL (#wikimedia-operations) [2024-03-19T16:45:09Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file udp_localhost-warning.json --execute --throttle 50000000 T326419

herron claimed this task.

The relocation/rebalance plan outlined in T326419#9639228 has finished running, and a re-run of topicmappr with fresh metrics now shows no proposed moves. Resolving!