
Expand kafka-logging using hosts kafka-logging[12]00[45]
Closed, Resolved (Public)

Description

With T313960 and T313959 completed we have kafka-logging[12]00[45] ready to be added to the kafka-logging clusters.

At a high level this will involve:

codfw:

  • Deploy puppetization to new hosts
  • Validate the kafka-logging config on Bullseye (a quick smoke test is sketched after this list)
  • Review/update broker lists and ACLs throughout the infrastructure
  • Rebalance to take full advantage of the new hosts (topicmappr produced a no-op relocation plan)
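
For the Bullseye validation, a minimal smoke test on a newly joined broker could look like the sketch below. This assumes the site kafka CLI wrapper (the same one used in the reassignment log entries further down) and the kafka systemd unit name; it is a sketch, not a prescribed procedure.

  # confirm the broker service is up on the new host (unit name assumed)
  systemctl status kafka.service
  # list any under-replicated partitions cluster-wide; empty output is the healthy case
  kafka topics --describe --under-replicated-partitions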

eqiad:

  • Replace broker id 1004 using kafka-logging1004
  • Replace broker id 1005 using kafka-logging1005
  • Reimage and deploy kafka-logging100[12] with broker ids 100[12]
  • Review/update broker lists and ACLs throughout the infrastructure
  • Rebalance to take full advantage of new hosts
eqiad reassignment status (each topic checked for completion as sketched below):
  mediawiki.httpd.accesslog: completed successfully
  mediawiki.httpd.accesslog-sampled: completed successfully
  rsyslog-info: completed successfully
  rsyslog-notice: completed successfully
  rsyslog-warning: completed successfully
  udp_localhost-err: completed successfully
  udp_localhost-info: completed successfully
  udp_localhost-warning: completed successfully
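
A minimal sketch of how each per-topic reassignment above can be checked for completion, assuming the same kafka CLI wrapper and per-topic JSON plan files used in the log entries further down (the filename is illustrative):

  # --verify reports whether the reassignment finished and clears the replication throttle
  kafka reassign-partitions --reassignment-json-file mediawiki.httpd.accesslog.json --verify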

As a closely related step we'll also be upgrading the pre-existing kafka-logging[12]00[123] hosts to Bullseye (T326420).

Note: in T225005 we performed a similar expansion on kafka-main, which we can refer to for inspiration.

Event Timeline

herron triaged this task as Medium priority. Jan 6 2023, 3:43 PM
herron created this task.

Change 877257 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: add kafka-logging200[45] to codfw cluster

https://gerrit.wikimedia.org/r/877257

Looking at the eqiad kafka-logging cluster, this is a good opportunity to re-align broker IDs as well.

Currently the config looks like this, with hostname numbers and broker IDs not matching (a remnant from when these brokers were co-located with the logstash backends):

logging-eqiad:
  zookeeper_cluster_name: main-eqiad
  brokers:
    kafka-logging1001.eqiad.wmnet:
      id: 1004
      rack: B
    kafka-logging1002.eqiad.wmnet:
      id: 1005
      rack: C
    kafka-logging1003.eqiad.wmnet:
      id: 1006
      rack: D

To address this I'm thinking of approaching the eqiad kafka-logging expansion like this:

  • Replace broker ids 100[45] with the new hosts kafka-logging100[45]
  • Reimage kafka-logging100[12] and add them to the cluster as new broker IDs 100[12]
  • Look further into renumbering kafka-logging1003 as id 1003, potentially as part of the rebalancing stage (the broker IDs currently registered in the cluster can be confirmed as sketched below)
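
As a sanity check while swapping broker IDs, the IDs the cluster actually sees can be listed from any client with the stock Kafka tooling. A minimal sketch; the bootstrap host and plaintext port 9092 are assumptions, not values taken from this cluster's config:

  # each live broker is printed with its id and rack from the cluster metadata
  kafka-broker-api-versions.sh --bootstrap-server kafka-logging1003.eqiad.wmnet:9092 \
    | grep '(id: '
  # expected shape of the matching lines:
  #   kafka-logging1003.eqiad.wmnet:9092 (id: 1006 rack: D) -> ( ...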

Change 881652 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging200[45]: disable notifications

https://gerrit.wikimedia.org/r/881652

Change 881652 merged by Herron:

[operations/puppet@production] kafka-logging200[45]: disable notifications

https://gerrit.wikimedia.org/r/881652

Change 877257 merged by Herron:

[operations/puppet@production] kafka-logging: add kafka-logging200[45] to codfw cluster

https://gerrit.wikimedia.org/r/877257

Change 900336 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: stop kafka services on kafka-logging1001

https://gerrit.wikimedia.org/r/900336

Change 900337 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: bring up kafka-logging1004 with node id 1004

https://gerrit.wikimedia.org/r/900337

Change 900336 merged by Herron:

[operations/puppet@production] kafka-logging: stop kafka services on kafka-logging1001

https://gerrit.wikimedia.org/r/900336

Change 900337 merged by Herron:

[operations/puppet@production] kafka-logging: bring up kafka-logging1004 with node id 1004

https://gerrit.wikimedia.org/r/900337

Change 907504 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: stop kafka service on kafka-logging1002

https://gerrit.wikimedia.org/r/907504

Change 907505 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: bring up kafka-logging1005 with node id 1005

https://gerrit.wikimedia.org/r/907505

Change 907504 merged by Herron:

[operations/puppet@production] kafka-logging: stop kafka service on kafka-logging1002

https://gerrit.wikimedia.org/r/907504

Change 907505 merged by Herron:

[operations/puppet@production] kafka-logging: bring up kafka-logging1005 with node id 1005

https://gerrit.wikimedia.org/r/907505

@herron Hi! Let's pause this task for a moment to coordinate; I think we have some issues in T334510 due to the new configs :(

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1001 for host kafka-logging1001.eqiad.wmnet with OS bullseye

Change 911872 had a related patch set uploaded (by Herron; author: Herron):

[operations/deployment-charts@master] services: add kafka-logging100[12] to network rules and broker list

https://gerrit.wikimedia.org/r/911872

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1001 for host kafka-logging1001.eqiad.wmnet with OS bullseye completed:

  • kafka-logging1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202304251401_herron_3759326_kafka-logging1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1001 for host kafka-logging1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1001 for host kafka-logging1002.eqiad.wmnet with OS bullseye completed:

  • kafka-logging1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202304251435_herron_3770126_kafka-logging1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 911883 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: add kafka-logging100[12] with node ids 100[12]

https://gerrit.wikimedia.org/r/911883

Change 911883 merged by Herron:

[operations/puppet@production] kafka-logging: add kafka-logging100[12] with node ids 100[12]

https://gerrit.wikimedia.org/r/911883

Change 911888 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-logging: assign kafka::logging role to kafka-logging100[12]

https://gerrit.wikimedia.org/r/911888

Change 911888 merged by Herron:

[operations/puppet@production] kafka-logging: assign kafka::logging role to kafka-logging100[12]

https://gerrit.wikimedia.org/r/911888

Change 911872 merged by Elukey:

[operations/deployment-charts@master] services: add kafka-logging100[12] to network rules and broker list

https://gerrit.wikimedia.org/r/911872

herron moved this task from Working on to Backlog on the User-herron board.

Using topicmappr, I've generated the plan below to rebalance kafka-logging eqiad:

Broker 1006 relocations planned:
  [259.35GB] mediawiki.httpd.accesslog p1 -> 1002
  [259.18GB] mediawiki.httpd.accesslog p5 -> 1001
  [259.03GB] mediawiki.httpd.accesslog p2 -> 1002
  [259.00GB] mediawiki.httpd.accesslog p0 -> 1001
  [144.00GB] udp_localhost-warning p4 -> 1002
  [58.22GB] udp_localhost-info p3 -> 1001
  [38.06GB] rsyslog-notice p1 -> 1001

Broker 1005 relocations planned:
  [259.35GB] mediawiki.httpd.accesslog p1 -> 1001
  [259.10GB] mediawiki.httpd.accesslog p4 -> 1002
  [259.03GB] mediawiki.httpd.accesslog p2 -> 1001
  [258.60GB] mediawiki.httpd.accesslog p3 -> 1002
  [144.00GB] udp_localhost-warning p4 -> 1001
  [58.22GB] udp_localhost-info p3 -> 1002
  [17.45GB] udp_localhost-err p3 -> 1001
  [10.62GB] rsyslog-info p1 -> 1001
  [3.44GB] mediawiki.httpd.accesslog-sampled p5 -> 1002
  [2.71GB] rsyslog-warning p5 -> 1001

Broker 1004 relocations planned:
  [259.18GB] mediawiki.httpd.accesslog p5 -> 1002
  [259.10GB] mediawiki.httpd.accesslog p4 -> 1001
  [259.00GB] mediawiki.httpd.accesslog p0 -> 1002
  [258.60GB] mediawiki.httpd.accesslog p3 -> 1001
  [143.98GB] udp_localhost-warning p0 -> 1002
  [57.96GB] udp_localhost-info p5 -> 1001
  [10.71GB] rsyslog-info p2 -> 1001
  [3.43GB] mediawiki.httpd.accesslog-sampled p4 -> 1001
  [3.42GB] mediawiki.httpd.accesslog-sampled p0 -> 1001

Broker distribution:
  degree [min/max/avg]: 2/4/3.20 -> 2/4/3.20
  -
  Broker 2001 - leader: 55, follower: 112, total: 167
  Broker 2002 - leader: 55, follower: 109, total: 164
  Broker 2003 - leader: 55, follower: 110, total: 165
  Broker 2004 - leader: 2, follower: 7, total: 9
  Broker 2005 - leader: 2, follower: 0, total: 2

Storage free change estimations:
  range: 1725.68GB -> 1725.68GB
  range spread: 76.92% -> 76.92%
  std. deviation: 845.14GB -> 845.14GB
  min-max: 2243.60GB, 3969.27GB -> 2243.60GB, 3969.27GB
  -
  Broker 2001: 2245.10 -> 2245.10 (+0.00GB, 0.00%)
  Broker 2002: 2243.60 -> 2243.60 (+0.00GB, 0.00%)
  Broker 2003: 2243.73 -> 2243.73 (+0.00GB, 0.00%)
  Broker 2004: 3969.27 -> 3969.27 (+0.00GB, 0.00%)
  Broker 2005: 3969.27 -> 3969.27 (+0.00GB, 0.00%)

I'll plan to begin applying this one topic at a time over the coming days (a sketch of a single per-topic step follows).
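
Each per-topic step consumes a small reassignment JSON file in the standard Kafka format and applies it with a replication throttle. A minimal sketch, assuming the kafka CLI wrapper used in the log entries below; the partition/replica assignments shown are illustrative, the real ones come from the topicmappr plan above:

  # write the per-topic plan: one entry per partition with its target broker id list
  cat > mediawiki.httpd.accesslog.json <<'EOF'
  {
    "version": 1,
    "partitions": [
      {"topic": "mediawiki.httpd.accesslog", "partition": 0, "replicas": [1001, 1002, 1004]},
      {"topic": "mediawiki.httpd.accesslog", "partition": 1, "replicas": [1002, 1004, 1005]}
    ]
  }
  EOF

  # apply one topic at a time, throttling inter-broker replication to ~50 MB/s
  kafka reassign-partitions --reassignment-json-file mediawiki.httpd.accesslog.json \
    --execute --throttle 50000000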

Mentioned in SAL (#wikimedia-operations) [2024-03-18T17:56:51Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file mediawiki.httpd.accesslog.json --execute --throttle 50000000 T326419

Mentioned in SAL (#wikimedia-operations) [2024-03-19T13:50:51Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file mediawiki.httpd.accesslog-sampled.json --execute --throttle 50000000 T326419

Mentioned in SAL (#wikimedia-operations) [2024-03-19T13:55:43Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file rsyslog-info.json --execute --throttle 50000000 T326419

Mentioned in SAL (#wikimedia-operations) [2024-03-19T15:12:46Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file rsyslog-notice.json --execute --throttle 50000000 T326419

lmata changed the task status from Open to In Progress. Mar 19 2024, 3:29 PM
lmata moved this task from Up next to In progress on the SRE Observability (FY2023/2024-Q3) board.

Mentioned in SAL (#wikimedia-operations) [2024-03-19T15:35:48Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file rsyslog-warning.json --execute --throttle 50000000 T326419

Mentioned in SAL (#wikimedia-operations) [2024-03-19T15:39:48Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file udp_localhost-err.json --execute --throttle 50000000 T326419

Mentioned in SAL (#wikimedia-operations) [2024-03-19T15:54:17Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file udp_localhost-info.json --execute --throttle 50000000 T326419

Mentioned in SAL (#wikimedia-operations) [2024-03-19T16:45:09Z] <herron> kafka-logging1001:~# kafka reassign-partitions -reassignment-json-file udp_localhost-warning.json --execute --throttle 50000000 T326419

herron claimed this task.

The relocation/rebalance plan outlined in T326419#9639228 has finished running, and a re-run of topicmappr with fresh metrics now shows no proposed moves. Resolving!