
kafka-main100[6789] and kafka-main1010 implementation tracking
Closed, Resolved (Public)

Description

This task has been created by DC Ops for serviceops implementation tracking (per serviceops request when filing racking tasks).

Once racking task T363212 has been completed, this task can be taken by serviceops for implementation. Please note that this task is not monitored by DC Ops; any questions should be directed to the racking task.

The process for replacing brokers is described in https://wikitech.wikimedia.org/wiki/Kafka/Administration#Hardware_replace_a_broker

Before replacing the first broker here, give the Traffic team a heads-up so they can monitor for high latency of cache purges (purged). We saw that happening during T363210 (kafka-main200[6789] and kafka-main2010 implementation tracking), but I think this is no longer an issue with the transfer.py approach.

  • kafka-main1006
  • kafka-main1007
  • kafka-main1008
  • kafka-main1009
  • kafka-main1010

Event Timeline

Restricted Application added a subscriber: Aklapper. Apr 23 2024, 7:32 PM
akosiaris mentioned this in Unknown Object (Task). Jul 9 2024, 9:17 AM
akosiaris mentioned this in Unknown Object (Task).
akosiaris added subscribers: dcausse, akosiaris.

Mistakenly removed @dcausse, re-adding.

jijiki triaged this task as Medium priority. Oct 17 2024, 1:00 PM
jijiki moved this task from Incoming to Doing on the serviceops board.
jijiki changed the task status from Stalled to In Progress. Oct 18 2024, 12:04 PM

Change #1089822 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] Add replacement kafka nodes to kafka_brokers_main on eqiad

https://gerrit.wikimedia.org/r/1089822

Change #1089822 merged by Effie Mouzeli:

[operations/puppet@production] Add replacement kafka nodes to kafka_brokers_main on eqiad

https://gerrit.wikimedia.org/r/1089822

Mentioned in SAL (#wikimedia-operations) [2024-11-20T10:22:46Z] <effie> removing leadership from kafka-main1001 - T363214

Icinga downtime and Alertmanager silence (ID=1cb13fea-2d29-4686-a824-95c9972431a0) set by jiji@cumin1002 for 1 day, 0:00:00 on 2 host(s) and their services with reason: Hardware refresh

kafka-main[1001,1006].eqiad.wmnet

Change #1093330 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] kafka-main: Replace kafka-main1001 with kafka-main1006

https://gerrit.wikimedia.org/r/1093330

Change #1093330 merged by Effie Mouzeli:

[operations/puppet@production] kafka-main: Replace kafka-main1001 with kafka-main1006

https://gerrit.wikimedia.org/r/1093330

Change #1093337 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] Update various kafka-main connection strings for kafka-main1006

https://gerrit.wikimedia.org/r/1093337

Change #1093337 merged by jenkins-bot:

[operations/deployment-charts@master] Update various kafka-main connection strings for kafka-main1006

https://gerrit.wikimedia.org/r/1093337

Icinga downtime and Alertmanager silence (ID=9f9d188a-551c-412a-8d68-ca67db96a150) set by jynus@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Per claime's recommendation

kafka-main1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-11-27T12:38:13Z] <effie> start replacing kafka-main1002 with kafka-main1007 - T363214

Icinga downtime and Alertmanager silence (ID=2081928b-ccb6-4f65-91fa-39f067b89fb2) set by jiji@cumin1002 for 1 day, 0:00:00 on 2 host(s) and their services with reason: Hardware refresh

kafka-main[1002,1007].eqiad.wmnet

Change #1098548 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] kafka-main: Replace kafka-main1002 with kafka-main1007

https://gerrit.wikimedia.org/r/1098548

Change #1098548 merged by Effie Mouzeli:

[operations/puppet@production] kafka-main: Replace kafka-main1002 with kafka-main1007

https://gerrit.wikimedia.org/r/1098548

Mentioned in SAL (#wikimedia-operations) [2024-11-27T16:12:48Z] <effie> roll restarting kafka-main brokers - T363214

Change #1098559 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] Update various kafka-main connection strings for kafka-main1007

https://gerrit.wikimedia.org/r/1098559

Change #1098559 merged by jenkins-bot:

[operations/deployment-charts@master] Update various kafka-main connection strings for kafka-main1007

https://gerrit.wikimedia.org/r/1098559

Mentioned in SAL (#wikimedia-operations) [2024-12-02T13:31:37Z] <effie> replacing kafka-main1003 in production with kafka-main1008 - T363214

Icinga downtime and Alertmanager silence (ID=1b8e1077-6d61-4aa8-9dd3-51831260ac7d) set by jiji@cumin1002 for 1 day, 0:00:00 on 2 host(s) and their services with reason: Hardware refresh

kafka-main[1002,1007].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=e69f875e-bd0c-45ff-b223-fbcb65b96846) set by jiji@cumin1002 for 1 day, 0:00:00 on 2 host(s) and their services with reason: Hardware refresh

kafka-main[1003,1008].eqiad.wmnet

Change #1099707 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] kafka-main: Replace kafka-main1003 with kafka-main1008

https://gerrit.wikimedia.org/r/1099707

Change #1099707 merged by Effie Mouzeli:

[operations/puppet@production] kafka-main: Replace kafka-main1003 with kafka-main1008

https://gerrit.wikimedia.org/r/1099707

Change #1099763 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] Update various kafka-main connection strings for kafka-main1008

https://gerrit.wikimedia.org/r/1099763

Change #1099763 merged by jenkins-bot:

[operations/deployment-charts@master] Update various kafka-main connection strings for kafka-main1008

https://gerrit.wikimedia.org/r/1099763

Icinga downtime and Alertmanager silence (ID=1e93a56a-8513-43f0-ac1d-4277501c6e39) set by jiji@cumin1002 for 1 day, 0:00:00 on 2 host(s) and their services with reason: Hardware refresh

kafka-main[1004,1009].eqiad.wmnet

Change #1100447 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] kafka-main: Replace kafka-main1004 with kafka-main1009

https://gerrit.wikimedia.org/r/1100447

Change #1100452 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] Update various kafka-main connection strings for kafka-main1009 Replacing kafka-main1004

https://gerrit.wikimedia.org/r/1100452

Change #1100447 merged by Effie Mouzeli:

[operations/puppet@production] kafka-main: Replace kafka-main1004 with kafka-main1009

https://gerrit.wikimedia.org/r/1100447

Change #1100452 merged by jenkins-bot:

[operations/deployment-charts@master] Update various kafka-main connection strings for kafka-main1009 Replacing kafka-main1004

https://gerrit.wikimedia.org/r/1100452

Change #1100806 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] kafka-main: Replace kafka-main1005 with kafka-main1010

https://gerrit.wikimedia.org/r/1100806

Change #1100807 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] common.yaml: remove firewall rules for kafka-main100[1-5]

https://gerrit.wikimedia.org/r/1100807

Icinga downtime and Alertmanager silence (ID=96be0251-35a2-4695-9aaa-8a46e08acd28) set by jiji@cumin1002 for 1 day, 0:00:00 on 2 host(s) and their services with reason: Hardware refresh

kafka-main[1005,1010].eqiad.wmnet

Change #1100827 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] Update various kafka-main connection strings for kafka-main1010 Replacing kafka-main1005

https://gerrit.wikimedia.org/r/1100827

Change #1100806 merged by Effie Mouzeli:

[operations/puppet@production] kafka-main: Replace kafka-main1005 with kafka-main1010

https://gerrit.wikimedia.org/r/1100806

Change #1100827 merged by jenkins-bot:

[operations/deployment-charts@master] Update various kafka-main connection strings for kafka-main1010 Replacing kafka-main1005

https://gerrit.wikimedia.org/r/1100827

All servers have been replaced with the newer ones, and the decommission task has been filed: T381593. I also made some edits to the very helpful documentation Janis wrote.
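
For reference, the old-to-new pairing implied by the patch titles in this task can be written down as a small Python sketch. The dictionary and the helper name `downtime_target` are illustrative only and are not part of any existing tooling; the sketch just rebuilds the bracketed host expressions that appear in the downtime entries above.

```python
# Old -> new broker pairs, read off the "Replace X with Y" patch titles
# in this task. This mapping is a summary, not generated by any tool.
REPLACEMENTS = {
    "kafka-main1001": "kafka-main1006",
    "kafka-main1002": "kafka-main1007",
    "kafka-main1003": "kafka-main1008",
    "kafka-main1004": "kafka-main1009",
    "kafka-main1005": "kafka-main1010",
}

PREFIX = "kafka-main"
DOMAIN = "eqiad.wmnet"


def downtime_target(old: str, new: str, domain: str = DOMAIN) -> str:
    """Hypothetical helper: build a host expression such as
    kafka-main[1001,1006].eqiad.wmnet for one old/new pair."""
    old_id = old[len(PREFIX):]
    new_id = new[len(PREFIX):]
    return f"{PREFIX}[{old_id},{new_id}].{domain}"


# One expression per replacement, in the order the brokers were swapped.
targets = [downtime_target(old, new) for old, new in REPLACEMENTS.items()]

if __name__ == "__main__":
    for target in targets:
        print(target)
```

Each resulting expression covers both the outgoing and the incoming host, matching the "2 host(s)" downtimes set before each swap.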