
sre.kafka.reboot-workers fails on logging cluster with failed to execute command 'systemctl stop kafka-mirror':
Closed, Resolved, Public

Description

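The kafka-logging clusters don't use MirrorMaker, but sre.kafka.reboot-workers currently assumes the kafka-mirror service is present on every host. On these clusters the 'systemctl stop kafka-mirror' step fails because the unit is not loaded, which aborts the whole run: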

Stopping kafka processes on host kafka-logging2001.codfw.wmnet
----- OUTPUT of 'systemctl stop kafka-mirror' -----
Failed to stop kafka-mirror.service: Unit kafka-mirror.service not loaded.
================
PASS |                                                                                                                                                                                                                      |   0% (0/1) [00:00<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.30hosts/s]
100.0% (1/1) of nodes failed to execute command 'systemctl stop kafka-mirror': kafka-logging2001.codfw.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'systemctl stop kafka-mirror'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Enabling Puppet with reason "Reboot kafka nodes - herron@cumin1001" on 1 hosts: kafka-logging2001.codfw.wmnet
----- OUTPUT of 'enable-puppet "R...erron@cumin1001"' -----
================
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:02<00:00,  2.41s/hosts]
FAIL |                                                                                                                                                                                                                      |   0% (0/1) [00:02<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'enable-puppet "R...erron@cumin1001"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Exception raised while executing cookbook sre.kafka.reboot-workers:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/kafka/reboot-workers.py", line 117, in run
    self.reboot_kafka_node(host)
  File "/srv/deployment/spicerack/cookbooks/sre/kafka/reboot-workers.py", line 79, in reboot_kafka_node
    node.run_sync('systemctl stop kafka-mirror')
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 520, in run_sync
    return self._execute(
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 720, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed")
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook sre.kafka.reboot-workers (exit_code=99) for Kafka logging-codfw cluster: Reboot kafka nodes
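
For illustration only, here is a minimal sketch of one way a stop step like this could tolerate hosts without the unit. It does not use the spicerack API: `run` is a hypothetical callable that executes a shell command on the host and returns its exit code, and `stop_units_if_present` is a made-up helper name. It also assumes that `systemctl cat <unit>` exits non-zero when the unit does not exist.

```python
from typing import Callable, Iterable, List


def stop_units_if_present(run: Callable[[str], int], units: Iterable[str]) -> List[str]:
    """Stop each systemd unit, skipping units that are not installed.

    `run` is a hypothetical stand-in for whatever executes a command on the
    target host and returns its exit code (it is NOT the spicerack API).
    """
    stopped = []
    for unit in units:
        # Assumption: `systemctl cat` exits non-zero for an unknown unit.
        if run(f"systemctl cat {unit}") != 0:
            continue  # unit not present on this host, nothing to stop
        if run(f"systemctl stop {unit}") != 0:
            raise RuntimeError(f"failed to stop {unit}")
        stopped.append(unit)
    return stopped
```

With a helper along these lines, a host such as kafka-logging2001 would skip kafka-mirror.service and continue with the remaining steps. Note that the fix that was eventually merged (change 778517) took a different route and removed the systemctl stop calls from the cookbook.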

Event Timeline

herron renamed this task from "sre.kafka.reboot-workers fails on logging cluster with 100.0% (1/1) of nodes failed to execute command 'systemctl stop kafka-mirror':" to "sre.kafka.reboot-workers fails on logging cluster with failed to execute command 'systemctl stop kafka-mirror':". Apr 7 2022, 5:51 PM
herron triaged this task as Medium priority.
herron added projects: SRE, SRE Observability.

Change 778325 had a related patch set uploaded (by Herron; author: Herron):

[operations/cookbooks@master] sre.kafka.reboot-workers: add --skip-mirrormaker option

https://gerrit.wikimedia.org/r/778325

Change 778517 had a related patch set uploaded (by Herron; author: Herron):

[operations/cookbooks@master] sre.kafka.reboot-workers: remove systemctl stop calls

https://gerrit.wikimedia.org/r/778517

Change 779086 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] kafka-mirror: startup after kafka.service, shutdown before kafka.service

https://gerrit.wikimedia.org/r/779086

Change 778325 abandoned by Herron:

[operations/cookbooks@master] sre.kafka.reboot-workers: add --skip-mirrormaker option

Reason:

abandoning in favor of I0b63781760e7

https://gerrit.wikimedia.org/r/778325

Change 778517 merged by jenkins-bot:

[operations/cookbooks@master] sre.kafka.reboot-workers: remove systemctl stop calls

https://gerrit.wikimedia.org/r/778517

Change 779086 abandoned by Herron:

[operations/puppet@production] kafka-mirror: startup after kafka.service, shutdown before kafka.service

Reason:

https://gerrit.wikimedia.org/r/779086

herron claimed this task.

A round of kafka-logging rolling reboots was completed today using sre.kafka.reboot-workers. Resolving!