cookbooks: sre.hosts.reboot-single update to support disabled puppet
Closed, Resolved · Public

Description

If puppet is disabled on the host when someone runs the sre.hosts.reboot-single cookbook, the cookbook ends up in a loop, constantly looking for a puppet run more recent than the reboot. The cookbook should gain a switch to enable puppet, and it should exit early if puppet is disabled.
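The requested behaviour could be sketched roughly as below. This is only an illustration under stated assumptions, not the cookbook's actual code: `FakePuppetHost`, `preflight`, `PuppetDisabledError` and the `--enable-puppet` flag name are all hypothetical and do not come from Spicerack or from the merged patches.

```python
import argparse
from dataclasses import dataclass


class PuppetDisabledError(RuntimeError):
    """Raised when puppet is disabled and re-enabling was not requested."""


@dataclass
class FakePuppetHost:
    """Hypothetical stand-in for a host's puppet agent state (not a Spicerack class)."""

    disabled: bool = False
    disabled_reason: str = ""

    def enable(self) -> None:
        # Clear the disabled state so a post-reboot puppet run can happen.
        self.disabled = False
        self.disabled_reason = ""


def parse_args(argv):
    """Parse the sketch's arguments; the flag name is an assumption."""
    parser = argparse.ArgumentParser(prog="reboot-single-sketch")
    parser.add_argument(
        "--enable-puppet",
        action="store_true",
        help="Re-enable puppet if it is found disabled before the reboot.",
    )
    parser.add_argument("host")
    return parser.parse_args(argv)


def preflight(puppet: FakePuppetHost, enable_puppet: bool) -> None:
    """Fail fast instead of waiting forever for a puppet run that cannot happen."""
    if not puppet.disabled:
        return
    if enable_puppet:
        puppet.enable()
        return
    raise PuppetDisabledError(
        f"puppet is disabled ({puppet.disabled_reason or 'no reason given'}); "
        "re-run with --enable-puppet or enable it manually"
    )
```

Running the check before the reboot means the operator sees the problem immediately, rather than after the host is back up and the cookbook is already polling for a new puppet run.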

Event Timeline

jbond triaged this task as Medium priority. Dec 14 2022, 12:26 PM
jbond created this task.
Restricted Application added a subscriber: Aklapper.

Change 868077 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/cookbooks@master] sre.hosts.reboot-single: add ability to enable host on reboot

https://gerrit.wikimedia.org/r/868077

Change 868430 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/cookbooks@master] sre.hosts.reboot-single: Simplify icinga logic

https://gerrit.wikimedia.org/r/868430

Change 868430 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.reboot-single: Simplify icinga logic

https://gerrit.wikimedia.org/r/868430

Change 868077 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.reboot-single: add ability to enable host on reboot

https://gerrit.wikimedia.org/r/868077

The reboot-single cookbook has now been updated.

RobH added subscribers: elukey, RobH.

Please note we've run into this issue again today:

During sprint week, while reimaging hosts, Luca was using the sre.hardware.upgrade-firmware cookbook on kafka-main1005 and ran into a problem: if the puppet agent is disabled on a host (say, because it was taken offline for firmware updates and a reimage), the script fails to reboot it because of that state.

Since this is a normal state for reimaging, this should likely be fixed. Example of the error: P45904

Current Workaround:

  • fire the firmware update script and see the reboot call fail because puppet is disabled on the host
  • manually reboot the host via the cli in a different terminal window, keeping the script running
  • monitor the script for a successful firmware update

Adding more context: I needed to stop kafka gracefully on the node, and I disabled puppet so it wouldn't bring the daemon back up. Rebooting a kafka node without a graceful shutdown of the kafka daemon is fine, but it is better to avoid it if we can. Lemme know what's best :)

Change 902009 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.reboot-single.py: replace "pool" with "depool"

https://gerrit.wikimedia.org/r/902009

Change 902009 merged by Elukey:

[operations/cookbooks@master] sre.hosts.reboot-single.py: replace "pool" with "depool"

https://gerrit.wikimedia.org/r/902009

Change 902013 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.reboot-single: set self.depool in any case

https://gerrit.wikimedia.org/r/902013

Change 902013 merged by Elukey:

[operations/cookbooks@master] sre.hosts.reboot-single: set self.depool in any case

https://gerrit.wikimedia.org/r/902013

Change 902015 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.reboot-single: fix corner case when puppet is disabled

https://gerrit.wikimedia.org/r/902015

Change 902026 had a related patch set uploaded (by Jbond; author: Jbond):

[operations/cookbooks@master] Revert "sre.hosts.reboot-single: set self.depool in any case"

https://gerrit.wikimedia.org/r/902026

Change 902015 abandoned by Elukey:

[operations/cookbooks@master] sre.hosts.reboot-single: fix corner case when puppet is disabled

Reason:

https://gerrit.wikimedia.org/r/902015

Change 902026 merged by jenkins-bot:

[operations/cookbooks@master] Revert "sre.hosts.reboot-single: set self.depool in any case"

https://gerrit.wikimedia.org/r/902026