
toolforge kubernetes: create roll-reboot cookbook
Closed, Resolved, Public

Description

A cookbook to roll-reboot all Kubernetes nodes would be very handy in some maintenance situations.

Also, a small & fun coding project.

Something similar to this:

#!/bin/bash

SLEEP="30"

HOSTS="
tools-k8s-worker-30
tools-k8s-worker-31
tools-k8s-worker-32
tools-k8s-worker-33
tools-k8s-worker-34
tools-k8s-worker-35
tools-k8s-worker-36
tools-k8s-worker-37
tools-k8s-worker-38
tools-k8s-worker-39
tools-k8s-worker-40
tools-k8s-worker-41
tools-k8s-worker-42
tools-k8s-worker-43
tools-k8s-worker-44
tools-k8s-worker-45
tools-k8s-worker-46
tools-k8s-worker-47
tools-k8s-worker-48
tools-k8s-worker-49
tools-k8s-worker-50
tools-k8s-worker-51
tools-k8s-worker-52
tools-k8s-worker-53
tools-k8s-worker-54
tools-k8s-worker-55
tools-k8s-worker-56
tools-k8s-worker-57
tools-k8s-worker-58
tools-k8s-worker-59
tools-k8s-worker-60
tools-k8s-worker-61
tools-k8s-worker-62
tools-k8s-worker-64
tools-k8s-worker-65
tools-k8s-worker-66
tools-k8s-worker-67
tools-k8s-worker-68
tools-k8s-worker-69
tools-k8s-worker-70
tools-k8s-worker-71
tools-k8s-worker-72
tools-k8s-worker-73
tools-k8s-worker-74
tools-k8s-worker-75
tools-k8s-worker-76
tools-k8s-worker-77
tools-k8s-worker-78
tools-k8s-worker-79
tools-k8s-worker-80
tools-k8s-worker-81
tools-k8s-worker-82
"

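for worker in $HOSTS ; do
    echo "Rebooting $worker"
    # The connection usually drops during the reboot, so a non-zero exit here is expected.
    ssh "$worker" "sudo reboot"
    sleep "$SLEEP"

    # Poll until the node accepts SSH connections again.
    while true ; do
        echo "Checking $worker"
        ssh -o ConnectTimeout=10 "$worker" uptime && break
        echo "waiting"
        # Pause between attempts so a refused connection does not turn into a busy loop.
        sleep 5
    done

done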

Event Timeline

Change 904785 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/wmcs-cookbooks@main] toolforge: k8s: add reboot cookbook

https://gerrit.wikimedia.org/r/904785

Change 905166 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/wmcs-cookbooks@main] wmcs_libs: k8s: factorize wait for node to drain

https://gerrit.wikimedia.org/r/905166

aborrero triaged this task as Medium priority. Apr 3 2023, 11:59 AM

Change 905166 merged by jenkins-bot:

[cloud/wmcs-cookbooks@main] wmcs_libs: k8s: factorize wait for node to drain

https://gerrit.wikimedia.org/r/905166

Change 904785 merged by jenkins-bot:

[cloud/wmcs-cookbooks@main] toolforge: k8s: add reboot cookbook

https://gerrit.wikimedia.org/r/904785

After using this cookbook for real yesterday, some potential improvements were detected:

  • add an option to make the drain faster / more aggressive, perhaps with a combination of kubectl drain options such as --grace-period=0, --skip-wait-for-delete-timeout=1 and --timeout=10s
  • add an option to force reboot if the drain times out
  • we need a way to restart the cookbook at a certain node. Not sure yet what the implementation would look like. Perhaps accept a node list from a YAML file that we can maintain by hand on the fly
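
A rough shell sketch of these three ideas, written in the spirit of the script in the description rather than taken from the cookbook code. The drain flags are the ones named above plus the standard kubectl drain options --ignore-daemonsets, --delete-emptydir-data and --force. The function names, the KUBECTL and REBOOT_CMD override variables, and the "start at this node" helper are all hypothetical, and the flag values would likely need tuning.

```shell
#!/bin/bash
# Hypothetical helpers; not the wmcs-cookbooks implementation.

KUBECTL="${KUBECTL:-kubectl}"   # assumption: overridable for dry runs

# Drain a node aggressively. Returns kubectl's exit status, which is
# non-zero when the drain fails or runs into --timeout.
hard_drain() {
    local node="$1"
    $KUBECTL drain "$node" \
        --ignore-daemonsets \
        --delete-emptydir-data \
        --force \
        --grace-period=0 \
        --skip-wait-for-delete-timeout=1 \
        --timeout=10s
}

# Drain, then reboot. If the drain fails, reboot only when the second
# argument is "yes"; otherwise leave the node alone and return 1.
drain_and_reboot() {
    local node="$1" force="${2:-no}"
    if ! hard_drain "$node"; then
        if [ "$force" != "yes" ]; then
            echo "drain of $node failed, skipping reboot" >&2
            return 1
        fi
        echo "drain of $node failed, rebooting anyway" >&2
    fi
    # With real ssh the exit status is often non-zero because the
    # connection drops while the node goes down.
    ${REBOOT_CMD:-ssh} "$node" "sudo reboot"
}

# Print the node names read from stdin, starting at the given node
# (inclusive). With an empty start node, print all of them.
nodes_from() {
    local start="$1" node started="no"
    [ -z "$start" ] && started="yes"
    while read -r node; do
        [ -z "$node" ] && continue
        [ "$node" = "$start" ] && started="yes"
        [ "$started" = "yes" ] && echo "$node"
    done
    return 0
}
```

A resumed run could then look something like `for worker in $(echo "$HOSTS" | nodes_from tools-k8s-worker-45) ; do drain_and_reboot "$worker" yes ; done`, followed by the same wait-for-uptime loop as in the description. A hand-maintained node list file would serve the same purpose as the start-node argument.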

In addition to those, I have a couple more feature requests:

  • add ability to filter for certain types of nodes, for example normal workers only
  • add ability to work on multiple nodes at once to speed things up
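
Both requests could be sketched in shell roughly as follows. This assumes the node type can be read from the hostname prefix, as with the tools-k8s-worker-NN names in the description; filter_nodes and run_in_batches are hypothetical names, and the real cookbook may do this quite differently.

```shell
#!/bin/bash
# Hypothetical helpers; not the wmcs-cookbooks implementation.

# Print only the node names from stdin that start with the given prefix.
filter_nodes() {
    local prefix="$1" node
    while read -r node; do
        case "$node" in
            "$prefix"*) echo "$node" ;;
        esac
    done
}

# Run "cmd node" for every node name on stdin, at most batch_size at a
# time. Each batch must finish completely before the next one starts.
# Exit statuses of the individual commands are not collected here.
run_in_batches() {
    local batch_size="$1" cmd="$2" node running=0
    while read -r node; do
        [ -z "$node" ] && continue
        # Detach stdin so the command cannot swallow the node list.
        "$cmd" "$node" < /dev/null &
        running=$((running + 1))
        if [ "$running" -ge "$batch_size" ]; then
            wait
            running=0
        fi
    done
    # Wait for the last, possibly partial, batch.
    wait
}
```

For example, `echo "$HOSTS" | filter_nodes tools-k8s-worker- | run_in_batches 3 reboot_one` would handle three workers at a time, where reboot_one stands for whatever per-node function does the drain, reboot and wait. The batch size trades speed against how much cluster capacity is offline at once.
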
aborrero changed the task status from Open to In Progress. Apr 5 2023, 11:31 AM
aborrero claimed this task.
aborrero moved this task from Backlog to In Progress on the Toolforge board.
aborrero moved this task from Inbox to Soon! on the cloud-services-team board.

Change 905990 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/wmcs-cookbooks@main] wmcs_libs/k8s/kubernetes: do a harder drain

https://gerrit.wikimedia.org/r/905990

Change 905990 merged by jenkins-bot:

[cloud/wmcs-cookbooks@main] wmcs_libs/k8s/kubernetes: do a harder drain

https://gerrit.wikimedia.org/r/905990

Change 906059 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/wmcs-cookbooks@main] toolforge: k8s: reboot: catch timeouts when draining

https://gerrit.wikimedia.org/r/906059

Change 906059 abandoned by Arturo Borrero Gonzalez:

[cloud/wmcs-cookbooks@main] toolforge: k8s: reboot: catch timeouts when draining, and ignore them

Reason:

https://gerrit.wikimedia.org/r/906059

aborrero changed the task status from In Progress to Stalled. Jun 7 2023, 10:40 AM
aborrero removed aborrero as the assignee of this task.
aborrero moved this task from In Progress to Feature requests on the Toolforge board.
aborrero removed a project: User-aborrero.

Not working on this at the moment.

aborrero claimed this task.

We have a working cookbook now: wmcs.toolforge.k8s.reboot.