
Create a cookbook to perform a rolling reboot of a Kubernetes cluster
Closed, Resolved (Public)

Description

We want to be able to reboot all workers in a Kubernetes cluster automatically.

The cookbook is basically done, but some issues remain:

Failed to evict pod Pod(istio-system/istiod-579dbd8c88-psffr) from node Node(kubestage2002.codfw.wmnet): (429)
Reason: Too Many Requests
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Wed, 06 Jul 2022 12:40:05 GMT', 'Content-Length': '320'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Cannot evict pod as it would violate the pod's disruption budget.","reason":"TooManyRequests","details":{"causes":[{"reason":"DisruptionBudget","message":"The disruption budget istiod needs 1 healthy pods and has 1 currently"}]},"code":429}
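
The 429 above means the eviction was refused because it would violate the pod's disruption budget, so the request can succeed later once another istiod pod is healthy. A minimal sketch of retrying evictions on HTTP 429 follows. It is an illustration only and not the spicerack implementation: `ApiError`, `evict_with_retry`, and the `tries`, `delay` and `sleep` parameters are hypothetical names, and the linear backoff is an assumption.

```python
import time
from typing import Callable


class ApiError(Exception):
    """Hypothetical stand-in for an API client error carrying an HTTP status."""

    def __init__(self, status: int, reason: str = "") -> None:
        super().__init__(f"({status}) {reason}")
        self.status = status


def evict_with_retry(
    evict: Callable[[], None],
    tries: int = 5,
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call evict() until it succeeds, retrying only on HTTP 429.

    Returns the number of attempts made. Any non-429 error is re-raised
    immediately, and the last 429 is re-raised once all tries are used up.
    """
    if tries < 1:
        raise ValueError("tries must be >= 1")
    for attempt in range(1, tries + 1):
        try:
            evict()
            return attempt
        except ApiError as exc:
            if exc.status != 429 or attempt == tries:
                raise
            # Linear backoff (assumed): wait a little longer after each refusal
            sleep(delay * attempt)
    raise AssertionError("unreachable")
```

Here `evict` would wrap the actual eviction call for a single pod. Retrying only on 429 keeps other failures (for example 403 or 404) visible right away instead of hiding them behind the backoff.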

Event Timeline

Joe triaged this task as Medium priority. Aug 18 2020, 8:44 AM

@MoritzMuehlenhoff wrote some generic code to do rolling reboots for groups of hosts that could probably be utilized here: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/625597

Change 789680 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] Add a cookbook for rolling reboot of k8s clusters

https://gerrit.wikimedia.org/r/789680

Change 790662 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Align cumin aliases for wikikube clusters

https://gerrit.wikimedia.org/r/790662

Change 790662 merged by JMeybohm:

[operations/puppet@production] Align cumin aliases for wikikube clusters

https://gerrit.wikimedia.org/r/790662

Change 789680 merged by jenkins-bot:

[operations/cookbooks@master] Add a cookbook for rolling reboot of k8s clusters

https://gerrit.wikimedia.org/r/789680

Change 806287 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] sre.k8s.reboot-nodes: Fix errors identified during dry-run

https://gerrit.wikimedia.org/r/806287

Change 806288 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] sre.k8s.reboot-node: Dynamically adjust batchsize

https://gerrit.wikimedia.org/r/806288

Change 806287 merged by jenkins-bot:

[operations/cookbooks@master] sre.k8s.reboot-nodes: Fix errors identified during dry-run

https://gerrit.wikimedia.org/r/806287

Change 806288 merged by jenkins-bot:

[operations/cookbooks@master] sre.k8s.reboot-node: Dynamically adjust batchsize

https://gerrit.wikimedia.org/r/806288

Change 807504 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s.reboot-nodes: Fix call to super()._batchsize

https://gerrit.wikimedia.org/r/807504

Change 807504 merged by jenkins-bot:

[operations/cookbooks@master] k8s.reboot-nodes: Fix call to super()._batchsize

https://gerrit.wikimedia.org/r/807504

Change 811331 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/software/spicerack@master] k8s: Retry checks for expected pods on drain

https://gerrit.wikimedia.org/r/811331

Change 811336 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/software/spicerack@master] k8s: Add KubernetesNode.taints propertry

https://gerrit.wikimedia.org/r/811336

Change 811983 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/software/spicerack@master] k8s: Retry pod evictions on HTTP 429 from API server

https://gerrit.wikimedia.org/r/811983

Change 812325 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s/reboot-nodes: Error if nodes are cordoned

https://gerrit.wikimedia.org/r/812325

Change 811331 merged by jenkins-bot:

[operations/software/spicerack@master] k8s: Retry checks for expected pods on drain

https://gerrit.wikimedia.org/r/811331

Change 811336 merged by jenkins-bot:

[operations/software/spicerack@master] k8s: Add KubernetesNode.taints propertry

https://gerrit.wikimedia.org/r/811336

Change 811983 merged by jenkins-bot:

[operations/software/spicerack@master] k8s: Retry pod evictions on HTTP 429 from API server

https://gerrit.wikimedia.org/r/811983

Change 812325 merged by jenkins-bot:

[operations/cookbooks@master] k8s/reboot-nodes: Error if nodes are cordoned

https://gerrit.wikimedia.org/r/812325

Change 815757 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/software/spicerack@master] k8s: Adapt retry parameters to reality

https://gerrit.wikimedia.org/r/815757

Change 815757 merged by jenkins-bot:

[operations/software/spicerack@master] k8s: Adapt retry parameters to reality

https://gerrit.wikimedia.org/r/815757

Mentioned in SAL (#wikimedia-operations) [2022-08-18T09:44:44Z] <jayme> dnsdisc depooling codfw for services running in kubernetes cluster (for 30-60min due to T310483, T260661)

Change 824491 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] sre.k8s.reboot-nodes: Don't sleep that long between batches

https://gerrit.wikimedia.org/r/824491

The reboot of the staging clusters and codfw (batchsize 1, took ~3.25 hours) went smoothly, without any alerts apart from the expected temporary BGP errors. Resolving this.

Change 824491 merged by jenkins-bot:

[operations/cookbooks@master] sre.k8s.reboot-nodes: Don't sleep that long between batches

https://gerrit.wikimedia.org/r/824491