Create a cookbook to perform a rolling reboot of a kubernetes cluster
Closed, Resolved · Public

Assigned To

	JMeybohm

Authored By

	Joe
	Aug 18 2020, 8:44 AM

Description

We want to be able to reboot all workers in a kubernetes cluster automatically.
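
As a rough illustration of what such a cookbook has to do, the sketch below walks a list of workers in batches and cordons, drains, reboots and uncordons each batch before moving on to the next. It is not the code of sre.k8s.reboot-nodes: `batches`, `rolling_reboot` and the four callbacks are hypothetical names chosen for this example, and the callbacks stand in for whatever actually talks to the Kubernetes API and the hosts.

```python
from typing import Callable, Iterator, List, Sequence

# Hypothetical callback type: takes a node name and performs one action on it.
NodeAction = Callable[[str], None]


def batches(nodes: Sequence[str], batchsize: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batchsize` nodes."""
    if batchsize < 1:
        raise ValueError("batchsize must be at least 1")
    for start in range(0, len(nodes), batchsize):
        yield list(nodes[start:start + batchsize])


def rolling_reboot(
    nodes: Sequence[str],
    batchsize: int,
    cordon: NodeAction,
    drain: NodeAction,
    reboot: NodeAction,
    uncordon: NodeAction,
) -> List[str]:
    """Reboot all nodes batch by batch and return them in processing order.

    Each step is applied to the whole batch before the next step starts,
    so only one batch of workers is out of service at any time.
    """
    done: List[str] = []
    for batch in batches(nodes, batchsize):
        for step in (cordon, drain, reboot, uncordon):
            for node in batch:
                step(node)
        done.extend(batch)
    return done
```

With a batch size of 1 this degenerates to handling one worker at a time, which is the slowest but least disruptive setting.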

The cookbook is basically done, but there are some issues:

Pods might take slightly longer than the termination grace period to actually be removed (addressed in https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/811331)
Pods with a PodDisruptionBudget regularly fail eviction:

Failed to evict pod Pod(istio-system/istiod-579dbd8c88-psffr) from node Node(kubestage2002.codfw.wmnet): (429)
Reason: Too Many Requests
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Wed, 06 Jul 2022 12:40:05 GMT', 'Content-Length': '320'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Cannot evict pod as it would violate the pod's disruption budget.","reason":"TooManyRequests","details":{"causes":[{"reason":"DisruptionBudget","message":"The disruption budget istiod needs 1 healthy pods and has 1 currently"}]},"code":429}
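
The 429 response above is transient: the eviction can succeed once another replica of the pod is healthy again, so retrying after a pause is a reasonable reaction. The following is a minimal sketch of that idea, not the spicerack implementation. `ApiError` is a hypothetical stand-in for the API client's exception (assumed here to expose the HTTP status as `.status`), and `evict_with_retry` with its `tries`, `delay` and `backoff` parameters is likewise made up for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ApiError(Exception):
    """Hypothetical stand-in for an API client error carrying an HTTP status."""

    def __init__(self, status: int, reason: str = "") -> None:
        super().__init__(f"({status}) {reason}")
        self.status = status


def evict_with_retry(
    evict: Callable[[], T],
    tries: int = 5,
    delay: float = 1.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `evict()` and retry only when the API server answers with HTTP 429.

    The wait between attempts grows by the factor `backoff`. Any other error,
    or a 429 on the last attempt, is re-raised to the caller.
    """
    if tries < 1:
        raise ValueError("tries must be at least 1")
    for attempt in range(1, tries + 1):
        try:
            return evict()
        except ApiError as err:
            if err.status != 429 or attempt == tries:
                raise
            sleep(delay)
            delay *= backoff
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. The same loop shape also fits the first issue in the list, where the check for remaining pods is repeated until the pods are actually gone.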

Details

Subject	Repo	Branch	Lines +/-
sre.k8s.reboot-nodes: Don't sleep that long between batches	operations/cookbooks	master	+7 -4
k8s: Adapt retry parameters to reality	operations/software/spicerack	master	+13 -10
k8s/reboot-nodes: Error if nodes are cordoned	operations/cookbooks	master	+17 -5
k8s: Retry pod evictions on HTTP 429 from API server	operations/software/spicerack	master	+45 -4
k8s: Add KubernetesNode.taints propertry	operations/software/spicerack	master	+27 -2
k8s: Retry checks for expected pods on drain	operations/software/spicerack	master	+23 -8
k8s.reboot-nodes: Fix call to super()._batchsize	operations/cookbooks	master	+1 -1
sre.k8s.reboot-node: Dynamically adjust batchsize	operations/cookbooks	master	+33 -5
sre.k8s.reboot-nodes: Fix errors identified during dry-run	operations/cookbooks	master	+16 -11
Add a cookbook for rolling reboot of k8s clusters	operations/cookbooks	master	+207 -0
Align cumin aliases for wikikube clusters	operations/puppet	production	+14 -5

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T203943 Spicerack cookbooks TODO list
		Resolved		JMeybohm	T260661 Create a cookbook to perform a rolling reboot of a kubernetes cluster

Event Timeline

Joe created this task. Aug 18 2020, 8:44 AM

Restricted Application added a subscriber: Aklapper. Aug 18 2020, 8:44 AM

Joe triaged this task as Medium priority. Aug 18 2020, 8:44 AM

JMeybohm mentioned this in T262527: Update to kernel 4.19 on kubernetes nodes. Sep 21 2020, 3:55 PM

jijiki added a project: User-jijiki. Oct 5 2020, 10:58 AM

Volans added a parent task: T203943: Spicerack cookbooks TODO list. Oct 5 2020, 11:00 AM

• ema subscribed. Oct 5 2020, 11:02 AM

MoritzMuehlenhoff subscribed. Oct 5 2020, 1:23 PM

@MoritzMuehlenhoff wrote some generic code to do rolling reboots for groups of hosts that could probably be utilized here: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/625597

jijiki moved this task from Incoming🐅 to Next up 🥌 on the User-jijiki board. Oct 20 2020, 10:50 AM

jijiki moved this task from Next up 🥌 to Q1 2021 on the User-jijiki board. Dec 15 2020, 12:06 PM

elukey subscribed. Jan 13 2021, 9:58 AM

JMeybohm added a project: Prod-Kubernetes. Mar 17 2021, 4:54 PM

Aklapper added a project: Infrastructure-Foundations. Jun 21 2021, 8:59 PM

jijiki moved this task from Q1 2021 to Incoming🐅 on the User-jijiki board. Jul 12 2021, 4:51 PM

JMeybohm mentioned this in T300879: Add a kubernetes module to spicerack. Feb 3 2022, 5:13 PM

Change 789680 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] Add a cookbook for rolling reboot of k8s clusters

https://gerrit.wikimedia.org/r/789680

gerritbot added a project: Patch-For-Review. May 6 2022, 12:51 PM

JMeybohm claimed this task. May 6 2022, 12:51 PM

Change 790662 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Align cumin aliases for wikikube clusters

https://gerrit.wikimedia.org/r/790662

joanna_borun added a project: Spicerack. Jun 15 2022, 10:49 AM

Change 790662 merged by JMeybohm:

[operations/puppet@production] Align cumin aliases for wikikube clusters

https://gerrit.wikimedia.org/r/790662

Change 789680 merged by jenkins-bot:

[operations/cookbooks@master] Add a cookbook for rolling reboot of k8s clusters

https://gerrit.wikimedia.org/r/789680

Maintenance_bot removed a project: Patch-For-Review. Jun 16 2022, 1:30 PM

Change 806287 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] sre.k8s.reboot-nodes: Fix errors identified during dry-run

https://gerrit.wikimedia.org/r/806287

Change 806288 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] sre.k8s.reboot-node: Dynamically adjust batchsize

https://gerrit.wikimedia.org/r/806288

Change 806287 merged by jenkins-bot:

[operations/cookbooks@master] sre.k8s.reboot-nodes: Fix errors identified during dry-run

https://gerrit.wikimedia.org/r/806287

Change 806288 merged by jenkins-bot:

[operations/cookbooks@master] sre.k8s.reboot-node: Dynamically adjust batchsize

https://gerrit.wikimedia.org/r/806288

Maintenance_bot removed a project: Patch-For-Review. Jun 22 2022, 10:30 AM

Change 807504 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s.reboot-nodes: Fix call to super()._batchsize

https://gerrit.wikimedia.org/r/807504

gerritbot added a project: Patch-For-Review. Jun 22 2022, 10:37 AM

Change 807504 merged by jenkins-bot:

[operations/cookbooks@master] k8s.reboot-nodes: Fix call to super()._batchsize

https://gerrit.wikimedia.org/r/807504

Maintenance_bot removed a project: Patch-For-Review. Jun 22 2022, 11:30 AM

Change 811331 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/software/spicerack@master] k8s: Retry checks for expected pods on drain

https://gerrit.wikimedia.org/r/811331

gerritbot added a project: Patch-For-Review. Jul 5 2022, 4:08 PM

Change 811336 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/software/spicerack@master] k8s: Add KubernetesNode.taints propertry

https://gerrit.wikimedia.org/r/811336

JMeybohm updated the task description. Jul 6 2022, 1:18 PM

Change 811983 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/software/spicerack@master] k8s: Retry pod evictions on HTTP 429 from API server

https://gerrit.wikimedia.org/r/811983

Change 812325 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s/reboot-nodes: Error if nodes are cordoned

https://gerrit.wikimedia.org/r/812325

Change 811331 merged by jenkins-bot:

[operations/software/spicerack@master] k8s: Retry checks for expected pods on drain

https://gerrit.wikimedia.org/r/811331

Change 811336 merged by jenkins-bot:

[operations/software/spicerack@master] k8s: Add KubernetesNode.taints propertry

https://gerrit.wikimedia.org/r/811336

Change 811983 merged by jenkins-bot:

[operations/software/spicerack@master] k8s: Retry pod evictions on HTTP 429 from API server

https://gerrit.wikimedia.org/r/811983

Change 812325 merged by jenkins-bot:

[operations/cookbooks@master] k8s/reboot-nodes: Error if nodes are cordoned

https://gerrit.wikimedia.org/r/812325

Maintenance_bot removed a project: Patch-For-Review. Jul 20 2022, 10:31 AM

Change 815757 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/software/spicerack@master] k8s: Adapt retry parameters to reality

https://gerrit.wikimedia.org/r/815757

gerritbot added a project: Patch-For-Review. Jul 20 2022, 3:46 PM

JMeybohm updated the task description. Jul 21 2022, 8:10 AM

Change 815757 merged by jenkins-bot:

[operations/software/spicerack@master] k8s: Adapt retry parameters to reality

https://gerrit.wikimedia.org/r/815757

Mentioned in SAL (#wikimedia-operations) [2022-08-18T09:44:44Z] <jayme> dnsdisc depooling codfw for services running in kubernetes cluster (for 30-60min due to T310483, T260661)

Change 824491 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] sre.k8s.reboot-nodes: Don't sleep that long between batches

https://gerrit.wikimedia.org/r/824491

JMeybohm updated the task description. Aug 18 2022, 1:54 PM

Reboot of staging clusters and codfw (batchsize 1, took ~3.25 hours) went smoothly without any alerts apart from expected temporary BGP errors. Resolving this.

JMeybohm closed this task as Resolved. Aug 18 2022, 1:58 PM

Change 824491 merged by jenkins-bot:

[operations/cookbooks@master] sre.k8s.reboot-nodes: Don't sleep that long between batches

https://gerrit.wikimedia.org/r/824491

JMeybohm mentioned this in rCCKB199714bad9bd: Add a cookbook for rolling reboot of k8s clusters. Dec 14 2022, 3:30 PM

JMeybohm mentioned this in rCCKB87d267b1fdd4: sre.k8s.reboot-nodes: Fix errors identified during dry-run.

JMeybohm mentioned this in rCCKB070005e2bffd: sre.k8s.reboot-node: Dynamically adjust batchsize.

JMeybohm mentioned this in rCCKBbfa25dcc7f4c: k8s.reboot-nodes: Fix call to super()._batchsize.

JMeybohm mentioned this in rCCKBaf181e28f27a: k8s/reboot-nodes: Error if nodes are cordoned.

JMeybohm mentioned this in rCCKB2a81c7ae7ac8: sre.k8s.reboot-nodes: Don't sleep that long between batches.

JMeybohm merged a task: T212866: Create Spicerack cookbook to drain/reboot/uncordon a Kubernetes worker. Sep 22 2023, 9:07 AM

JMeybohm added subscribers: Volans, • fsero, akosiaris.