We want to be able to reboot all workers in a Kubernetes cluster automatically.
The cookbook is basically done, but there are some open issues:
- Pods might take slightly longer than their termination grace period to actually be removed (addressed in https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/811331)
- Pods covered by a PodDisruptionBudget regularly fail eviction:
Failed to evict pod Pod(istio-system/istiod-579dbd8c88-psffr) from node Node(kubestage2002.codfw.wmnet): (429) Reason: Too Many Requests HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Wed, 06 Jul 2022 12:40:05 GMT', 'Content-Length': '320'}) HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Cannot evict pod as it would violate the pod's disruption budget.","reason":"TooManyRequests","details":{"causes":[{"reason":"DisruptionBudget","message":"The disruption budget istiod needs 1 healthy pods and has 1 currently"}]},"code":429}
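One possible way to handle the 429 above is to retry the eviction for a while, since the budget may be satisfied again once a replacement pod becomes healthy elsewhere. The following is only a minimal sketch, not what the cookbook or spicerack currently does: `evict_with_retry`, its parameters, and the `evict` callable are hypothetical names. It assumes the raised error exposes the HTTP code as a `status` attribute, as the Kubernetes Python client's `ApiException` does.

```python
import time


def evict_with_retry(evict, attempts=5, delay=10.0, sleep=time.sleep):
    """Call evict() until it succeeds or the attempts are exhausted.

    Hypothetical helper: `evict` is any zero-argument callable that performs
    one eviction request. Only errors with status == 429 are retried, which
    is what the API server returns when an eviction would violate a
    PodDisruptionBudget.
    """
    for attempt in range(1, attempts + 1):
        try:
            return evict()
        except Exception as exc:
            # Re-raise anything that is not a 429, and the final 429 too.
            if getattr(exc, "status", None) != 429 or attempt == attempts:
                raise
            # Give the replacement pod time to become healthy.
            sleep(delay)
```

Note that retrying cannot help when a budget can never be satisfied during a drain, for example a single-replica deployment that requires 1 healthy pod. In that case the last 429 is re-raised and the caller has to decide whether to skip the node or fail.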