toolforge: Automatically stop pods in CrashLoopBackOff (and notify tool maintainers)
Open, Medium, Public

Description

The Toolforge Kubernetes cluster has a surprisingly large number of pods that are permanently in the CrashLoopBackOff state. No one seems to notice or care about these failing pods, and they likely can't do anything useful if they are constantly crashing, yet they still consume resources on the cluster.

I propose adding a cron job of some sort that regularly looks for pods that haven't been able to start for a while and removes them (along with any deployments that control them). The script should warn the tool maintainers in advance so they can fix the problem, and send a notification after removing the pod.
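A minimal sketch of the detection step, not the actual delete-crashing-pods code. It assumes pod objects shaped like the items in `kubectl get pods -o json` output, and it uses the pod's `status.startTime` as a crude proxy for how long the pod has been failing. The function names, the returned labels, and the 7 and 14 day thresholds are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a Kubernetes timestamp such as '2021-12-04T12:18:29Z' as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def is_crash_looping(pod):
    """Return True if any container in the pod is waiting in CrashLoopBackOff."""
    for container in pod.get("status", {}).get("containerStatuses", []):
        waiting = container.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            return True
    return False


def classify_pod(pod, now, warn_after=timedelta(days=7), delete_after=timedelta(days=14)):
    """Decide what the cron job should do with one pod.

    Returns 'ok', 'pending', 'warn' or 'delete' (hypothetical labels).
    Assumption: the time since status.startTime approximates how long the
    pod has been crashing. A pod that ran fine for months before it began
    to crash would be flagged too early by this proxy.
    """
    if not is_crash_looping(pod):
        return "ok"
    age = now - parse_k8s_time(pod["status"]["startTime"])
    if age >= delete_after:
        return "delete"
    if age >= warn_after:
        return "warn"
    return "pending"
```

A cron job built on this would call `classify_pod` for every pod, email the maintainers for `warn` results, and remove the pod and its controlling deployment for `delete` results. A dry run mode would only log those decisions.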

Event Timeline

Change 742237 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/toolforge/delete-crashing-pods@master] Initial commit

https://gerrit.wikimedia.org/r/742237

Change 742237 merged by jenkins-bot:

[cloud/toolforge/delete-crashing-pods@master] Initial commit

https://gerrit.wikimedia.org/r/742237

Mentioned in SAL (#wikimedia-cloud) [2021-12-04T12:18:29Z] <majavah> deploying delete-crashing-pods in dry run mode T292925

Change 743574 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] toolforge: provision delete-crashing-pods values

https://gerrit.wikimedia.org/r/743574

The currently implemented parts of the cron job seem to work fine. Next step is to write the notification email sending code and puppetize installing the values file.
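For illustration only, the notification step could look roughly like the sketch below, which builds the warning and removal emails with Python's standard `email.message.EmailMessage`. This is not the code from the patches on this task. The function name, the `action` values, the wording of the messages, and the sender address are all hypothetical, and the recipient address is left to the caller.

```python
from email.message import EmailMessage


def build_notification(tool, pod_name, action, recipient, sender="noreply@example.org"):
    """Build (but do not send) a notification email for one crashing pod.

    action is 'warn' (sent in advance) or 'delete' (sent after removal).
    All names and wording here are illustrative assumptions.
    """
    if action not in ("warn", "delete"):
        raise ValueError(f"unknown action: {action!r}")

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient

    if action == "warn":
        msg["Subject"] = f"[{tool}] Pod {pod_name} is crash looping and will be removed"
        body = (
            f"The pod {pod_name} of tool {tool} has been in CrashLoopBackOff for a while.\n"
            "Please fix the problem, or the pod and any deployment that controls it "
            "will be removed automatically.\n"
        )
    else:
        msg["Subject"] = f"[{tool}] Pod {pod_name} was removed"
        body = (
            f"The pod {pod_name} of tool {tool} was removed because it kept crashing.\n"
            "Any deployment that controlled it was removed as well.\n"
        )

    msg.set_content(body)
    return msg
```

Sending is left out on purpose. The returned message could be handed to `smtplib.SMTP.send_message` once the mail relay for the cluster is known.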

Change 745984 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/toolforge/delete-crashing-pods@master] Add support for email notifications

https://gerrit.wikimedia.org/r/745984

Change 745984 merged by jenkins-bot:

[cloud/toolforge/delete-crashing-pods@master] Add support for email notifications

https://gerrit.wikimedia.org/r/745984

Mentioned in SAL (#wikimedia-cloud) [2021-12-14T09:46:22Z] <majavah> testing delete-crashing-pods emailer component with a test tool T292925

Change 743574 abandoned by Majavah:

[operations/puppet@production] toolforge: provision delete-crashing-pods values

Reason:

old patch, nowadays we do things differently

https://gerrit.wikimedia.org/r/743574

taavi removed taavi as the assignee of this task. Sep 26 2023, 1:56 PM
taavi moved this task from In Progress to Ready to be worked on on the Toolforge board.

Not actively working on this.