toolforge: Automatically stop pods in CrashLoopBackOff (and notify tool maintainers)
Open, Medium, Public

Assigned To

None

Authored By

	taavi
	Oct 10 2021, 4:40 PM

Description

The Toolforge Kubernetes cluster has a surprisingly large number of pods that are permanently in the CrashLoopBackOff state. No one seems to notice or care about these failing pods, and they likely can't do anything useful if they are constantly crashing, but they still consume resources on the cluster.

I propose adding a cron job of some sort that regularly looks for pods that haven't been able to start for a while and removes each such pod (and any deployments that control it). The script should send a warning to the tool maintainers in advance so they can fix the problem, and a notification after killing the pod.
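A minimal sketch of the decision logic such a job might use, not the actual delete-crashing-pods implementation. It works on pod objects in the shape of the Kubernetes API's JSON output and checks `status.containerStatuses[].state.waiting.reason`. The thresholds `WARN_AFTER` and `DELETE_AFTER`, the function names, and the use of the pod's `creationTimestamp` as a stand-in for "has been failing for a while" are all assumptions made for illustration. Deleting the controlling deployment and sending the emails are left out.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; the task does not specify any values.
WARN_AFTER = timedelta(days=7)
DELETE_AFTER = timedelta(days=14)


def is_crash_looping(pod: dict) -> bool:
    """Return True if any container is waiting with reason CrashLoopBackOff."""
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for status in statuses:
        waiting = (status.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            return True
    return False


def pod_age(pod: dict, now: datetime) -> timedelta:
    """Age of the pod, used here as a rough proxy for how long it has failed."""
    created = datetime.strptime(
        pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return now - created


def decide_action(pod: dict, now: datetime) -> str:
    """Return "ignore", "warn", or "delete" for a single pod."""
    if not is_crash_looping(pod):
        return "ignore"
    age = pod_age(pod, now)
    if age >= DELETE_AFTER:
        return "delete"
    if age >= WARN_AFTER:
        return "warn"
    return "ignore"
```

Keeping the decision as a pure function of the pod object and the current time makes a dry run mode straightforward: the job can log the chosen action for every pod without touching the cluster.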

Details

Subject	Repo	Branch	Lines +/-
toolforge: provision delete-crashing-pods values	operations/puppet	production	+47 -0
Add support for email notifications	cloud/toolforge/delete-crashing-pods	master	+100 -18
Initial commit	cloud/toolforge/delete-crashing-pods	master	+1 K -0

Related Objects

Mentioned In: rCTDC5c6227ae0d32: Add support for email notifications
rCTDC6c9baf231fac: Initial commit

Event Timeline

taavi created this task. Oct 10 2021, 4:40 PM

Restricted Application added a subscriber: Aklapper. Oct 10 2021, 4:40 PM

• nskaggs triaged this task as Medium priority. Oct 12 2021, 5:40 PM

taavi claimed this task. Oct 24 2021, 7:11 PM

Change 742237 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/toolforge/delete-crashing-pods@master] Initial commit

https://gerrit.wikimedia.org/r/742237

gerritbot added a project: Patch-For-Review. Nov 28 2021, 12:23 PM

Change 742237 merged by jenkins-bot:

[cloud/toolforge/delete-crashing-pods@master] Initial commit

https://gerrit.wikimedia.org/r/742237

taavi mentioned this in rCTDC6c9baf231fac: Initial commit. Dec 4 2021, 12:16 PM

Mentioned in SAL (#wikimedia-cloud) [2021-12-04T12:18:29Z] <majavah> deploying delete-crashing-pods in dry run mode T292925

Change 743574 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] toolforge: provision delete-crashing-pods values

https://gerrit.wikimedia.org/r/743574

The currently implemented parts of the cron job seem to work fine. Next step is to write the notification email sending code and puppetize installing the values file.

taavi moved this task from Backlog to In Progress on the Toolforge board. Dec 4 2021, 6:17 PM

Change 745984 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/toolforge/delete-crashing-pods@master] Add support for email notifications

https://gerrit.wikimedia.org/r/745984

Change 745984 merged by jenkins-bot:

[cloud/toolforge/delete-crashing-pods@master] Add support for email notifications

https://gerrit.wikimedia.org/r/745984

Mentioned in SAL (#wikimedia-cloud) [2021-12-14T09:46:22Z] <majavah> testing delete-crashing-pods emailer component with a test tool T292925

taavi mentioned this in rCTDC5c6227ae0d32: Add support for email notifications. Dec 14 2021, 9:46 AM

• bd808 moved this task from Inbox to Doing on the cloud-services-team (Kanban) board. Sep 27 2022, 9:31 PM

Change 743574 abandoned by Majavah:

[operations/puppet@production] toolforge: provision delete-crashing-pods values

Reason:

old patch, nowadays we do things differently

https://gerrit.wikimedia.org/r/743574

fnegri edited projects, added cloud-services-team; removed cloud-services-team (Kanban). Jan 18 2023, 7:25 PM

fnegri moved this task from Kanban to Doing? (legacy column) on the cloud-services-team board.

fnegri moved this task from Doing? (legacy column) to Inbox on the cloud-services-team board. Jan 19 2023, 1:02 PM

Not actively working on this.

Maintenance_bot removed a project: Patch-For-Review. Sep 26 2023, 2:13 PM

dcaro moved this task from Ready to be worked on to Workspace for triaging whenever needed on the Toolforge board. Jan 24 2024, 1:51 PM

dcaro moved this task from Workspace for triaging whenever needed to Ready to be worked on on the Toolforge board. Feb 21 2024, 4:03 PM
