Improve cleanup behavior on failure for maintain-kubeusers
Closed, Resolved, Public

Assigned To

Authored By

	• Bstorm
	Jan 12 2021, 6:26 PM

Description

We've seen maintain-kubeusers fail at least twice since we moved etcd to ceph (see T267966). Each failure was triggered by an etcd timeout that made its k8s requests fail, but it could not continue without manual intervention because it had already completed creation of the CSR.

Fix this vulnerability by having it clean up its own messes or recognize the valid CSR and use it instead, whichever is easier or better.

This way, a failure causing a restart will restore functionality (also look for other areas where restart on failure won't work).
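
A minimal sketch of the "clean up its own messes" option, for illustration only. It does not reproduce the merged patch: `FakeCSRApi`, `Conflict`, and `ensure_csr` are hypothetical names, and the in-memory class merely stands in for the real k8s API so the example runs on its own. The idea is that creation becomes safe to repeat after a crash, because a leftover CSR with the same name is deleted and recreated instead of causing a hard failure.

```python
class Conflict(Exception):
    """Hypothetical stand-in for an 'already exists' error from the API."""


class FakeCSRApi:
    """Hypothetical in-memory stand-in for the cluster's CSR endpoint."""

    def __init__(self):
        self.csrs = {}

    def create(self, name, body):
        # Like a real API server, refuse to create a duplicate object.
        if name in self.csrs:
            raise Conflict(name)
        self.csrs[name] = body

    def delete(self, name):
        # Deleting a missing object is treated as a no-op in this sketch.
        self.csrs.pop(name, None)


def ensure_csr(api, name, body):
    """Create a CSR, replacing any leftover from an interrupted run."""
    try:
        api.create(name, body)
    except Conflict:
        # Assumption: a CSR that already exists was left behind by a run
        # that crashed after creating it. Remove it under the same name
        # that was used to create it, then try again.
        api.delete(name)
        api.create(name, body)
    return name


if __name__ == "__main__":
    api = FakeCSRApi()
    api.create("tool-example", "stale request")  # simulate the crashed run
    ensure_csr(api, "tool-example", "fresh request")  # the restart succeeds
    print(api.csrs)
```

The alternative named above, recognizing the valid CSR and reusing it, would replace the delete-and-recreate branch with a validity check on the existing object; which of the two is easier depends on details this task does not state.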

Details

	Subject	Repo	Branch	Lines +/-
	crashes: retry creating CSRs after crashes	labs/tools/maintain-kubeusers	master	+25 -6

Related Objects

		Status	Subtype	Assigned	Task
		Resolved	BUG REPORT	• Bstorm	T271842 maintain-kubeusers broken in Toolforge
		Resolved		• Bstorm	T271847 Improve cleanup behavior on failure for maintain-kubeusers

Event Timeline

• Bstorm triaged this task as High priority. Jan 12 2021, 6:26 PM

• Bstorm created this task.

Change 656000 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[labs/tools/maintain-kubeusers@master] crashes: retry creating CSRs after crashes

https://gerrit.wikimedia.org/r/656000

gerritbot added a project: Patch-For-Review. Jan 13 2021, 10:09 PM

Change 656000 merged by jenkins-bot:
[labs/tools/maintain-kubeusers@master] crashes: retry creating CSRs after crashes

https://gerrit.wikimedia.org/r/656000

Mentioned in SAL (#wikimedia-cloud) [2021-01-21T15:29:47Z] <bstorm> pushed the maintain-kubeusers:beta tag with the new code to the docker repo T271847

Maintenance_bot removed a project: Patch-For-Review. Jan 21 2021, 4:10 PM

Confirmed this is working on toolsbeta using the toolsctl script. Deploying to tools.

Deployed. This shouldn't cause crash loops for this service anymore.

Mentioned in SAL (#wikimedia-cloud) [2021-01-21T23:58:38Z] <bstorm> deployed new maintain-kubeusers to tools T271847

bd808 mentioned this in T275910: maintain-kubeusers fails to recover from interrupted certificate creation due to mismatch between CSR name and delete request. Feb 26 2021, 8:24 PM
