
Improve cleanup behavior on failure for maintain-kubeusers
Closed, Resolved (Public)

Description

We've seen maintain-kubeusers fail at least twice since we moved etcd to ceph (see T267966). Each time, an etcd timeout caused its k8s requests to fail, and it could not continue without manual intervention because it had already completed creation of the CSR.

Fix this vulnerability by having it clean up its own messes or recognize the valid CSR and use it instead, whichever is easier or better.

This way, a failure causing a restart will restore functionality (also look for other areas where restart on failure won't work).
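
A minimal sketch of the "clean up its own messes" option, for illustration only. This is not the code from the merged patch, and it does not use the real Kubernetes client: `InMemoryCSRStore`, `Conflict`, and `ensure_csr` are hypothetical names standing in for the API calls. The idea is that a CSR left behind by a crashed run is deleted and recreated instead of blocking the next run.

```python
class Conflict(Exception):
    """Raised by the hypothetical store when a CSR name is already taken."""


class InMemoryCSRStore:
    """Stand-in for the Kubernetes API (assumption, NOT the real client)."""

    def __init__(self):
        self.csrs = {}

    def create(self, name, body):
        if name in self.csrs:
            raise Conflict(name)
        self.csrs[name] = body

    def delete(self, name):
        # Deleting a missing CSR is treated as a no-op.
        self.csrs.pop(name, None)


def ensure_csr(store, name, body):
    """Create a CSR, first clearing any leftover one from a crashed run."""
    try:
        store.create(name, body)
    except Conflict:
        # A previous run died after creating the CSR: remove it, retry once.
        store.delete(name)
        store.create(name, body)
    return store.csrs[name]
```

With this shape, restarting after a failure is safe: calling `ensure_csr` again with the same name succeeds whether or not an earlier run got as far as creating the CSR. Recognizing and reusing a still-valid CSR would be the alternative mentioned above; it is not shown here.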

Event Timeline

Bstorm created this task.

Change 656000 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[labs/tools/maintain-kubeusers@master] crashes: retry creating CSRs after crashes

https://gerrit.wikimedia.org/r/656000

Change 656000 merged by jenkins-bot:
[labs/tools/maintain-kubeusers@master] crashes: retry creating CSRs after crashes

https://gerrit.wikimedia.org/r/656000

Mentioned in SAL (#wikimedia-cloud) [2021-01-21T15:29:47Z] <bstorm> pushed the maintain-kubeusers:beta tag with the new code to the docker repo T271847

Confirmed this is working on toolsbeta using the toolsctl script. Deploying to tools.

Bstorm claimed this task.

Deployed. This shouldn't cause crash loops for this service anymore.

Mentioned in SAL (#wikimedia-cloud) [2021-01-21T23:58:38Z] <bstorm> deployed new maintain-kubeusers to tools T271847