
maintain-kubeusers fails to recover from interrupted certificate creation due to mismatch between CSR name and delete request
Closed, Resolved (Public)

Description

Missing k8s creds for the ores-inspect tool were reported on IRC. Investigation showed that the maintain-kubeusers pod was in CrashLoopBackOff. Each time, it reported this error before dying:

$ kubectl logs -f maintain-kubeusers-7f7b44754c-sffzd -n maintain-kubeusers
starting a run
CSR for tool-ores-inspect already exists, deleting
Traceback (most recent call last):
  File "/app/maintain_kubeusers/k8s_api.py", line 152, in generate_csr
    self.create_new_csr(private_key, user, org_name)
  File "/app/maintain_kubeusers/k8s_api.py", line 145, in create_new_csr
    self.certs.create_certificate_signing_request(body=csr_body)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api/certificates_v1beta1_api.py", line 57, in create_certificate_signing_request
    (data) = self.create_certificate_signing_request_with_http_info(body, **kwargs)  # noqa: E501
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api/certificates_v1beta1_api.py", line 141, in create_certificate_signing_request_with_http_info
    collection_formats=collection_formats)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 345, in call_api
    _preload_content, _request_timeout)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 176, in __call_api
    _request_timeout=_request_timeout)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 388, in request
    body=body)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/rest.py", line 278, in POST
    body=body)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 26 Feb 2021 19:52:10 GMT', 'Content-Length': '306'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"certificatesigningrequests.certificates.k8s.io \"tool-ores-inspect\" already exists","reason":"AlreadyExists","details":{"name":"tool-ores-inspect","group":"certificates.k8s.io","kind":"certificatesigningrequests"},"code":409}



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/maintain_kubeusers.py", line 7, in <module>
    runpy.run_module("maintain_kubeusers", run_name="__main__")
  File "/usr/lib/python3.7/runpy.py", line 208, in run_module
    return _run_code(code, {}, init_globals, run_name, mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/app/maintain_kubeusers/__main__.py", line 7, in <module>
    main()
  File "/app/maintain_kubeusers/cli.py", line 163, in main
    tools, cur_users["tools"], k8s_api, args.gentle_mode
  File "/app/maintain_kubeusers/utils.py", line 42, in process_new_users
    k8s_api.generate_csr(user_list[user_name].pk, user_name)
  File "/app/maintain_kubeusers/k8s_api.py", line 162, in generate_csr
    user, body=client.V1DeleteOptions()
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api/certificates_v1beta1_api.py", line 168, in delete_certificate_signing_request
    (data) = self.delete_certificate_signing_request_with_http_info(name, **kwargs)  # noqa: E501
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api/certificates_v1beta1_api.py", line 261, in delete_certificate_signing_request_with_http_info
    collection_formats=collection_formats)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 345, in call_api
    _preload_content, _request_timeout)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 176, in __call_api
    _request_timeout=_request_timeout)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 411, in request
    body=body)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/rest.py", line 268, in DELETE
    body=body)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 26 Feb 2021 19:52:10 GMT', 'Content-Length': '286'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"certificatesigningrequests.certificates.k8s.io \"ores-inspect\" not found","reason":"NotFound","details":{"name":"ores-inspect","group":"certificates.k8s.io","kind":"certificatesigningrequests"},"code":404}

Manual inspection of CSRs showed:

$ kubectl get csr
NAME                AGE     REQUESTOR                                                  CONDITION
tool-ores-inspect   3h33m   system:serviceaccount:maintain-kubeusers:user-maintainer   Approved,Issued

There is a mismatch in the code: the cleanup that tries to delete an existing CSR does not add the "tool-" prefix to the CSR name, so the delete request targets "ores-inspect" instead of "tool-ores-inspect" and fails with a 404. Manually running kubectl delete csr/tool-ores-inspect unblocked things and got the service running as expected again.

Event Timeline

The cleanup code was introduced by @Bstorm for T271847: Improve cleanup behavior on failure for maintain-kubeusers. I think that this block is missing something:

maintain_kubeusers/k8s_api.py
self.certs.delete_certificate_signing_request(
    user, body=client.V1DeleteOptions()
)

Specifically, I think it is missing something like the metadata=client.V1ObjectMeta(name="tool-{}".format(user)), line that appears in the body of the CSR created in create_new_csr.
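To illustrate the mismatch described above, here is a minimal sketch of one way to keep the create and delete names in sync: build the CSR name in a single helper and pass it to both calls. This is not the merged patch. The helper csr_name, the FakeCertsApi stand-in, and the two exception classes are hypothetical and exist only so the example runs without a cluster. The real client takes the CSR name as the first argument to delete_certificate_signing_request, as the traceback shows.

```python
class ConflictError(Exception):
    """Hypothetical stand-in for a 409 AlreadyExists API error."""


class NotFoundError(Exception):
    """Hypothetical stand-in for a 404 NotFound API error."""


def csr_name(user, prefix="tool-"):
    """Build the CSR object name in one place (hypothetical helper)."""
    return "{}{}".format(prefix, user)


class FakeCertsApi:
    """In-memory stand-in for the certificates API, for illustration only."""

    def __init__(self):
        self.csrs = set()

    def create_certificate_signing_request(self, body):
        name = body["metadata"]["name"]
        if name in self.csrs:
            raise ConflictError(name)
        self.csrs.add(name)

    def delete_certificate_signing_request(self, name):
        if name not in self.csrs:
            raise NotFoundError(name)
        self.csrs.remove(name)


def generate_csr(certs, user):
    """Create a CSR, replacing a leftover one from an interrupted run."""
    name = csr_name(user)
    body = {"metadata": {"name": name}}
    try:
        certs.create_certificate_signing_request(body=body)
    except ConflictError:
        # Delete using the same prefixed name that was used on create.
        # Passing the bare user name here is the bug from this task.
        certs.delete_certificate_signing_request(name)
        certs.create_certificate_signing_request(body=body)
    return name


if __name__ == "__main__":
    certs = FakeCertsApi()
    certs.csrs.add("tool-ores-inspect")  # leftover from an interrupted run
    print(generate_csr(certs, "ores-inspect"))
```

With a leftover "tool-ores-inspect" CSR in place, the sketch deletes and recreates it rather than raising. Swapping csr_name(user) for user in the delete call reproduces the not-found failure seen in the logs.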

bd808 renamed this task from maintain-kubeusers fails to recover from interrupted certificate creation due to typo to maintain-kubeusers fails to recover from interrupted certificate creation due to mismatch between CSR name and delete request. Feb 26 2021, 8:26 PM

Change 667300 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[labs/tools/maintain-kubeusers@master] cleanup: cleaning up CSRs only works if you try to delete the right one

https://gerrit.wikimedia.org/r/667300

Change 667300 merged by jenkins-bot:
[labs/tools/maintain-kubeusers@master] cleanup: cleaning up CSRs only works if you try to delete the right one

https://gerrit.wikimedia.org/r/667300

Mentioned in SAL (#wikimedia-cloud) [2021-02-27T02:23:56Z] <bstorm> deployed typo fix to maintain-kubeusers in an innocent effort to make the weekend better T275910

Bstorm claimed this task.