Certificate generation is broken in toolsbeta
Closed, Resolved (Public)

Description

At some point, toolsbeta stopped successfully producing Kubernetes accounts. The problem is that a CSR gets generated and approved, but the signed certificate never shows up in the CSR's status for use (which is the normal workflow for the certificates API).

After investigating for some time, I cannot quite find a conclusive error or issue. maintain-kubeusers is in a crashloop with a misleading error:

Certificate creation stalled or failed for <whatever>
Path /data/project/<tool>/.toolskube is not writable or failed to store certs somehow
Traceback (most recent call last):
  File "maintain_kubeusers.py", line 7, in <module>
    runpy.run_module("maintain_kubeusers", run_name="__main__")
  File "/usr/lib/python3.7/runpy.py", line 208, in run_module
    return _run_code(code, {}, init_globals, run_name, mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/app/maintain_kubeusers/__main__.py", line 7, in <module>
    main()
  File "/app/maintain_kubeusers/cli.py", line 174, in main
    tools, cur_users["tools"], k8s_api, args.gentle_mode
  File "/app/maintain_kubeusers/utils.py", line 55, in process_new_users
    api_server, ca_data, gentle
  File "/app/maintain_kubeusers/user.py", line 261, in write_kubeconfig
    self.write_certs()
  File "/app/maintain_kubeusers/user.py", line 271, in write_certs
    cert_file.write(self.cert)
TypeError: a bytes-like object is required, not 'NoneType'

Basically, it tries to read the certificate bytes from the certificate field in the status of the approved CSR and gets None because the field isn't there at all. The cluster is behaving as though no cluster signer is enabled, even though the appropriate CLI args are clearly set on the kube-controller-managers.
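To make this failure mode easier to spot, here is a minimal sketch of polling for the certificate and failing with an explicit error instead of the TypeError above. It is not the actual maintain-kubeusers code: `extract_certificate`, `wait_for_certificate`, `CertificateNotIssued` and the `fetch_csr` callable are hypothetical names, and it assumes the CSR arrives as a plain dict shaped like the JSON API object, with a base64-encoded `status.certificate`.

```python
import base64
import time
from typing import Callable, Optional


class CertificateNotIssued(RuntimeError):
    """Hypothetical error: the CSR never received status.certificate."""


def extract_certificate(csr: dict) -> Optional[bytes]:
    """Return the decoded certificate from a CSR dict, or None if it is absent.

    Assumes the JSON shape of the certificates API, where status.certificate
    is a base64-encoded string and is simply missing until a signer issues it.
    """
    encoded = (csr.get("status") or {}).get("certificate")
    if not encoded:
        return None
    return base64.b64decode(encoded)


def wait_for_certificate(
    fetch_csr: Callable[[], dict], attempts: int = 10, delay: float = 1.0
) -> bytes:
    """Poll fetch_csr() until the certificate appears, else raise a clear error.

    fetch_csr is a hypothetical callable that re-reads the CSR from the API.
    """
    for attempt in range(attempts):
        cert = extract_certificate(fetch_csr())
        if cert is not None:
            return cert
        if attempt < attempts - 1:
            time.sleep(delay)
    raise CertificateNotIssued(
        f"CSR has no status.certificate after {attempts} attempts; "
        "check that a signer is running in kube-controller-manager"
    )
```

With something like this in place, the crashloop would at least point at the signer rather than at the tool's home directory being unwritable.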

I've been investigating the new signerName field, which totally breaks some workflows, btw, but that doesn't seem directly related, since you can still use the legacy-unknown signer (kubernetes.io/legacy-unknown) in v1beta1. This problem doesn't seem to be affecting tools.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2021-08-21T00:17:56Z] <bstorm> rebooting the control plane nodes for kubernetes because it can't make things worse T289390

This could be related to changes we made for the update, but those same changes are confirmed to be in place on tools, so I don't think it's directly related.

The fact that the v1 version of the API cannot create certificates for external services like webhooks just sucks, but it is true (there's a PR related to this in k8s right now).

Bstorm claimed this task.

Ok, that's really annoying. Reboot fixed it.