
maintain-kubeusers container in CrashLoopBackoff preventing new tool creation after 'user-maintainer' ClusterRole changes
Closed, Resolved · Public · BUG REPORT

Description

[21:28]  <  legoktm> maybe I'm being impatient, but it's taking longer than I expected for the "tour-nyc" tool I created to become... become-able
[21:38]  <    bd808> legoktm: it usually takes 5-10 minutes, but there might be something wrong with maintain-kubeusers?
[21:38]  <  legoktm> hm, we're at ~15min now
[21:38]  <    bd808> yeah, that's a bit too long
[21:39]  <    bd808> now I have to remember where this stuff runs...
[21:39]  <  legoktm> I'm supposed to demo this live in uhh...an hour, but I'm just cannibalizing another spare tool I have lying around for now
[21:39]  <  legoktm> so no urgency specifically from me other than it is broken :p
[21:45]  <    bd808> !log admin maintain-kubeusers container in CrashLoopBackoff, investigating

Event Timeline

bd808 triaged this task as High priority. Mar 8 2023, 9:49 PM
bd808 moved this task from Inbox to Clinic Duty on the cloud-services-team board.
$ kubectl sudo -n maintain-kubeusers logs maintain-kubeusers-b6c6d7c5c-kh75v
starting a run
Homedir already exists for /data/project/chtholly
Wrote config in /data/project/chtholly/.kube/config
PodSecurityPolicy tool-chtholly-psp already exists
Namespace tool-chtholly already exists
Role tool-chtholly-psp already exists
Could not create toolforge-tfb-psp role for chtholly
Traceback (most recent call last):
  File "/app/maintain_kubeusers.py", line 7, in <module>
    runpy.run_module("maintain_kubeusers", run_name="__main__")
  File "/usr/lib/python3.9/runpy.py", line 228, in run_module
    return _run_code(code, {}, init_globals, run_name, mod_spec)
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/app/maintain_kubeusers/__main__.py", line 7, in <module>
    main()
  File "/app/maintain_kubeusers/cli.py", line 162, in main
    new_tools = process_new_users(
  File "/app/maintain_kubeusers/utils.py", line 63, in process_new_users
    k8s_api.add_user_access(user_list[user_name])
  File "/app/maintain_kubeusers/k8s_api.py", line 800, in add_user_access
    self.process_buildpack_rbac(user.name)
  File "/app/maintain_kubeusers/k8s_api.py", line 659, in process_buildpack_rbac
    _ = self.rbac.create_namespaced_role(
  File "/app/venv/lib/python3.9/site-packages/kubernetes/client/api/rbac_authorization_v1_api.py", line 324, in create_namespaced_role
    return self.create_namespaced_role_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/app/venv/lib/python3.9/site-packages/kubernetes/client/api/rbac_authorization_v1_api.py", line 419, in create_namespaced_role_with_http_info
    return self.api_client.call_api(
  File "/app/venv/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/app/venv/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/app/venv/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/app/venv/lib/python3.9/site-packages/kubernetes/client/rest.py", line 275, in POST
    return self.request("POST", url,
  File "/app/venv/lib/python3.9/site-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'e76349e3-a0be-4217-9dea-10b63204e554', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'd77de175-725f-414a-9b4e-8b719e411c2c', 'Date': 'Wed, 08 Mar 2023 21:49:20 GMT', 'Content-Length': '627'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"roles.rbac.authorization.k8s.io \"tfb-chtholly-psp\" is forbidden: user \"system:serviceaccount:maintain-kubeusers:user-maintainer\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:maintain-kubeusers\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"extensions\"], Resources:[\"podsecuritypolicies\"], ResourceNames:[\"toolforge-tfb-psp\"], Verbs:[\"use\"]}","reason":"Forbidden","details":{"name":"tfb-chtholly-psp","group":"rbac.authorization.k8s.io","kind":"roles"},"code":403}

The 403 is Kubernetes RBAC privilege-escalation prevention: a subject may only create a Role granting permissions it currently holds itself, and the podsecuritypolicies grant had just been removed from the user-maintainer ClusterRole. Quick hack to get things working again by restoring that grant:

$ kubectl sudo -n maintain-kubeusers edit ClusterRole user-maintainer
  # Add this back in:
- apiGroups:
  - extensions
  resources:
  - podsecuritypolicies
  verbs:
  - create
  - get
  - list
  - patch
  - update
  - use
  - delete
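
To confirm the restored grant before watching the pod recover, an impersonated access check works. A sketch, assuming admin credentials that are allowed to impersonate the service account (tool-chtholly is just one of the affected namespaces):

$ kubectl auth can-i use podsecuritypolicies.extensions/toolforge-tfb-psp \
    --as=system:serviceaccount:maintain-kubeusers:user-maintainer \
    -n tool-chtholly
  # should answer "yes" once the ClusterRole grant is back, "no" before
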
$ kubectl sudo -n maintain-kubeusers logs -f maintain-kubeusers-b6c6d7c5c-kh75v
starting a run
Homedir already exists for /data/project/chtholly
Wrote config in /data/project/chtholly/.kube/config
PodSecurityPolicy tool-chtholly-psp already exists
Namespace tool-chtholly already exists
Role tool-chtholly-psp already exists
Provisioned creds for user chtholly
Homedir already exists for /data/project/del-simple-bad-redirects
Wrote config in /data/project/del-simple-bad-redirects/.kube/config
PodSecurityPolicy tool-del-simple-bad-redirects-psp already exists
Namespace tool-del-simple-bad-redirects already exists
Role tool-del-simple-bad-redirects-psp already exists
Provisioned creds for user del-simple-bad-redirects
Homedir already exists for /data/project/tour-nyc
Wrote config in /data/project/tour-nyc/.kube/config
PodSecurityPolicy tool-tour-nyc-psp already exists
Namespace tool-tour-nyc already exists
Role tool-tour-nyc-psp already exists
Provisioned creds for user tour-nyc
Homedir already exists for /data/project/ashleybot
Wrote config in /data/project/ashleybot/.kube/config
Renewed creds for tool ashleybot
Homedir already exists for /data/project/terabot
Wrote config in /data/project/terabot/.kube/config
Renewed creds for tool terabot
finished run, wrote 3 new accounts, disabled 2 accounts, cleaned up 0 accounts

Mentioned in SAL (#wikimedia-cloud) [2023-03-08T22:31:29Z] <bd808> Live hacked user-maintainer clusterrole to work around breakage in T331572

The service is working again because I live hacked the missing podsecuritypolicies grants back into the user-maintainer clusterrole, but this will break again as soon as someone runs the helm deploy for maintain-kubeusers.

Hopefully @taavi or @aborrero can figure out what code changes are needed in the maintain-kubeusers python code to match the podsecuritypolicies removal for the pending Kubernetes upgrade.
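
For the durable fix, the two sides have to agree: the rules maintain-kubeusers writes into each per-tool Role, and the grants the user-maintainer ClusterRole actually holds. PodSecurityPolicy has been served only from the policy API group since extensions stopped serving it in Kubernetes 1.16, so presumably the reworked ClusterRole keeps the grant under policy alone, roughly like this (an assumption about the chart, not its actual contents):

- apiGroups:
  - policy
  resources:
  - podsecuritypolicies
  verbs:
  - create
  - get
  - list
  - patch
  - update
  - use
  - delete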

bd808 renamed this task from maintain-kubeusers container in CrashLoopBackoff preventing new tool creation to maintain-kubeusers container in CrashLoopBackoff preventing new tool creation after 'user-maintainer' ClusterRole changes. Mar 8 2023, 10:36 PM

Sorry, I did not notice that. Looks like it's the creation of the RBAC Role that fails, not the creation of the PSP itself.

Change 896019 had a related patch set uploaded (by Majavah; author: Majavah):

[labs/tools/maintain-kubeusers@master] k8s_api: rbac: use 'policy' group

https://gerrit.wikimedia.org/r/896019

Change 896019 merged by jenkins-bot:

[labs/tools/maintain-kubeusers@master] k8s_api: rbac: use 'policy' group

https://gerrit.wikimedia.org/r/896019
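
For readers without Gerrit handy: the change swaps the API group in the per-tool Role that process_buildpack_rbac creates. A minimal sketch of that kind of one-line fix, assuming the kubernetes Python client seen in the traceback (names here are illustrative, not the exact maintain-kubeusers code):

from kubernetes import client

def buildpack_psp_role(tool: str) -> client.V1Role:
    # Role letting the tool's accounts 'use' the shared buildpack PSP.
    # Requesting it from the 'policy' API group instead of the
    # long-deprecated 'extensions' group matches what the reworked
    # user-maintainer ClusterRole still holds, so RBAC escalation
    # prevention no longer rejects the create call.
    return client.V1Role(
        metadata=client.V1ObjectMeta(
            name=f"tfb-{tool}-psp",
            namespace=f"tool-{tool}",
        ),
        rules=[
            client.V1PolicyRule(
                api_groups=["policy"],  # was ["extensions"]
                resources=["podsecuritypolicies"],
                resource_names=["toolforge-tfb-psp"],
                verbs=["use"],
            )
        ],
    )

# Creation then goes through the same call seen in the traceback:
#   client.RbacAuthorizationV1Api().create_namespaced_role(
#       f"tool-{tool}", buildpack_psp_role(tool))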

This bug is fixed. T331619 is tracking a follow-up.