
[maintain-kubeusers,infra,k8s]: introduce some logic to backfill maintain-kubeuser resources (like per-tool kyverno policies)
Closed, ResolvedPublic

Description

In the context of T279110: [infra] Replace PodSecurityPolicy in Toolforge Kubernetes, we need to create some mechanism that can backfill existing tools with the new kyverno policies. Otherwise, maintain-kubeusers will only create them for new accounts.

Since this is a problem we have had in the past (and will have every time we change the maintain-kubeusers resources), we could make such a mechanism generic.

See also:

Event Timeline

aborrero triaged this task as Medium priority.
aborrero moved this task from Backlog to Next on the User-aborrero board.
aborrero moved this task from Backlog to Ready to be worked on on the Toolforge board.

We can have maintain-kubeusers inject a couple of labels into all resources:

  • app.kubernetes.io/managed-by: maintain-kubeusers
  • toolforge.org/maintain-kubeusers-git-id: 5bf5e0447b258c3925d248509c5f9c250d2d85d3

Either a missing namespace or a namespace with the wrong git-id is a trigger for maintain-kubeusers to operate.

This means that when we deploy a new version of maintain-kubeusers, it will loop at least once over all tool account namespaces to:

  • query for all namespaced objects that match the first label, but not the second, which is an indication of a resource that needs to be refreshed/recreated.
  • add maybe-missing resources (new with this git-id)
  • remove maybe-leftover resources (no longer tracked in this git-id)
  • refresh the git-id in the namespace resource.
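The loop described above could be sketched roughly as follows. This is only an illustration of the planning step, not the actual maintain-kubeusers code: `Plan`, `stale_selector` and `plan_namespace` are hypothetical names, and the existing objects are modelled as a plain dict of name to labels rather than real Kubernetes API calls. The selector relies on the fact that a `!=` label selector also matches objects that lack the label entirely.

```python
from dataclasses import dataclass, field

# Label keys and value taken from the proposal above.
MANAGED_BY_LABEL = "app.kubernetes.io/managed-by"
GIT_ID_LABEL = "toolforge.org/maintain-kubeusers-git-id"
MANAGED_BY_VALUE = "maintain-kubeusers"


@dataclass
class Plan:
    """Hypothetical container for the actions needed in one namespace."""
    refresh: list = field(default_factory=list)
    create: list = field(default_factory=list)
    delete: list = field(default_factory=list)


def stale_selector(current_git_id: str) -> str:
    """Label selector for managed objects whose git-id differs or is absent."""
    return f"{MANAGED_BY_LABEL}={MANAGED_BY_VALUE},{GIT_ID_LABEL}!={current_git_id}"


def plan_namespace(existing: dict, desired: set, current_git_id: str) -> Plan:
    """Compute what to refresh, create and delete in one tool namespace.

    existing: resource name -> labels dict, as found in the namespace.
    desired:  resource names tracked by the running git-id (assumption:
              names are unique within the namespace).
    """
    plan = Plan()
    for name, labels in sorted(existing.items()):
        if labels.get(MANAGED_BY_LABEL) != MANAGED_BY_VALUE:
            continue  # not managed by us, leave it alone
        if name not in desired:
            plan.delete.append(name)  # leftover, no longer tracked
        elif labels.get(GIT_ID_LABEL) != current_git_id:
            plan.refresh.append(name)  # stale, created by an older version
    plan.create = sorted(desired - set(existing))  # missing in the namespace
    return plan
```

Once the plan for a namespace has been applied, the git-id label on the namespace itself would be updated, so the next run skips it.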

The basic idea sounds good to me. Using the Git hash means that all tools will be processed on the first boot after every maintain-kubeusers commit, which is fine as long as the processing code is written correctly.

There is currently a ConfigMap in each namespace that is already used to track certificate expiration and quota versions. Can we re-use that instead of introducing a new mechanism (namespace label) for tracking per-tool state?

Working on a resource abstraction that would allow storing the state of each resource in the ConfigMap.
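A minimal sketch of what such an abstraction might look like, assuming each resource keeps a small JSON document under its own key in the per-tool ConfigMap (ConfigMap values must be strings). The `do_create` and `reconcile` names appear in the traceback further down this task; everything else here (the `Resource` base class, the `version` field, the state layout) is a hypothetical illustration and not the code in the linked merge request.

```python
import json
from typing import Dict, Iterable


class Resource:
    """Hypothetical base class for one kind of per-tool resource."""

    name = "resource"  # key used in the ConfigMap data
    version = "1"      # bump when the resource definition changes

    def do_create(self, data: dict) -> dict:
        """Create or refresh the resource and return its updated state."""
        raise NotImplementedError


def reconcile(resources: Iterable[Resource],
              configmap_data: Dict[str, str]) -> Dict[str, str]:
    """Return new ConfigMap data after bringing every resource up to date.

    A resource is (re)created when it has no stored state, or when the
    stored version differs from the version known to the running code.
    """
    new_data = dict(configmap_data)
    for resource in resources:
        state = json.loads(configmap_data.get(resource.name, "{}"))
        if state.get("version") == resource.version:
            continue  # already up to date, nothing to do
        state = resource.do_create(state)
        state["version"] = resource.version
        new_data[resource.name] = json.dumps(state, sort_keys=True)
    return new_data
```

With this shape, backfilling a new resource type for existing tools only requires adding a new `Resource` subclass: on the next run its key is missing from every ConfigMap, so it is created everywhere.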

Code here: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/23

aborrero changed the task status from Open to In Progress. May 22 2024, 1:34 PM
aborrero moved this task from Next to Doing on the User-aborrero board.
dcaro renamed this task from toolforge: introduce some logic to backfill maintain-kubeuser resources (like per-tool kyverno policies) to [maintain-kubeusers,infra,k8s]: introduce some logic to backfill maintain-kubeuser resources (like per-tool kyverno policies). May 22 2024, 1:35 PM

I got this trace when deploying in toolsbeta:

aborrero@toolsbeta-test-k8s-control-7:~$ sudo -i kubectl -n maintain-kubeusers logs --timestamps=true maintain-kubeusers-6f556984f7-k425x
2024-05-27T15:55:02.052487165Z starting a run
2024-05-27T15:55:02.385499173Z account: 'gitlab-webhooks-beta' resource: 'namespace' needs create
2024-05-27T15:55:02.441507238Z namespace tool-gitlab-webhooks-beta already exists
2024-05-27T15:55:02.488481482Z account: 'k8s-status' resource: 'namespace' needs create
2024-05-27T15:55:02.521815198Z namespace tool-k8s-status already exists
2024-05-27T15:55:02.556980835Z account: 'automated-toolforge-tests' resource: 'namespace' needs create
2024-05-27T15:55:02.591086603Z namespace tool-automated-toolforge-tests already exists
2024-05-27T15:55:02.606811164Z account: 'automated-toolforge-tests' resource: 'buildpackrbac' needs create
2024-05-27T15:55:02.687468589Z Role tfb-automated-toolforge-tests-psp already exists
2024-05-27T15:55:02.757116235Z Could not create tfb-automated-toolforge-tests-tool-binding rolebinding for automated-toolforge-tests
2024-05-27T15:55:02.757166087Z Traceback (most recent call last):
2024-05-27T15:55:02.757178207Z   File "/app/maintain_kubeusers_cli.py", line 6, in <module>
2024-05-27T15:55:02.757392228Z     runpy.run_module("maintain_kubeusers", run_name="__main__")
2024-05-27T15:55:02.757446234Z   File "<frozen runpy>", line 229, in run_module
2024-05-27T15:55:02.757456512Z   File "<frozen runpy>", line 88, in _run_code
2024-05-27T15:55:02.757605223Z   File "/app/maintain_kubeusers/__main__.py", line 6, in <module>
2024-05-27T15:55:02.757642900Z     main()
2024-05-27T15:55:02.757669866Z   File "/app/maintain_kubeusers/cli.py", line 156, in main
2024-05-27T15:55:02.757776629Z     do_run(
2024-05-27T15:55:02.757793842Z   File "<decorator-gen-1>", line 2, in do_run
2024-05-27T15:55:02.757803131Z   File "/opt/lib/python/site-packages/prometheus_client/context_managers.py", line 80, in wrapped
2024-05-27T15:55:02.757927589Z     return func(*args, **kwargs)
2024-05-27T15:55:02.758032999Z            ^^^^^^^^^^^^^^^^^^^^^
2024-05-27T15:55:02.758076578Z   File "/app/maintain_kubeusers/cli.py", line 82, in do_run
2024-05-27T15:55:02.758155670Z     tools[tool].reconcile()
2024-05-27T15:55:02.758188668Z   File "/app/maintain_kubeusers/user.py", line 70, in reconcile
2024-05-27T15:55:02.758468871Z     self.resources.reconcile()
2024-05-27T15:55:02.758486068Z   File "/app/maintain_kubeusers/resources/resources.py", line 145, in reconcile
2024-05-27T15:55:02.758642074Z     resource_data = resource.do_create(resource_data)
2024-05-27T15:55:02.758671519Z                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-05-27T15:55:02.758683717Z   File "/app/maintain_kubeusers/resources/buildpack_rbac.py", line 91, in do_create
2024-05-27T15:55:02.758861981Z     _ = self.rbac.create_namespaced_role_binding(
2024-05-27T15:55:02.758874281Z         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-05-27T15:55:02.758886890Z   File "/opt/lib/python/site-packages/kubernetes/client/api/rbac_authorization_v1_api.py", line 478, in create_namespaced_role_binding
2024-05-27T15:55:02.759309482Z     return self.create_namespaced_role_binding_with_http_info(namespace, body, **kwargs)  # noqa: E501
2024-05-27T15:55:02.759330884Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-05-27T15:55:02.759342019Z   File "/opt/lib/python/site-packages/kubernetes/client/api/rbac_authorization_v1_api.py", line 577, in create_namespaced_role_binding_with_http_info
2024-05-27T15:55:02.759780312Z     return self.api_client.call_api(
2024-05-27T15:55:02.759802733Z            ^^^^^^^^^^^^^^^^^^^^^^^^^
2024-05-27T15:55:02.759815446Z   File "/opt/lib/python/site-packages/kubernetes/client/api_client.py", line 348, in call_api
2024-05-27T15:55:02.759992856Z     return self.__call_api(resource_path, method,
2024-05-27T15:55:02.760005599Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-05-27T15:55:02.760153170Z   File "/opt/lib/python/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
2024-05-27T15:55:02.760190376Z     response_data = self.request(
2024-05-27T15:55:02.760199935Z                     ^^^^^^^^^^^^^
2024-05-27T15:55:02.760237437Z   File "/opt/lib/python/site-packages/kubernetes/client/api_client.py", line 391, in request
2024-05-27T15:55:02.760472646Z     return self.rest_client.POST(url,
2024-05-27T15:55:02.760502485Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-05-27T15:55:02.760512876Z   File "/opt/lib/python/site-packages/kubernetes/client/rest.py", line 275, in POST
2024-05-27T15:55:02.760682927Z     return self.request("POST", url,
2024-05-27T15:55:02.760694893Z            ^^^^^^^^^^^^^^^^^^^^^^^^^
2024-05-27T15:55:02.760703373Z   File "/opt/lib/python/site-packages/kubernetes/client/rest.py", line 234, in request
2024-05-27T15:55:02.761366368Z     raise ApiException(http_resp=r)
2024-05-27T15:55:02.761818002Z kubernetes.client.exceptions.ApiException: (403)
2024-05-27T15:55:02.761884544Z Reason: Forbidden
2024-05-27T15:55:02.761899814Z HTTP response headers: HTTPHeaderDict({'Audit-Id': '87f4ad25-cce7-4e3f-9f55-4e1094d1da3d', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '33437296-db1d-4588-a29f-8fc7ddb1879a', 'X-Kubernetes-Pf-Prioritylevel-Uid': '9f010111-8a93-464a-b4b2-8d7ad4ff37f0', 'Date': 'Mon, 27 May 2024 15:55:02 GMT', 'Content-Length': '693'})
2024-05-27T15:55:02.762309323Z HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rolebindings.rbac.authorization.k8s.io \"tfb-automated-toolforge-tests-tool-binding\" is forbidden: user \"system:serviceaccount:maintain-kubeusers:user-maintainer\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:maintain-kubeusers\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"extensions\"], Resources:[\"podsecuritypolicies\"], ResourceNames:[\"toolforge-tfb-psp\"], Verbs:[\"use\"]}","reason":"Forbidden","details":{"name":"tfb-automated-toolforge-tests-tool-binding","group":"rbac.authorization.k8s.io","kind":"rolebindings"},"code":403}
2024-05-27T15:55:02.762328424Z 
2024-05-27T15:55:02.762338522Z 

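Apparently, the error is related to something we don't use: the role binding tries to grant `use` on the `toolforge-tfb-psp` PodSecurityPolicy, a permission the maintain-kubeusers service account does not hold itself. I dropped the resource from the maintain-kubeusers refactor and sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/1036640 .

Now deployed in toolsbeta without errors.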

Deployment plan:

Rollback plan, in case of small, simple issue:

Rollback plan, in case of not a small, simple issue:

project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/298

maintain-kubeusers: bump to 0.0.134-20240603123028-c8c2ea33