
[k8s,infra] kyverno has a track record of overloading the cluster, maybe in new ways
Closed, Resolved (Public)

Description

On a quick search:

As of this writing, all of them except the first one are open tickets, meaning upstream does not consider them resolved.

However, none of the upstream tickets perfectly match our setup. We have about 3.5k policies, with 2 rules each.
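For context, one of these namespaced policies would look roughly like this (a sketch only: the rule names, checks, namespace, and registry are illustrative, not our actual policy):

kubectl apply -f - <<'EOF'
apiVersion: kyverno.io/v1
kind: Policy                          # namespaced, one per tool namespace
metadata:
  name: pod-policy
  namespace: tool-example             # hypothetical namespace
spec:
  validationFailureAction: Enforce
  background: true                    # also scanned by the background controller
  rules:
    - name: require-run-as-non-root   # rule 1 of 2, illustrative
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: Pods must not run as root.
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
    - name: restrict-image-registry   # rule 2 of 2, illustrative
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: Images must come from the allowed registry.
        pattern:
          spec:
            containers:
              - image: "docker-registry.example.org/*"
EOF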

We may want to submit a ticket upstream to see whether we are navigating uncharted waters regarding the scale and setup of our kyverno deployment, or whether we have simply hit a bug.

Event Timeline

aborrero changed the task status from Open to In Progress. Jun 13 2024, 9:13 AM
aborrero triaged this task as High priority.
aborrero created this task.
aborrero moved this task from Backlog to Doing on the User-aborrero board.
dcaro renamed this task from toolforge: kyverno has a track record of overloading the cluster, maybe in new ways to [k8s,infra] kyverno has a track record of overloading the cluster, maybe in new ways. Jun 13 2024, 9:59 AM

I tested this:

I was hoping that introducing resource limits on kyverno would reduce the blast radius.
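For the record, the kind of limit I mean can be set like this (a sketch; the deployment name follows the upstream chart's admission controller and the values are illustrative):

# assumes the upstream chart's deployment name and the kyverno namespace
kubectl -n kyverno set resources deployment/kyverno-admission-controller \
  --requests=cpu=250m,memory=256Mi \
  --limits=cpu=1,memory=512Mi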

I have a very bad feeling about this.

I don't think we can move forward with this software :-(

aborrero claimed this task.

My theory of what is happening here:

Once all the policies are installed, the api-server forwards requests to kyverno so it can validate them. There is a timeout for this (see the sketch below), and because of the huge number of rules, kyverno can't respond in time, causing the api-server to deny the request.
Kyverno also scans all resource objects in the cluster in the background to make sure they comply with the policies. This also injects a huge load into the k8s control plane.
I tried setting resource limits for kyverno, but that only results in the kyverno pods crashing and getting OOM-killed, which introduces yet more instability into the whole process.
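The timeout mentioned in the first point can be inspected on kyverno's webhook configuration (a sketch; the object name is kyverno's default and may vary by version). If the webhooks are configured with failurePolicy: Fail, a timed-out call is treated as a rejection, which is consistent with the denials described above:

# print name, timeout and failure policy for each registered webhook
kubectl get validatingwebhookconfiguration \
  kyverno-resource-validating-webhook-config \
  -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.timeoutSeconds}{"\t"}{.failurePolicy}{"\n"}{end}'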

This suggests to me that this may not be the right architecture for what we want to accomplish.

I will be proposing that we stop trying to replace PSP with a policy agent, see T367950: Decision Request - Toolforge pod security via custom admission webhook.

I sent additional information upstream; in particular, I shared how to reproduce the problem, in case they are interested:

https://github.com/kyverno/kyverno/issues/10458#issuecomment-2178997459

Upstream replied with a couple of questions and a few recommendations. I will run another test and report back.

Another data point: yesterday @fnegri pointed me to the Kyverno Slack channel, #kyverno. In the channel, I saw an invitation to an event that same day about a new feature in kyverno that apparently deals with performance and cluster overload.

See https://community.cncf.io/events/details/cncf-cncf-online-programs-presents-cloud-native-live-kyvernos-report-server-a-new-approach-to-policy-report-management/

Managing policy and governance in busy Kubernetes clusters was difficult due to the high volume of policy reports, cluster policy reports, and ephemeral reports generated by Kyverno. This caused overloading of the API server and etcd, leading to poor cluster performance. Kyverno's new Reports Server addresses this issue by offloading these reports to a separate database, resulting in a 70% reduction in etcd consumption. Attend this session to discover how the Kyverno team tackled this complex problem using API Aggregation and the advantages of storing reports in a dedicated database.
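To get a feel for the report volume the session talks about, counting the report objects on a cluster is enough (a sketch; these are kyverno's standard report CRDs):

# namespaced and cluster-scoped policy reports, respectively
kubectl get policyreports --all-namespaces --no-headers | wc -l
kubectl get clusterpolicyreports --no-headers | wc -l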

I did further tests, including:

With all this, I was able to create 4000 policy resources, each in its own namespace, without a single crash in any of the relevant components, and with a clean run of the functional tests:

local.tf-test@lima-kilo:~$ toolforge_run_functional_tests.sh 
builds-api/build-smoke-test.bats
 ✓ start build [449]
 ✓ list build [462]
 ✓ tail logs and wait (slow) [176648]
 ✓ show finished build (slow) [470]
 ✓ delete build [1277]
 ✓ quota [491]
 ✓ delete all [454]
 ✓ clean [876]
envvars-api/envvars-smoke-test.bats
 ✓ create envvar [445]
 ✓ list envvars [461]
 ✓ show envvars [456]
 ✓ envvars are set inside jobs [21202]
 ✓ delete envvar [1311]
 ✓ quota [479]
jobs-api/continuous-job-healthcheck.bats
 ✓ run a continuous job with script healthcheck passing [15567]
 ✓ run a continuous job with script healthcheck failing [15229]
jobs-api/continuous-job-port.bats
 ✓ run a continuous job without port shows no port [1388]
 ✓ run a continuous job with a port shows port [1423]
 ✓ run a continuous job with a port exposes port [45637]
jobs-api/continuous-job-smoke-test.bats
 ✓ run a simple continuous job [26391]
jobs-api/dump-and-load.bats
 ✓ do a simple dump and load [3237]
 ✓ doing a load does not flush all other jobs (T364204) [3729]
jobs-api/one-off-job-smoke-test.bats
 ✓ run a simple one-off job [26308]
jobs-api/scheduled-job-smoke-test.bats
 ✓ run a simple scheduled job [33406]
webservice/webservice-smoke-test.bats
 ✓ status of stopped webservice [450]
 ✓ start webservice [16090]
 ✓ get logs [43812]
 ✓ restart [16040]
 - can be reached by external url (skipped) [36]
 ✓ stop [1074]

30 tests, 0 failures, 1 skipped in 458 seconds
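For reference, a policy load of that size can be generated with a loop along these lines (a sketch of one way to do it, not necessarily the exact script; policy-template.yaml is a hypothetical file holding a namespaced Policy with no namespace field):

# create 4000 namespaces, each with one policy applied into it
for i in $(seq 1 4000); do
  kubectl create namespace "loadtest-${i}"
  kubectl -n "loadtest-${i}" apply -f policy-template.yaml
done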

We redeployed kyverno (see T368044: Toolforge: redeploy kyverno after the outage) and it seems happy at the moment.
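A quick way to keep an eye on it (assuming the standard kyverno namespace; restart counts in the pod listing are the first thing to watch):

kubectl -n kyverno get deployments
kubectl -n kyverno get pods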