
[k8s,infra] kyverno has a track record of overloading the cluster, maybe in new ways
Closed, Resolved (Public)

Description

On a quick search:

As of this writing, all of them except the first one are open tickets, meaning upstream does not consider them resolved.

However, none of the upstream tickets perfectly match our setup. We have about 3.5k policies, with 2 rules each.
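For context, one of these namespaced policies would look roughly like this (a sketch only: the rule names, checks, namespace, and registry are illustrative, not our actual policy):

kubectl apply -f - <<'EOF'
apiVersion: kyverno.io/v1
kind: Policy                          # namespaced, one per tool namespace
metadata:
  name: pod-policy
  namespace: tool-example             # hypothetical namespace
spec:
  validationFailureAction: Enforce
  background: true                    # also scanned by the background controller
  rules:
    - name: require-run-as-non-root   # rule 1 of 2, illustrative
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: Pods must not run as root.
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
    - name: restrict-image-registry   # rule 2 of 2, illustrative
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: Images must come from the allowed registry.
        pattern:
          spec:
            containers:
              - image: "docker-registry.example.org/*"
EOF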

We may want to submit a ticket upstream to see whether we are navigating uncharted waters regarding the scale and setup of our kyverno deployment, or whether we have simply hit a bug.

Event Timeline

aborrero changed the task status from Open to In Progress. Jun 13 2024, 9:13 AM
aborrero triaged this task as High priority.
aborrero created this task.
aborrero moved this task from Backlog to Doing on the User-aborrero board.
dcaro renamed this task from toolforge: kyverno has a track record of overloading the cluster, maybe in new ways to [k8s,infra] kyverno has a track record of overloading the cluster, maybe in new ways. Jun 13 2024, 9:59 AM

I tested this:

I was hoping that introducing resource limits on kyverno would reduce the blast radius.
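For the record, the kind of limit I mean can be set like this (a sketch; the deployment name follows the upstream chart's admission controller and the values are illustrative):

# assumes the upstream chart's deployment name and the kyverno namespace
kubectl -n kyverno set resources deployment/kyverno-admission-controller \
  --requests=cpu=250m,memory=256Mi \
  --limits=cpu=1,memory=512Mi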

I have a very bad feeling about this.

I don't think we can move forward with this software :-(

aborrero claimed this task.

My theory of what is happening here:

Once all the policies are installed, the api-server forwards requests to kyverno so it can validate them. There is a timeout for this (see the sketch below), and because of the huge number of rules, kyverno can't respond in time, causing the api-server to deny the request.
Kyverno also scans all resource objects in the cluster in the background to make sure they comply with the policies. This also injects a huge load into the k8s control plane.
I tried setting resource limits for kyverno, but that only results in the kyverno pods crashing and getting OOM-killed, which introduces yet more instability into the whole process.
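The timeout mentioned in the first point can be inspected on kyverno's webhook configuration (a sketch; the object name is kyverno's default and may vary by version). If the webhooks are configured with failurePolicy: Fail, a timed-out call is treated as a rejection, which is consistent with the denials described above:

# print name, timeout and failure policy for each registered webhook
kubectl get validatingwebhookconfiguration \
  kyverno-resource-validating-webhook-config \
  -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.timeoutSeconds}{"\t"}{.failurePolicy}{"\n"}{end}'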

This suggests to me that this may not be the right architecture for what we want to accomplish.

I will be proposing that we stop trying to replace PSP with a policy agent, see T367950: Decision Request - Toolforge pod security via custom admission webhook.

I sent additional information upstream; in particular, I shared how to reproduce the problem, in case they are interested:

https://github.com/kyverno/kyverno/issues/10458#issuecomment-2178997459

Upstream replied with a couple of questions and a few recommendations. I will run another test and report back.

Another data point: yesterday @fnegri pointed me to the Kyverno Slack channel, #kyverno. In the channel, I saw an invitation to an event that same day about a new feature in kyverno that apparently deals with performance and cluster overload.

See https://community.cncf.io/events/details/cncf-cncf-online-programs-presents-cloud-native-live-kyvernos-report-server-a-new-approach-to-policy-report-management/

Managing policy and governance in busy Kubernetes clusters was difficult due to the high volume of policy reports, cluster policy reports, and ephemeral reports generated by Kyverno. This caused overloading of the API server and etcd, leading to poor cluster performance. Kyverno's new Reports Server addresses this issue by offloading these reports to a separate database, resulting in a 70% reduction in etcd consumption. Attend this session to discover how the Kyverno team tackled this complex problem using API Aggregation and the advantages of storing reports in a dedicated database.
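To get a feel for the report volume the session talks about, counting the report objects on a cluster is enough (a sketch; these are kyverno's standard report CRDs):

# namespaced and cluster-scoped policy reports, respectively
kubectl get policyreports --all-namespaces --no-headers | wc -l
kubectl get clusterpolicyreports --no-headers | wc -l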

I did further tests, including:

With all this, I was able to create 4000 policy resources, each in its own namespace, without a single crash in any of the relevant components, and with a clean run of the functional tests:

local.tf-test@lima-kilo:~$ toolforge_run_functional_tests.sh 
builds-api/build-smoke-test.bats
 ✓ start build [449]
 ✓ list build [462]
 ✓ tail logs and wait (slow) [176648]
 ✓ show finished build (slow) [470]
 ✓ delete build [1277]
 ✓ quota [491]
 ✓ delete all [454]
 ✓ clean [876]
envvars-api/envvars-smoke-test.bats
 ✓ create envvar [445]
 ✓ list envvars [461]
 ✓ show envvars [456]
 ✓ envvars are set inside jobs [21202]
 ✓ delete envvar [1311]
 ✓ quota [479]
jobs-api/continuous-job-healthcheck.bats
 ✓ run a continuous job with script healthcheck passing [15567]
 ✓ run a continuous job with script healthcheck failing [15229]
jobs-api/continuous-job-port.bats
 ✓ run a continuous job without port shows no port [1388]
 ✓ run a continuous job with a port shows port [1423]
 ✓ run a continuous job with a port exposes port [45637]
jobs-api/continuous-job-smoke-test.bats
 ✓ run a simple continuous job [26391]
jobs-api/dump-and-load.bats
 ✓ do a simple dump and load [3237]
 ✓ doing a load does not flush all other jobs (T364204) [3729]
jobs-api/one-off-job-smoke-test.bats
 ✓ run a simple one-off job [26308]
jobs-api/scheduled-job-smoke-test.bats
 ✓ run a simple scheduled job [33406]
webservice/webservice-smoke-test.bats
 ✓ status of stopped webservice [450]
 ✓ start webservice [16090]
 ✓ get logs [43812]
 ✓ restart [16040]
 - can be reached by external url (skipped) [36]
 ✓ stop [1074]

30 tests, 0 failures, 1 skipped in 458 seconds
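For reference, a policy load of that size can be generated with a loop along these lines (a sketch of one way to do it, not necessarily the exact script; policy-template.yaml is a hypothetical file holding a namespaced Policy with no namespace field):

# create 4000 namespaces, each with one policy applied into it
for i in $(seq 1 4000); do
  kubectl create namespace "loadtest-${i}"
  kubectl -n "loadtest-${i}" apply -f policy-template.yaml
done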

We redeployed kyverno (see T368044: Toolforge: redeploy kyverno after the outage) and it seems happy at the moment.
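A quick way to keep an eye on it (assuming the standard kyverno namespace; restart counts in the pod listing are the first thing to watch):

kubectl -n kyverno get deployments
kubectl -n kyverno get pods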