Problem
As part of the migration away from Kubernetes PodSecurityPolicy to a more modern mechanism, we decided to go with Kyverno. See:
- T279110: [infra] Replace PodSecurityPolicy in Toolforge Kubernetes
- T362233: Decision Request - Toolforge policy agent
However, introducing Kyverno has proven very difficult (or impossible) at the scale of Toolforge. See:
- T367348: Incident: 2024-06-12 toolforge k8s control plane
- T367386: [k8s,infra] kyverno has a track record of overloading the cluster, maybe on new ways
The other policy agent option, OPA Gatekeeper, has a similar architecture and works in a similar way, so it may be a dead end as well.
We should decide whether to continue with this approach or to write our own custom admission webhook controller, as we already do for a few other things.
Constraints and risks
- High cost in engineering time.
Decision record
TBD https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_XYZ_Toolforge_pod_security_via_custom_admission_webhook
Options
Option 1
Forget about policy agents and create our own custom admission webhooks to enforce pod security settings.
We already maintain a few other custom admission controllers, and having another one should be no big deal. However, there is definitely code to write and maintain.
Pros:
- The most performant option (compared to Kyverno at least).
- Well-supported pattern within the Kubernetes ecosystem.
- We already have other custom admission controllers, so we should already know how to do this.
Cons:
- New code to write and maintain.
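To give a rough idea of the amount of logic involved, here is a minimal sketch of the decision part of a validating admission webhook for pods. It is only an illustration under stated assumptions: the three rules (no hostNetwork, no privileged containers, runAsNonRoot required) and the function names `pod_violations` and `review` are hypothetical and are not the actual Toolforge policy or the code of our existing controllers. Only the `admission.k8s.io/v1` AdmissionReview request/response shape comes from Kubernetes itself.

```python
"""Sketch of the decision logic of a validating admission webhook for pods.

The rules below are hypothetical examples, not the real Toolforge policy.
"""


def pod_violations(pod: dict) -> list:
    """Return human-readable violations of the (hypothetical) pod rules."""
    spec = pod.get("spec") or {}
    violations = []

    # Hypothetical rule 1: pods may not share the host network namespace.
    if spec.get("hostNetwork", False):
        violations.append("hostNetwork is not allowed")

    pod_ctx = spec.get("securityContext") or {}
    pod_non_root = pod_ctx.get("runAsNonRoot", False)

    containers = (spec.get("initContainers") or []) + (spec.get("containers") or [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}

        # Hypothetical rule 2: no privileged containers.
        if ctx.get("privileged", False):
            violations.append(f"container {name}: privileged is not allowed")

        # Hypothetical rule 3: runAsNonRoot must be true. A container-level
        # value, when present, takes precedence over the pod-level one.
        if not ctx.get("runAsNonRoot", pod_non_root):
            violations.append(f"container {name}: runAsNonRoot must be true")

    return violations


def review(admission_review: dict) -> dict:
    """Build an AdmissionReview v1 response for an AdmissionReview request."""
    request = admission_review["request"]
    violations = pod_violations(request["object"])

    # The response must echo the uid of the request it answers.
    response = {"uid": request["uid"], "allowed": not violations}
    if violations:
        response["status"] = {"code": 403, "message": "; ".join(violations)}

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

A real deployment would additionally need an HTTPS server that decodes the request body and calls `review`, plus a ValidatingWebhookConfiguration that points the API server at it. The point of the sketch is that every rule is evaluated in a single pass over the incoming pod, so the cost does not depend on the number of policy objects stored in the cluster.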
Option 2
Keep experimenting with and researching policy agents. If Kyverno does not work, try OPA Gatekeeper.
Maybe we did not get the policy agent setup/architecture right, and there is a way to configure them to perform well.
Perhaps there is a way, unknown at the moment, to enforce the same pod configuration without 3.5k policy rules.
We could work with upstream to make sure our use case is supported, and that what we want to do is both possible and sensible.
Pros:
- We would rely on an upstream project instead of our own source code. Others write and maintain it.
Cons:
- There is strong evidence that this is a dead end.