
Decision Request - Toolforge policy agent enforcement model
Open, Medium, Public

Description

Problem

Regardless of which policy agent we finally choose for Toolforge (see T362233: Decision Request - Toolforge policy agent), and in addition to that decision, we also need to decide between a couple of options regarding how we want to enforce the different resource security policies. The options differ in the semantics and behavior of the platform.

Both Kyverno and OPA Gatekeeper can work in different modes:

  • enforcement via validation: reject resource definitions that don't meet the policies.
  • enforcement via mutation: mutate resource definitions so they conform to the policies. This is how PodSecurityPolicy has been working so far.
  • no enforcement, only audit: all resources are evaluated against the policies, and violations produce an audit record.

Example of validation:

  • given a policy that requires every Pod resource to have allowPrivilegeEscalation: false
  • if somebody tries to create a Pod resource with allowPrivilegeEscalation: true, reject it. An error message will be produced.
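
For illustration, here is a minimal sketch of what such a validating policy could look like, assuming Kyverno ends up being the agent chosen in T362233 (policy and rule names are made up; OPA Gatekeeper would express the same thing as a ConstraintTemplate written in Rego):

  apiVersion: kyverno.io/v1
  kind: ClusterPolicy
  metadata:
    name: deny-privilege-escalation      # hypothetical name
  spec:
    validationFailureAction: Enforce     # reject non-conforming resources
    rules:
      - name: check-allow-privilege-escalation
        match:
          any:
            - resources:
                kinds:
                  - Pod
        validate:
          message: "allowPrivilegeEscalation must be set to false"
          pattern:
            spec:
              containers:
                - securityContext:
                    allowPrivilegeEscalation: "false"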

Example of mutation:

  • given a policy that requires every Pod resource to have allowPrivilegeEscalation: false
  • every time a Pod resource is created, mutate it (modify it) to add allowPrivilegeEscalation: false. No error message will be produced.
  • this is how PodSecurityPolicy has been working so far
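
Again assuming Kyverno just for illustration (names are hypothetical), the same policy in mutation mode could look roughly like this, using a strategic-merge patch that is applied to every container of the incoming Pod:

  apiVersion: kyverno.io/v1
  kind: ClusterPolicy
  metadata:
    name: add-allow-privilege-escalation   # hypothetical name
  spec:
    rules:
      - name: set-allow-privilege-escalation
        match:
          any:
            - resources:
                kinds:
                  - Pod
        mutate:
          patchStrategicMerge:
            spec:
              containers:
                - (name): "*"               # anchor: patch all containers
                  securityContext:
                    allowPrivilegeEscalation: false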

Example of audit:

  • given a policy that requires every Pod resource to have allowPrivilegeEscalation: false
  • if a Pod resource doesn't conform to the policy, emit an audit record (but otherwise do nothing else).
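
In Kyverno, audit is not a separate rule type but a flag on the validating policy: the sketch from the validation example above would only need its action changed (field value as per recent Kyverno versions):

  spec:
    validationFailureAction: Audit   # report violations, don't reject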

Constraints and risks

  • this affects both ourselves, in the different -api components we have, and tool developers that have direct access to the k8s API.
  • the semantics are different and require a different level of commitment, especially for users accessing the k8s API directly.

Decision record

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T362872_Toolforge_policy_agent_enforcement_model

Options

Option 1

Enforcement via validation.

This makes everyone explicitly aware of the different policies we have in Toolforge kubernetes, given that they have to manually adapt their code to conform to them.

Pros:

  • possibly the simplest
  • the semantics are explicit: if a policy violation happens, a visible error is produced.

Cons:

  • may require code updates to conform to the policies.
  • given that policies can change, these code updates may be required on a continuous basis
  • not how PSP has been working so far

Option 2

Enforcement via mutation.

This doesn't make everyone explicitly aware of the different policies we have in Toolforge kubernetes, because mutation takes care of updating the resources to conform to the policies.

Pros:

  • this is how PodSecurityPolicy has been working so far
  • transparent enforcement for everyone, no error messages to decode
  • fewer code updates to track policy changes

Cons:

  • people are less aware of the different policies we have in Toolforge kubernetes
  • a piece of software arbitrarily updating resources sounds a bit scary.
  • it is not clear how mutation would work for policy changes and already-present resources. E.g., a given Pod was mutated to conform to the policy on date X, but the policy has since changed. What do we do with the already defined Pod?

Option 3

Combination:

  • validation for optional policies
  • mutation for mandatory policies

Given that some resource attributes could be optional, we could introduce some kind of mixed approach (see the sketch below).
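
Sketching this again with Kyverno (hypothetical names), both rule types can live in a single policy: a mutate rule silently injects the mandatory attribute, while a validate rule explicitly rejects violations of everything else:

  apiVersion: kyverno.io/v1
  kind: ClusterPolicy
  metadata:
    name: mixed-enforcement   # hypothetical name
  spec:
    validationFailureAction: Enforce
    rules:
      - name: inject-mandatory-settings    # mandatory: mutated transparently
        match:
          any:
            - resources:
                kinds:
                  - Pod
        mutate:
          patchStrategicMerge:
            spec:
              containers:
                - (name): "*"
                  securityContext:
                    allowPrivilegeEscalation: false
      - name: validate-optional-settings   # optional: explicit rejection
        match:
          any:
            - resources:
                kinds:
                  - Pod
        validate:
          message: "privileged containers are not allowed"
          pattern:
            spec:
              containers:
                - =(securityContext):
                    =(privileged): "false"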

Pros:

  • maybe the most flexible approach?

Cons:

  • perhaps the most confusing semantics, as some things happen automagically while others require explicit code changes.

Event Timeline

aborrero changed the task status from Open to In Progress. Thu, Apr 18, 11:39 AM
aborrero triaged this task as Medium priority.
aborrero moved this task from Inbox to Needs discussion on the cloud-services-team board.
aborrero moved this task from Backlog to Blocked on the User-aborrero board.

With my limited knowledge of the details, I would lean towards enforced validation so far: if anything or anyone tries to set an option that's not allowed, they are better off knowing about it.

Though I'll wait until more info is added to decide.

I think Option 1 is my preference.

What do we do with the already defined Pod?

This is something to verify in Option 1 too, I think. Does validation apply to existing resources, or only to newly created ones?

Kyverno has a pretty well-established behavior for this (see https://kyverno.io/docs/policy-reports/background/): it will report when already existing resources no longer conform to the new policies.

In a similar fashion, OPA Gatekeeper can perform audits of existing resources, see https://open-policy-agent.github.io/gatekeeper/website/docs/audit
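
For reference, in Kyverno this behavior is controlled per policy with a single field; a fragment:

  spec:
    background: true   # periodically re-evaluate existing resources;
                       # violations surface as PolicyReport resources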

In both cases, they don't mention anything about mutation. I assume this means that they cannot backfill (backmutate?) resources that made it to the system before the policy was updated.

I would assume the policy is applied when a Pod is created, and not when a Deployment/Job/etc. is? In that case, backfill support doesn't seem that important to me.

At least in the case of Kyverno (most likely with OPA Gatekeeper too), it auto-generates rules for resources that create other resources.

For example, if you create a policy to enforce a Pod-level securityContext, kyverno will auto-generate a policy for Deployment/Job/CronJob/DaemonSet resources to enforce the same in their pod template. This can be disabled, but I find it useful.
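
For reference, if we ever want to opt a specific policy out of this auto-generation, Kyverno documents an annotation for it (sketch; accepted values depend on the Kyverno version):

  metadata:
    annotations:
      pod-policies.kyverno.io/autogen-controllers: none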

aborrero claimed this task.

I'm fine with option 1 too, so I'm declaring this decision request done.

reopening, as I just noticed an important data point: as of today, PodSecurityPolicy works in mutation mode. It transparently modifies the resources being defined in the cluster.

So, we should take into account that mutation mode (option 2) is the option that keeps the existing behavior.

Please, share your thoughts.

Is there a way for us to see how many objects currently do not meet the policy? If there are not many, going with option 1 might be doable; otherwise it might require some time to first get everything valid against the policies, and then move to option 1 (if we eventually want to).

if PSP has been doing its job well so far, the number of objects not meeting the policy should be zero.

But also, as far as I can tell, PSP only mutates the resource for non-default/non-optional settings. For example, it will inject the RunAs:54321 config, because there is no way such config can be a default (it is per-tool). But it doesn't mutate privileged: false, because that is supposed to be the default.
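
To illustrate that distinction, a hypothetical PSP fragment (not our actual Toolforge policy): runAsUser carries a concrete per-tool range and therefore gets injected into Pods that omit it, while privileged is only validated:

  apiVersion: policy/v1beta1
  kind: PodSecurityPolicy
  metadata:
    name: tool-example              # hypothetical name
  spec:
    privileged: false               # validated only, never mutated
    allowPrivilegeEscalation: false
    runAsUser:
      rule: MustRunAs               # mutated: UID injected if the Pod omits it
      ranges:
        - min: 54321                # per-tool UID, no global default possible
          max: 54321
    seLinux:
      rule: RunAsAny
    supplementalGroups:
      rule: RunAsAny
    fsGroup:
      rule: RunAsAny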

<commented in the wrong ticket>

The decision about committing to drop the extra component on the upgrade to k8s 1.26 might become way more relevant [..]

This is maybe the wrong ticket? We also have T362233: Decision Request - Toolforge policy agent

yep :S