
Decision Request - Toolforge policy agent enforcement model
Open, Medium, Public

Description

Problem

Regardless of which policy agent we finally choose for Toolforge (see T362233: Decision Request - Toolforge policy agent), and in addition to that decision, we also need to decide between a couple of options regarding how we want to enforce the different resource security policies. The options differ in the semantics and behavior of the platform.

Both Kyverno and OPA Gatekeeper can work in different modes:

  • enforcement via validation: reject resource definitions that don't meet the policies.
  • enforcement via mutation: mutate resource definitions so they conform to the policies. This is how PodSecurityPolicy has been working so far.
  • no enforcement, only audit: all resources are evaluated against the policies, and violations produce an audit record.

Example of validation:

  • given a policy that requires every Pod resource to have allowPrivilegeEscalation: false
  • if somebody tries to create a Pod resource with allowPrivilegeEscalation: true, reject it. An error message will be produced.
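
For illustration, here is a minimal sketch of what such a validating policy could look like, assuming Kyverno ends up being the agent chosen in T362233 (policy and rule names are made up; OPA Gatekeeper would express the same thing as a ConstraintTemplate written in Rego):

  apiVersion: kyverno.io/v1
  kind: ClusterPolicy
  metadata:
    name: deny-privilege-escalation      # hypothetical name
  spec:
    validationFailureAction: Enforce     # reject non-conforming resources
    rules:
      - name: check-allow-privilege-escalation
        match:
          any:
            - resources:
                kinds:
                  - Pod
        validate:
          message: "allowPrivilegeEscalation must be set to false"
          pattern:
            spec:
              containers:
                - securityContext:
                    allowPrivilegeEscalation: "false"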

Example of mutation:

  • given a policy that requires every Pod resource to have allowPrivilegeEscalation: false
  • every time a Pod resource is created, mutate it (modify it) to add allowPrivilegeEscalation: false. No error message will be produced.
  • this is how PodSecurityPolicy has been working so far
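
Again assuming Kyverno just for illustration (names are hypothetical), the same policy in mutation mode could look roughly like this, using a strategic-merge patch that is applied to every container of the incoming Pod:

  apiVersion: kyverno.io/v1
  kind: ClusterPolicy
  metadata:
    name: add-allow-privilege-escalation   # hypothetical name
  spec:
    rules:
      - name: set-allow-privilege-escalation
        match:
          any:
            - resources:
                kinds:
                  - Pod
        mutate:
          patchStrategicMerge:
            spec:
              containers:
                - (name): "*"               # anchor: patch all containers
                  securityContext:
                    allowPrivilegeEscalation: false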

Example of audit:

  • given a policy that requires every Pod resource to have allowPrivilegeEscalation: false
  • if a Pod resource doesn't conform to the policy, emit an audit record (but otherwise do nothing else).
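
In Kyverno, audit is not a separate rule type but a flag on the validating policy: the sketch from the validation example above would only need its action changed (field value as per recent Kyverno versions):

  spec:
    validationFailureAction: Audit   # report violations, don't reject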

Constraints and risks

  • this affects both ourselves, in the different -api components we have, and tool developers that have direct access to the k8s API.
  • the semantics are different and require a different level of commitment, especially for users accessing the k8s API directly.

Decision record

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T362872_Toolforge_policy_agent_enforcement_model

Options

Option 1

Enforcement via validation.

This makes everyone explicitly aware of the different policies we have in Toolforge kubernetes, given that they have to manually adapt their code to conform to them.

Pros:

  • possibly the simplest
  • the semantics are explicit: if a policy violation happens, a visible error is produced.

Cons:

  • may require code updates to conform to the policies.
  • given that policies can change, these code updates may be required on a continuous basis
  • not how PSP has been working so far

Option 2

Enforcement via mutation.

This doesn't make everyone explicitly aware of the different policies we have in Toolforge kubernetes, because mutation takes care of updating the resources to conform to the policies.

Pros:

  • this is how PodSecurityPolicy has been working so far
  • transparent enforcement for everyone, no error messages to decode
  • fewer code updates to track policy changes

Cons:

  • people are less aware of the different policies we have in Toolforge kubernetes
  • a piece of software arbitrarily updating resources sounds a bit scary.
  • it is not clear how mutation would work for policy changes and already-present resources. E.g., a given Pod was mutated to conform to the policy on date X, but the policy has since changed. What do we do with the already defined Pod?

Option 3

Combination:

  • validation for optional policies
  • mutation for mandatory policies

Given that some resource attributes could be optional, we could introduce some kind of mixed approach (see the sketch below).
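
Sketching this again with Kyverno (hypothetical names), both rule types can live in a single policy: a mutate rule silently injects the mandatory attribute, while a validate rule explicitly rejects violations of everything else:

  apiVersion: kyverno.io/v1
  kind: ClusterPolicy
  metadata:
    name: mixed-enforcement   # hypothetical name
  spec:
    validationFailureAction: Enforce
    rules:
      - name: inject-mandatory-settings    # mandatory: mutated transparently
        match:
          any:
            - resources:
                kinds:
                  - Pod
        mutate:
          patchStrategicMerge:
            spec:
              containers:
                - (name): "*"
                  securityContext:
                    allowPrivilegeEscalation: false
      - name: validate-optional-settings   # optional: explicit rejection
        match:
          any:
            - resources:
                kinds:
                  - Pod
        validate:
          message: "privileged containers are not allowed"
          pattern:
            spec:
              containers:
                - =(securityContext):
                    =(privileged): "false"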

Pros:

  • maybe the most flexible approach?

Cons:

  • perhaps the most confusing semantics, as some things happen automagically while others require explicit code changes.

Event Timeline

aborrero changed the task status from Open to In Progress. Thu, Apr 18, 11:39 AM
aborrero triaged this task as Medium priority.
aborrero moved this task from Inbox to Needs discussion on the cloud-services-team board.
aborrero moved this task from Backlog to Blocked on the User-aborrero board.

With my limited knowledge of the details, I would lean towards enforced validation so far: if anything or anyone tries to set an option that's not allowed, they are better off knowing about it.

Though I'll wait until more info is added to decide.

I think Option 1 is my preference.

What do we do with the already defined Pod?

This is something to verify in Option 1 too, I think. Does validation apply to existing resources, or only to newly created ones?

Kyverno has a pretty well-established behavior for this (see https://kyverno.io/docs/policy-reports/background/): it will report when already existing resources no longer conform to the new policies.

In a similar fashion, OPA Gatekeeper can perform audits of existing resources, see https://open-policy-agent.github.io/gatekeeper/website/docs/audit
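
For reference, in Kyverno this behavior is controlled per policy with a single field; a fragment:

  spec:
    background: true   # periodically re-evaluate existing resources;
                       # violations surface as PolicyReport resources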

In both cases, they don't mention anything about mutation. I assume this means that they cannot backfill (backmutate?) resources that made it to the system before the policy was updated.

I would assume the policy is applied when a Pod is created, and not when a Deployment/Job/etc. is? In that case, backfill support doesn't seem that important to me.

At least in the case of Kyverno (most likely with OPA Gatekeeper too), it auto-generates rules for resources that create other resources.

For example, if you create a policy to enforce a Pod-level securityContext, kyverno will auto-generate a policy for Deployment/Job/CronJob/DaemonSet resources to enforce the same in their pod template. This can be disabled, but I find it useful.
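
For reference, if we ever want to opt a specific policy out of this auto-generation, Kyverno documents an annotation for it (sketch; accepted values depend on the Kyverno version):

  metadata:
    annotations:
      pod-policies.kyverno.io/autogen-controllers: none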

aborrero claimed this task.

I'm fine with option 1 too, so I'm declaring this decision request done.

reopening, as I just noticed an important data point: as of today, PodSecurityPolicy works in mutation mode. It transparently modifies the resources being defined in the cluster.

So, we should take into account that mutation mode (option 2) is the option that keeps the existing behavior.

Please, share your thoughts.

Is there a way for us to see how many objects currently do not meet the policy? If there are not many, going with option 1 might be doable; otherwise it might require some time to first get everything valid against the policies, and then move to option 1 (if we eventually want to).

if PSP has been doing its job well so far, the number of objects not meeting the policy should be zero.

But also, as far as I can tell, PSP only mutates the resource for non-default/non-optional settings. For example, it will inject the RunAs:54321 config, because there is no way such config can be a default (it is per-tool). But it doesn't mutate privileged: false, because that is supposed to be the default.
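
To illustrate that distinction, a hypothetical PSP fragment (not our actual Toolforge policy): runAsUser carries a concrete per-tool range and therefore gets injected into Pods that omit it, while privileged is only validated:

  apiVersion: policy/v1beta1
  kind: PodSecurityPolicy
  metadata:
    name: tool-example              # hypothetical name
  spec:
    privileged: false               # validated only, never mutated
    allowPrivilegeEscalation: false
    runAsUser:
      rule: MustRunAs               # mutated: UID injected if the Pod omits it
      ranges:
        - min: 54321                # per-tool UID, no global default possible
          max: 54321
    seLinux:
      rule: RunAsAny
    supplementalGroups:
      rule: RunAsAny
    fsGroup:
      rule: RunAsAny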

<commented in the wrong ticket>

The decision about committing to drop the extra component on the upgrade to k8s 1.26 might become way more relevant [..]

This is maybe the wrong ticket? We also have T362233: Decision Request - Toolforge policy agent

yep :S