
PodSecurityPolicies will be deprecated with Kubernetes 1.21
Open, High, Public

Description

Pod Security Policies (PSP) will, starting with Kubernetes 1.21, begin the process of deprecation with the intention to fully remove them in a future release. ...
(Full blog post draft here)
GitHub PR: https://github.com/kubernetes/kubernetes/pull/97171

While we started implementing PSPs in T228967, they never fully made it to our clusters (as of k8s <1.16).
With the Kubernetes 1.16 upgrades we want to implement the recommended restrictions as far as possible without too much effort (effort that we might have to re-spend with the deprecation). Although there are alternative options around already, we still have some time; it can be assumed that those options will evolve in the near future, so we can migrate off of PSPs at a later point.

Things we currently enforce via PSPs (which a replacement needs to provide as well):

  • Prohibit running privileged containers, hostIPC, hostPID, hostNetwork
  • Ensure containers run as non-root
  • Restrict the use of volumes (only specific volume plugins, only specific host paths)
  • Prohibit containers with fsGroup/supplementalGroup of 0
  • Ensure capabilities are dropped

Apart from the "privileged" policy that effectively allows everything (required for components like calico), we only have two other PSPs in wikikube. For details about them, see helmfile.d/admin_ng/helmfile_psp.yaml.
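For orientation, a minimal sketch of a PSP enforcing the restrictions listed above (names, ranges and paths are illustrative; the real definitions live in helmfile.d/admin_ng/helmfile_psp.yaml):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-example           # illustrative name, not the actual policy
spec:
  privileged: false                  # no privileged containers
  hostIPC: false
  hostPID: false
  hostNetwork: false
  runAsUser:
    rule: MustRunAsNonRoot           # containers must run as non-root
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: MustRunAs                  # fsGroup 0 (root group) is not allowed
    ranges: [{min: 1, max: 65535}]
  supplementalGroups:
    rule: MustRunAs
    ranges: [{min: 1, max: 65535}]
  requiredDropCapabilities: ["ALL"]  # ensure capabilities are dropped
  volumes:                           # only specific volume plugins
  - configMap
  - secret
  - emptyDir
  - projected
  - downwardAPI
  - hostPath                         # only where needed, and then restricted to specific paths
  allowedHostPaths:
  - pathPrefix: /usr/share/GeoIP     # illustrative, taken from the mediawiki case below
    readOnly: true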

WMCS tasks about it:

Research about the way forward / alternatives

https://wikitech.wikimedia.org/wiki/User:JMeybohm/PSP_Replacement

Preparation for migrating away from PSPs

Run validation against the restricted PSS profile

Remove seccomp.security.alpha.kubernetes.io/defaultProfileName (a replacement sketch follows at the end of this section)

Remove apparmor.security.beta.kubernetes.io/defaultProfileName

  • Mutating PSP that adds the Pod annotation container.apparmor.security.beta.kubernetes.io/<container_name> for each container in the pod
  • There is no replacement for this functionality
  • dockershim runs all containers with the AppArmor profile docker-default (enforce) by default (regardless of the annotation), so there is no need to add the annotations to each Pod
#!/bin/bash
# check-apparmor.sh - Lists all processes in docker containers not running with the docker-default AppArmor profile
docker ps -q | xargs docker inspect --format '{{.State.Pid}} {{.Name}}' | while read ppid name; do
	pids="${ppid} $(pgrep -P $ppid)"
	for pid in ${pids}; do	# ${pids} already includes the parent pid
		profile=$(cat /proc/$pid/attr/current)
		if [ "${profile}" != "docker-default (enforce)" ]; then
			echo "${name} ${pid} $(tr '\0' ' ' </proc/$pid/cmdline) is running with AppArmor profile ${profile}"
		fi
	done
done
  • Remove the mutating annotation from PSP(s)
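Since the defaultProfileName annotations mutate Pods on admission and PSS/PSA is validation-only, the replacement for the seccomp default is to declare the profile explicitly, presumably in the pod templates of our charts; a minimal sketch of the relevant fragment (placement in the charts is an assumption):

# Pod template fragment: the seccomp profile is set in the spec instead of being
# injected by the PSP annotation. AppArmor needs no equivalent because the runtime
# applies docker-default (enforce) by default, as noted above.
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault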

Event Timeline

JMeybohm created this task.
JMeybohm raised the priority of this task from Low to High. Aug 31 2023, 8:57 AM
JMeybohm updated the task description.

From the documentation I've read, the PSP controller has been replaced by PSA (Pod Security Admission), which implements the three classes of the Pod Security Standards:

  • Privileged - Unrestricted policy, providing the widest possible level of permissions. This policy allows for known privilege escalations.
  • Baseline - Minimally restrictive policy which prevents known privilege escalations. Allows the default (minimally specified) Pod configuration.
  • Restricted - Heavily restricted policy, following current Pod hardening best practices.

The main problem seems to be that one cannot configure those classes: you either take them as they are or implement a custom admission webhook.

From this doc:

Policy level definitions are hardcoded and unconfigurable out of the box. However, the Pod Security Standards leave open ended guidance on a few items, so we must make a static decision on how to handle these elements...

In PSS' profile details I see that some fields are configurable, but again IIUC it is related to what you can configure in Pod/Deployment resources. There seems to be no way to customize anything in the PSS policy itself.

One nice feature is listed in the migration from PSP to PSA guide: we can add the PSA settings to a namespace via labels, in enforce mode or just audit mode.
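For illustration, this is how those labels look on a namespace (the namespace name is hypothetical); enforce rejects violating Pods, while audit/warn only record or warn:

apiVersion: v1
kind: Namespace
metadata:
  name: example-namespace                          # hypothetical
  labels:
    # Reject Pods that violate the restricted profile
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # Or, while migrating: only record violations / warn the client
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted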

The first step suggested is to eliminate mutating fields (not supported anymore); in our case the ones we use are:

seccomp.security.alpha.kubernetes.io/defaultProfileName:  'runtime/default'
apparmor.security.beta.kubernetes.io/defaultProfileName:  'runtime/default'
allowedCapabilities

The second step is to eliminate options not covered by PSA, in our case:

.spec.allowedHostPaths
.spec.fsGroup

Then the next step is to map PSP to PSA using this table. Once we have the target for each namespace, one could start testing the new Admission Controller (which will simply throw warnings if something doesn't meet the chosen criteria).

The current PSPs that we declare are:

  • privileged (that should map well to PSA's privileged)
  • restricted (that should map well to PSA's restricted)
  • mediawiki (that contains some options that are not implemented/allowed in PSA).

For example, in mediawiki we use the hostPaths /usr/share/GeoIP and /usr/share/GeoIPInfo to share them with the pods. We could think about adding the data to every image that uses it, but that would mean roughly +1GB in the final Docker image:

elukey@kubernetes2010:~$ du -hs /usr/share/GeoIP
GeoIP/     GeoIPInfo/ 
elukey@kubernetes2010:~$ sudo du -hs /usr/share/GeoIPInfo/
695M	/usr/share/GeoIPInfo/

We also allow SYS_PTRACE among the capabilities, which is not even allowed in PSA's Baseline class (which could otherwise be a target for mediawiki).

Last but not least, we use fsGroup to avoid the root group being added to the Pod (and to limit privilege escalation, I suppose), but there is no corresponding restriction in Baseline (I think only in Restricted, which may be too much as a replacement for mediawiki).
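To make the friction points concrete, this is roughly the kind of Pod spec fragment the mediawiki policy has to permit (a sketch; container and image names are hypothetical, the host path and capability are the ones mentioned above):

spec:
  securityContext:
    fsGroup: 65533                   # keep the root group off the Pod
  containers:
  - name: mediawiki                  # hypothetical
    image: example/mediawiki:latest  # hypothetical
    securityContext:
      capabilities:
        add: ["SYS_PTRACE"]          # not allowed by PSA Baseline/Restricted
    volumeMounts:
    - name: geoip
      mountPath: /usr/share/GeoIP
      readOnly: true
  volumes:
  - name: geoip
    hostPath:
      path: /usr/share/GeoIP         # hostPath volumes not allowed by PSA Restricted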

The above is only related to production k8s, I didn't check the cloud realm (more info in T279110).

In my opinion we should decide if we want to test/keep/move to PSA (i.e. whether the 3 classes are enough, etc.) or if we need more. In the first case we keep testing and exploring; in the latter we may want to review tools like Open Policy Agent, Kyverno, Kubewarden or GateKeeper (to name a few that I found).

Thanks for putting this together. IIUC the decision we need to make is basically: "Are we okay with running MediaWiki with the Privileged profile?", because all other profiles won't allow SYS_PTRACE, which we definitely need since it's required for slowlogs to work.

Pro for going with PSA:

  • Built-in component, no additional maintenance
  • Migration path seems straight forward
  • No need for another DSL for rule definition (which I suppose 3rd party components have)

Con for going with PSA:

  • We need to run MediaWiki with the Privileged profile
  • We won't be able to selectively allow hostPath mounts in the future

One thing that I didn't get yet is whether we could run PSA alongside Open Policy Agent, which could be a compromise. I suspect there shouldn't be any problem, but it would need to be tested. We'd use base Kubernetes tools for most of the use cases, and for the complex ones like MediaWiki we'd use OPA. One could say "but if you add OPA then we might just use it for everything"; that could be a road of course, but it may also mean additional complexity for people adding a service to k8s (if not automated by us, but then the complexity falls on our shoulders :D).

Running mediawiki in privileged mode seems risky; I'd vote for testing OPA, even if it is a lot of work :(

The Pod Security Standards are available in 1.23, so I tried the dry-run mode to see what the warnings would look like. For example, the following command is clearly wrong, but it is nice to see the warnings:

root@deploy2002:~# kubectl label --dry-run=server --overwrite ns kube-system pod-security.kubernetes.io/enforce=restricted
Warning: existing pods in namespace "kube-system" violate the new PodSecurity enforce level "restricted:latest"
Warning: calico-kube-controllers-6c95b745f5-thbp7 (and 2 other pods): allowPrivilegeEscalation != false, unrestricted capabilities, runAsNonRoot != true, seccompProfile
Warning: calico-node-6c8f5 (and 3 other pods): host namespaces, hostPath volumes, privileged, allowPrivilegeEscalation != false, unrestricted capabilities, restricted volume types, runAsNonRoot != true, seccompProfile
Warning: calico-typha-67b94b488f-dkp7h: host namespaces, hostPort, unrestricted capabilities, seccompProfile
Warning: coredns-644f987cc9-9nnk8 (and 1 other pod): unrestricted capabilities, runAsNonRoot != true, seccompProfile
namespace/kube-system labeled

The mapping to Pod Security Standards should be relatively painless, since our restricted and privileged should map nicely to PSS.

In eqiad:

root@deploy2002:~# kubectl label --dry-run=server --overwrite ns mw-web pod-security.kubernetes.io/enforce=restricted
Warning: existing pods in namespace "mw-web" violate the new PodSecurity enforce level "restricted:latest"
Warning: mw-web.eqiad.canary-76c8f6b969-4wbts (and 26 other pods): non-default capabilities, hostPath volumes, unrestricted capabilities, restricted volume types, runAsNonRoot != true
namespace/mw-web labeled

That confirms what we discussed above, namely that our mediawiki PSP policy clashes with PSS' restricted.

While checking docs about OPA, I came across this:

https://kubernetes.io/blog/2022/12/20/validating-admission-policies-alpha/

IIUC k8s upstream introduced a way to apply custom validations without the need for an extra webhook (like OPA Gatekeeper), but the new feature is alpha in 1.26 and beta in 1.28. More investigation is needed, but it seems possible to have a scenario like the following for Mediawiki namespaces:

  • default Pod Security Standard set to restricted
  • Custom ValidationAdmissionPolicy resources to allow extra bits like hostPath, etc..

I would personally try to spend some time in understanding the ValidationAdmissionPolicy feature before starting a big work of moving all our clusters to OPA Gatekeeper.

This comment was removed by elukey.
JMeybohm moved this task from ⎈Kubernetes to Doing 😎 on the serviceops board.

I would personally try to spend some time in understanding the ValidationAdmissionPolicy feature before starting a big work of moving all our clusters to OPA Gatekeeper.

I did that. Unfortunately we can't permit something in a Validation Admission Policy that is forbidden by PSS (as that check comes first). So "extending" the PSS for the mediawiki namespaces (by allowing ptrace/hostPath again) is unfortunately not an option.

With that I do think we have the following options to proceed:

  1. Exempt mw namespaces (from PSS/PSA), e.g. running without checks (see the config sketch after this list)
  2. Allow privileged in mw namespaces
  3. Use some 3rd party controller like opa-gatekeeper for the mw namespaces
  4. Allow privileged in mw namespaces, but create ValidationAdmissionPolicies basically re-implementing the restricted profile with the exemption of ptrace/hostPath for geoip
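To illustrate option 1: namespace exemptions would be expressed in the PodSecurity admission configuration passed to the kube-apiserver via --admission-control-config-file. A sketch, where the namespace list and the chosen default level are assumptions, and the config apiVersion depends on the k8s release:

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1   # older releases use a beta version of this config API
    kind: PodSecurityConfiguration
    defaults:
      enforce: "restricted"
      enforce-version: "latest"
    exemptions:
      # Pods in these namespaces skip PodSecurity checks entirely
      namespaces: ["mw-web"]                               # illustrative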

The last option will unfortunately leave a gap, as we can't migrate to ValidationAdmissionPolicies early (as in "before the k8s upgrade") because the feature is only available in k8s >=1.26 (>=1.28 really, if we want to avoid alpha). It will probably also be a bit of work to get it right, and we might have to update the ValidationAdmissionPolicies with k8s releases (reflecting changes to the PSS). Overall this looks doable and might be better than relying on some 3rd party thing that might be phased out with ValidationAdmissionPolicies becoming the new standard. I'll try to prove the viability of this idea in a demo setup.

I'm not super convinced of 4... writing ValidationAdmissionPolicies is quite complex and there are so many corner cases. I tried implementing the first restrictions from the baseline profile (I think we need around 10) and it's already huge:

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: "rebuild-baseline"
spec:
  matchConstraints:
    resourceRules:
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments", "statefulsets", "daemonsets"]
    - apiGroups:   ["batch"]   # jobs live in the batch API group, not apps
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["jobs"]
  validations:
  - message: "Containers must drop ALL capabilities and might only add back SYS_PTRACE"
    expression: |
      // Containers must drop `ALL` capabilities,
      (
        object.spec.template.spec.containers +
        (has(object.spec.template.spec.initContainers) ? object.spec.template.spec.initContainers : []) +
        (has(object.spec.template.spec.ephemeralContainers) ? object.spec.template.spec.ephemeralContainers : [])
      ).all(container,
        has(container.securityContext) &&
        has(container.securityContext.capabilities) &&
        has(container.securityContext.capabilities.drop) &&
        size(container.securityContext.capabilities.drop) >= 1 &&
        container.securityContext.capabilities.drop.exists(c, c == 'ALL')
      ) &&
      // and are only permitted to add back the `SYS_PTRACE` capability
      (
        object.spec.template.spec.containers +
        (has(object.spec.template.spec.initContainers) ? object.spec.template.spec.initContainers : []) +
        (has(object.spec.template.spec.ephemeralContainers) ? object.spec.template.spec.ephemeralContainers : [])
      ).all(container,
        !has(container.securityContext) ||
        !has(container.securityContext.capabilities) ||
        !has(container.securityContext.capabilities.add) ||
        container.securityContext.capabilities.add.all(cap, cap == 'SYS_PTRACE')
      )
  - message: "securityContext.runAsNonRoot must be set on Pod or Container level and may not be false"
    expression: |
      // Pod or Containers must set `securityContext.runAsNonRoot`
      (
        (
          has(object.spec.template.spec.securityContext) &&
          has(object.spec.template.spec.securityContext.runAsNonRoot)
        ) ||
        object.spec.template.spec.containers.all(container,
          has(container.securityContext) && has(container.securityContext.runAsNonRoot)) 
        // No need to check initContainer and ephemeralContainer here as container is required
      )
      &&
      // Neither Pod nor Containers should set `securityContext.runAsNonRoot` to false
      (
        (
          // Pod should not set runAsNonRoot to false
          !has(object.spec.template.spec.securityContext) ||
          !has(object.spec.template.spec.securityContext.runAsNonRoot) ||
          object.spec.template.spec.securityContext.runAsNonRoot != false
        ) &&
        (
          (
            object.spec.template.spec.containers +
            (has(object.spec.template.spec.initContainers) ? object.spec.template.spec.initContainers : []) +
            (has(object.spec.template.spec.ephemeralContainers) ? object.spec.template.spec.ephemeralContainers : [])
          ).all(container,
            !has(container.securityContext) ||
            !has(container.securityContext.runAsNonRoot) ||
            container.securityContext.runAsNonRoot != false
          )
        )
      )

Plus this is still not super restrictive as it only covers "deployments", "statefulsets", "daemonsets" and "jobs". We need a second set with the same rules but matching pods, so that we don't have to cover every possible way a pod could be spawned (cronjobs, operators creating pods directly, probably other ways I did not think of yet). That would have the downside of late errors, e.g. deployments would go through but pods won't be created later on. But otoh that downside also applies to the current PSPs, so maybe it's fine to only implement rules for Pods and go with that.
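A sketch of what that second, Pod-matching set could look like (the policy name is made up); the rules are the same, but the CEL expressions operate on object.spec instead of object.spec.template.spec:

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: "rebuild-baseline-pods"    # hypothetical name
spec:
  matchConstraints:
    resourceRules:
    - apiGroups:   [""]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["pods"]
  validations:
  - message: "Containers must drop ALL capabilities and may only add back SYS_PTRACE"
    expression: |
      // Same rule as above, rewritten against the Pod spec
      (
        object.spec.containers +
        (has(object.spec.initContainers) ? object.spec.initContainers : []) +
        (has(object.spec.ephemeralContainers) ? object.spec.ephemeralContainers : [])
      ).all(container,
        has(container.securityContext) &&
        has(container.securityContext.capabilities) &&
        has(container.securityContext.capabilities.drop) &&
        container.securityContext.capabilities.drop.exists(c, c == 'ALL') &&
        (
          !has(container.securityContext.capabilities.add) ||
          container.securityContext.capabilities.add.all(cap, cap == 'SYS_PTRACE')
        )
      )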

I've summarized my findings at https://wikitech.wikimedia.org/wiki/User:JMeybohm/PSP_Replacement @akosiaris, @elukey: I'd like you to take a look and ask questions if you find the time.

@JMeybohm thanks a lot for the great wikipage, it explains the problem very well. The only thing that worries me is the maintenance of those extra policies, since multiple things can fail (Kyverno could stop or change its support, etc.) and it would also add a big and complicated step when upgrading to future k8s versions. I don't see any other way forward though, so I am +1 on your proposal.

Side note: we could try to avoid the hostPath by bundling GeoIP inside the mw Docker images, reducing the scope of the "exceptions" to SYS_PTRACE.

During the SIG meeting we wondered what feedback a deployer would get from PSS vs VAP+CEL. We knew the latter (the Deployment/Pod/etc. resource is allowed to be created, but the corresponding Pods are not created if a policy is breached) but not the former.

I tried a little test on minikube, creating a test namespace and applying the PSS restricted profile to it. I then tried to create a pod object that violates its restrictions, and I got this error straight away:

Error from server (Forbidden): error when creating "test_pod.yaml": pods "test-pd" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "test-container" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "test-container" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "test-volume" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "test-container" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "test-container" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

Posting the yaml as well:

apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
  - image: k8s.gcr.io/test-webserver
    name: test-container
    volumeMounts:
    - mountPath: /test-pd
      name: test-volume
  volumes:
  - name: test-volume
    hostPath:
      # directory location on host
      path: /data
      # this field is optional
      type: DirectoryOrCreate

In my case I wanted to trigger the hostPath restriction but others were breached as well.

The test shows that we'd get two different kinds of feedback for deployers:

  • For namespaces using PSS (probably most of the current ones) we'd get an error while deploying, so the pod resources wouldn't be created.
  • For namespaces using VAP (Mediawiki for the moment) we wouldn't get an error while deploying, but we wouldn't see the pods being created (feedback would come from things like kubectl get events, IIUC).

Not a big deal, but probably worth highlighting at decision time. I still keep my vote to proceed with Janis' solution of course!

Do the PSS give the same early feedback even with Deployment objects?

Tested this random example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        command: ["sh", "-c"]
        args: ["sleep 10000"]
        securityContext:
          privileged: true
          capabilities:
            add: ["SYS_ADMIN"]
          allowPrivilegeEscalation: true
        volumeMounts:
          - name: dynamic-volume
            mountPropagation: "Bidirectional"
            mountPath: "/dynamic-volume"
      volumes:
        - name: dynamic-volume
          hostPath:
            path: /mnt/dynamic-volume
            type: DirectoryOrCreate

And the result was:

Warning: would violate PodSecurity "restricted:latest": privileged (container "nginx" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "nginx" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "nginx" must set securityContext.capabilities.drop=["ALL"]; container "nginx" must not include "SYS_ADMIN" in securityContext.capabilities.add), restricted volume types (volume "dynamic-volume" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "nginx" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "nginx" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

deployment.apps/test-deployment created

So the Deployment resource gets created, with a warning. Then its status is of course not healthy:

NAME              READY   UP-TO-DATE   AVAILABLE   AGE
test-deployment   0/1     0            0           2m33s

And get events shows:

2m59s       Warning   FailedCreate        replicaset/test-deployment-5b8457ff6   Error creating: pods "test-deployment-5b8457ff6-cxss2" is forbidden: violates PodSecurity "restricted:latest": privileged (container "nginx" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "nginx" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "nginx" must set securityContext.capabilities.drop=["ALL"]; container "nginx" must not include "SYS_ADMIN" in securityContext.capabilities.add), restricted volume types (volume "dynamic-volume" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "nginx" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "nginx" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

Change #1015354 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] k8s/apiserver: Add option to configure audit logging

https://gerrit.wikimedia.org/r/1015354

Change #1015354 merged by JMeybohm:

[operations/puppet@production] k8s/apiserver: Add option to configure audit logging

https://gerrit.wikimedia.org/r/1015354

Change #1016721 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] k8s/apiserver: Fix parameter syntax for --audit-log-maxsize

https://gerrit.wikimedia.org/r/1016721

Change #1016721 merged by JMeybohm:

[operations/puppet@production] k8s/apiserver: Fix parameter syntax for --audit-log-maxsize

https://gerrit.wikimedia.org/r/1016721

Change #1016753 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] k8s: Enable audit logging in staging-eqiad

https://gerrit.wikimedia.org/r/1016753

Change #1016753 merged by JMeybohm:

[operations/puppet@production] k8s: Enable audit logging in staging-eqiad

https://gerrit.wikimedia.org/r/1016753

Change #1018950 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] admin_ng: Refactor fetching pspClusterRole for namespaces

https://gerrit.wikimedia.org/r/1018950

Change #1018951 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] admin_ng: Stop adding kubernetes.io/metadata.name namespace label

https://gerrit.wikimedia.org/r/1018951

Change #1018952 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] admin_ng: Enable restriced PSS profile in audit mode in staging

https://gerrit.wikimedia.org/r/1018952

Change #1018950 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Refactor fetching pspClusterRole for namespaces

https://gerrit.wikimedia.org/r/1018950

Change #1018951 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Stop adding kubernetes.io/metadata.name namespace label

https://gerrit.wikimedia.org/r/1018951

Change #1018952 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Enable restriced PSS profile in audit mode in staging

https://gerrit.wikimedia.org/r/1018952

Change #1019282 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::master: Add support for configuring feature gates

https://gerrit.wikimedia.org/r/1019282

I've added a more comprehensive list of @elukey's tests at https://wikitech.wikimedia.org/wiki/User:JMeybohm/PSP_Replacement#Violation_error_handling
Bottom line: with PSPs and VAPs we only get events; with PSS and Kyverno we get additional user warnings (or even full rejections, in the case of Kyverno).

JMeybohm updated the task description.
JMeybohm updated the task description.