
Migrate dse cluster off of Pod Security Policies
Closed, Resolved (Public)

Description

As a prerequisite for the next Kubernetes update, the cluster needs to be migrated from Pod Security Policies (PSP) to Pod Security Standards (PSS).

The process is described on the following page (feel free to extend it where you see fit):
https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/PSP_replacement

Event Timeline

Change #1052701 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] dse: Add securityContext to istio components

https://gerrit.wikimedia.org/r/1052701

Change #1052701 merged by jenkins-bot:

[operations/deployment-charts@master] dse: Add securityContext to istio components

https://gerrit.wikimedia.org/r/1052701

brouberol claimed this task.
brouberol subscribed.

No PSS violations occurred in dse-k8s-eqiad in the last 7 days:

[Screenshot 2024-09-02 at 09.31.12.png]

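The current workloads validate against the restricted PSS, as shown by a server-side dry run of the enforce label on every namespace that already audits it: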

root@deploy1003:~# kubectl get ns -l pod-security.kubernetes.io/audit=restricted -o name | while read ns; do
    kubectl label --dry-run=server --overwrite "$ns" pod-security.kubernetes.io/enforce=restricted;
done
namespace/airflow-test-k8s labeled
namespace/cert-manager labeled
namespace/cloudnative-pg-operator labeled
namespace/datahub labeled
namespace/datahub-next labeled
namespace/datasets-config labeled
namespace/datasets-config-next labeled
namespace/echoserver labeled
namespace/external-services labeled
namespace/flink-operator labeled
namespace/growthbook labeled
namespace/istio-system labeled
namespace/mpic labeled
namespace/mpic-next labeled
namespace/postgresql-test labeled
namespace/rdf-streaming-updater labeled
namespace/spark labeled
namespace/spark-history labeled
namespace/spark-history-test labeled
namespace/spark-operator labeled
namespace/superset labeled
namespace/superset-next labeled
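
The dry run above is clean because every namespace reports only "labeled". As a minimal sketch, assuming the Pod Security admission controller reports existing violations as lines prefixed with "Warning:" (and that kubectl's stderr is captured together with stdout, e.g. with 2>&1), the check could be automated like this. The function name find_pss_warnings is hypothetical.

```python
from typing import List


def find_pss_warnings(output: str) -> List[str]:
    """Return the warning lines found in combined kubectl stdout/stderr text.

    Assumption: violations show up as lines starting with "Warning:".
    """
    return [
        line.strip()
        for line in output.splitlines()
        if line.strip().startswith("Warning:")
    ]


if __name__ == "__main__":
    # Two lines taken from the dry-run output in this task.
    clean = "namespace/airflow-test-k8s labeled\nnamespace/cert-manager labeled\n"
    print(find_pss_warnings(clean))  # prints []
```

An empty list corresponds to the output shown in this task. Any returned line would point at a namespace whose running pods need a securityContext fix before enforcement.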

No PSP other than privileged or restricted is currently applied to any pod in the cluster:

root@deploy1003:~# kubectl get pods -A -o=jsonpath='{range .items[?(@.metadata.annotations.kubernetes\.io/psp!="privileged")]}{@.metadata.namespace}{" "}{@.metadata.annotations.kubernetes\.io/psp}{"\n"}{end}' | sort -u | column -t -s' ' | grep -v 'restricted$'
root@deploy1003:~#
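
The same check can be sketched over the JSON form of the pod list (kubectl get pods -A -o json), which avoids the jsonpath escaping. This is an illustrative equivalent, not the command that was run. It assumes pods without the kubernetes.io/psp annotation can be ignored, and the function name unexpected_psps as well as the "demo" and "custom" sample values are hypothetical.

```python
import json
from typing import Set, Tuple

# PSPs that are expected in the cluster, per the check above.
ALLOWED_PSPS = {"privileged", "restricted"}


def unexpected_psps(pod_list_json: str) -> Set[Tuple[str, str]]:
    """Return distinct (namespace, psp) pairs whose PSP is not an expected one."""
    pods = json.loads(pod_list_json).get("items", [])
    found = set()
    for pod in pods:
        meta = pod.get("metadata", {})
        psp = (meta.get("annotations") or {}).get("kubernetes.io/psp")
        # Assumption: pods without the annotation are skipped.
        if psp is not None and psp not in ALLOWED_PSPS:
            found.add((meta.get("namespace", ""), psp))
    return found


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the cluster.
    sample = json.dumps({"items": [
        {"metadata": {"namespace": "spark",
                      "annotations": {"kubernetes.io/psp": "restricted"}}},
        {"metadata": {"namespace": "demo",
                      "annotations": {"kubernetes.io/psp": "custom"}}},
    ]})
    print(sorted(unexpected_psps(sample)))  # prints [('demo', 'custom')]
```

An empty set matches the empty output of the command above.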

Change #1069943 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] dse-k8s-eqiad: Disable mutating parts of the restricted PSP

https://gerrit.wikimedia.org/r/1069943

Change #1069944 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] dse-k8s-eqiad: Enforce the `restricted` PSS for all namespaces

https://gerrit.wikimedia.org/r/1069944

Change #1069945 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] dse-k8s-eqiad: Disable PSP

https://gerrit.wikimedia.org/r/1069945

Change #1069943 merged by Brouberol:

[operations/deployment-charts@master] dse-k8s-eqiad: Disable mutating parts of the restricted PSP

https://gerrit.wikimedia.org/r/1069943

After disabling the mutating parts of the restricted PSP, I'm going to let things simmer for a while so that any potential issues can surface.

Change #1070202 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: add restrictedSecurityContext to the git-sync initcontainer

https://gerrit.wikimedia.org/r/1070202

Change #1070202 merged by Brouberol:

[operations/deployment-charts@master] airflow: add restrictedSecurityContext to the git-sync initcontainer

https://gerrit.wikimedia.org/r/1070202

Change #1069944 merged by Brouberol:

[operations/deployment-charts@master] dse-k8s-eqiad: Enforce the `restricted` PSS for all namespaces

https://gerrit.wikimedia.org/r/1069944

Change #1071571 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] cloudnative-pg: upgrade operator to v1.24.0

https://gerrit.wikimedia.org/r/1071571

Change #1071571 merged by Brouberol:

[operations/deployment-charts@master] cloudnative-pg: upgrade operator to v1.24.0

https://gerrit.wikimedia.org/r/1071571

Change #1071574 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] cloudnative-pg: include the restricted security context in the test pod

https://gerrit.wikimedia.org/r/1071574

Change #1071574 merged by Brouberol:

[operations/deployment-charts@master] cloudnative-pg: include the restricted security context in the test pod

https://gerrit.wikimedia.org/r/1071574

The restricted PSS has been enforced for all namespaces in dse-k8s-eqiad.
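
For reference, the restricted profile requires each container to disallow privilege escalation, drop all capabilities, run as non-root and use a seccomp profile of type RuntimeDefault or Localhost, which is what the securityContext patches above add. The following is a simplified sketch of that check on a container spec expressed as a dict. It deliberately ignores the other restricted and baseline rules (volume types, host namespaces, added capabilities, and so on), and the function name restricted_gaps is hypothetical.

```python
from typing import List, Optional

ALLOWED_SECCOMP_TYPES = {"RuntimeDefault", "Localhost"}


def restricted_gaps(container: dict,
                    pod_security_context: Optional[dict] = None) -> List[str]:
    """List the restricted-profile fields a container is missing (simplified)."""
    sc = container.get("securityContext") or {}
    pod_sc = pod_security_context or {}
    gaps = []

    # Must be set explicitly to false on the container.
    if sc.get("allowPrivilegeEscalation") is not False:
        gaps.append("allowPrivilegeEscalation")

    # All capabilities must be dropped.
    dropped = (sc.get("capabilities") or {}).get("drop") or []
    if "ALL" not in dropped:
        gaps.append("capabilities.drop")

    # The container-level value wins; otherwise fall back to the pod level.
    if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
        gaps.append("runAsNonRoot")

    seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
    if seccomp.get("type") not in ALLOWED_SECCOMP_TYPES:
        gaps.append("seccompProfile.type")

    return gaps


if __name__ == "__main__":
    compliant = {"securityContext": {
        "allowPrivilegeEscalation": False,
        "capabilities": {"drop": ["ALL"]},
        "runAsNonRoot": True,
        "seccompProfile": {"type": "RuntimeDefault"},
    }}
    print(restricted_gaps(compliant))  # prints []
```

A container with no securityContext at all returns all four fields, which is the situation a chart ends up in when it cannot set seccompProfile.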

Change #1069945 merged by Brouberol:

[operations/puppet@production] dse-k8s-eqiad: Disable PSP

https://gerrit.wikimedia.org/r/1069945

I've run Puppet on both Kubernetes masters:

brouberol@dse-k8s-ctrl1001:~$ sudo run-puppet-agent
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for dse-k8s-ctrl1001.eqiad.wmnet
Info: Applying configuration version '(24e7fa8440) Brouberol - dse-k8s-eqiad: Disable PSP'
Notice: /Stage[main]/K8s::Apiserver/File[/etc/default/kube-apiserver]/content:
--- /etc/default/kube-apiserver	2024-04-16 10:21:14.527805313 +0000
+++ /tmp/puppet-file20240909-2085104-i9dwim	2024-09-09 11:04:24.094744165 +0000
@@ -12,8 +12,8 @@
  --audit-policy-file=/etc/kubernetes/audit-policy.yaml \
  --authorization-mode=Node,RBAC \
  --client-ca-file=/etc/kubernetes/pki/dse__kube-apiserver_server.chain.pem \
- --disable-admission-plugins=PersistentVolumeClaimResize,StorageObjectInUseProtection \
- --enable-admission-plugins=DenyServiceExternalIPs,NodeRestriction,PodSecurityPolicy \
+ --disable-admission-plugins=PersistentVolumeClaimResize,PodSecurityPolicy,StorageObjectInUseProtection \
+ --enable-admission-plugins=DenyServiceExternalIPs,NodeRestriction \
  --etcd-servers=https://dse-k8s-etcd1001.eqiad.wmnet:2379,https://dse-k8s-etcd1002.eqiad.wmnet:2379,https://dse-k8s-etcd1003.eqiad.wmnet:2379 \
  --kubelet-client-certificate=/etc/kubernetes/pki/dse__kube-apiserver-kubelet-client.pem \
  --kubelet-client-key=/etc/kubernetes/pki/dse__kube-apiserver-kubelet-client-key.pem \

Notice: /Stage[main]/K8s::Apiserver/File[/etc/default/kube-apiserver]/content: content changed '{sha256}64c1337dd43e858add75a7f5356eed925f5db86cc243fc2f126835903a29a6eb' to '{sha256}b732a3cc20d16bd1c522bc9d85f967eb8157e7cd717f50823fc66fdf369e5762'
Info: /Stage[main]/K8s::Apiserver/File[/etc/default/kube-apiserver]: Scheduling refresh of Service[kube-apiserver-safe-restart]
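
The diff above moves PodSecurityPolicy from the enabled to the disabled admission plugins. As a rough sketch of how that could be verified from the rendered file contents, the snippet below extracts both plugin lists from a block of text. The function name admission_plugins is hypothetical, and it assumes plugin names consist only of letters and digits.

```python
import re
from typing import Dict, Set

# Matches --enable-admission-plugins=... and --disable-admission-plugins=...
FLAG_PATTERN = re.compile(r"--(enable|disable)-admission-plugins=([\w,]+)")


def admission_plugins(config_text: str) -> Dict[str, Set[str]]:
    """Return the enabled and disabled admission plugins found in the text."""
    plugins: Dict[str, Set[str]] = {"enable": set(), "disable": set()}
    for kind, value in FLAG_PATTERN.findall(config_text):
        plugins[kind].update(name for name in value.split(",") if name)
    return plugins


if __name__ == "__main__":
    # The two added lines from the Puppet diff in this task.
    after = (
        " --disable-admission-plugins=PersistentVolumeClaimResize,"
        "PodSecurityPolicy,StorageObjectInUseProtection \\\n"
        " --enable-admission-plugins=DenyServiceExternalIPs,NodeRestriction \\\n"
    )
    plugins = admission_plugins(after)
    print("PodSecurityPolicy" in plugins["enable"])   # prints False
    print("PodSecurityPolicy" in plugins["disable"])  # prints True
```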

Reopening as we've found out that our Spark operator does not support setting seccompProfile.

Change #1071596 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] spark-operator: enable the definition of securitycontext.seccompProfile for spark containers

https://gerrit.wikimedia.org/r/1071596

Change #1071596 merged by Brouberol:

[operations/deployment-charts@master] spark-operator: enable the definition of securitycontext.seccompProfile for spark containers

https://gerrit.wikimedia.org/r/1071596