
Migrate ml-staging/ml-serve clusters off of Pod Security Policies
Closed, Resolved · Public · 5 Estimated Story Points

Description

As a pre-dependency for the next Kubernetes update, the cluster needs to be migrated from Pod Security Policies to Pod Security Standards.

The process is described at the following page (feel free to extend it where you see fit):
https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/PSP_replacement
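
For context: PSP is a cluster-level admission resource, while PSS is applied per namespace via standard labels, one per mode (warn, audit, enforce). A minimal sketch using the upstream label names (the namespace is just one of ours as an example):

apiVersion: v1
kind: Namespace
metadata:
  name: revscoring-editquality-damaging
  labels:
    # warn/audit only report violations, they don't block pods
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
    # enforce rejects non-compliant pods at admission time
    pod-security.kubernetes.io/enforce: restricted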

Details

Related Changes in Gerrit:
Repo                                        Branch      Lines +/-
operations/puppet                           production  +8 -0
operations/deployment-charts                master      +5 -10
operations/deployment-charts                master      +1 -11
operations/deployment-charts                master      +25 -488
operations/deployment-charts                master      +3 -2
operations/deployment-charts                master      +8 -34
operations/puppet                           production  +16 -0
operations/deployment-charts                master      +2 -2
operations/deployment-charts                master      +5 -0
operations/deployment-charts                master      +3 -4
operations/deployment-charts                master      +215 -0
operations/deployment-charts                master      +7 -0
operations/deployment-charts                master      +16 -0
operations/deployment-charts                master      +14 -0
operations/deployment-charts                master      +6 -6
operations/docker-images/production-images  master      +137 -1
operations/deployment-charts                master      +1 -17
operations/deployment-charts                master      +8 -8
operations/deployment-charts                master      +27 -7
operations/deployment-charts                master      +4 -8
operations/deployment-charts                master      +223 -6
operations/deployment-charts                master      +1 -1
operations/deployment-charts                master      +4 -1
operations/docker-images/production-images  master      +127 -2
operations/deployment-charts                master      +6 -6
operations/docker-images/production-images  master      +1K -2
operations/deployment-charts                master      +1 -0
operations/deployment-charts                master      +11 -4
operations/docker-images/production-images  master      +319 -1
operations/deployment-charts                master      +1 -1
operations/deployment-charts                master      +1 -8
operations/deployment-charts                master      +5 -0
operations/deployment-charts                master      +21 -1
operations/deployment-charts                master      +4 -0
operations/deployment-charts                master      +20 -3
operations/deployment-charts                master      +17 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change #1114423 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: disable PSP mutation for ml-staging-codfw

https://gerrit.wikimedia.org/r/1114423

Change #1114423 merged by Elukey:

[operations/deployment-charts@master] admin_ng: disable PSP mutation for ml-staging-codfw

https://gerrit.wikimedia.org/r/1114423

Change #1115008 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] custom_deploy.d: rework Istio ML's config

https://gerrit.wikimedia.org/r/1115008

Change #1115008 merged by Elukey:

[operations/deployment-charts@master] custom_deploy.d: rework Istio ML's config

https://gerrit.wikimedia.org/r/1115008

Change #1115322 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: enforce restricted PSS on ml-staging-codfw

https://gerrit.wikimedia.org/r/1115322

Change #1115323 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: disable PSP binding for ml-staging-codfw

https://gerrit.wikimedia.org/r/1115323

Change #1115322 merged by Elukey:

[operations/deployment-charts@master] admin_ng: enforce restricted PSS on ml-staging-codfw

https://gerrit.wikimedia.org/r/1115322

I tried to enforce the restricted PSS; this is the result of killing a revscoring damaging pod in staging:

Error creating: pods "enwiki-damaging-predictor-default-00028-deployment-7d76447ztcxk" is forbidden: violates PodSecurity "restricted:latest": seccompProfile (pod or containers "istio-validation", 
                "storage-initializer", "kserve-container", "queue-proxy", "istio-proxy" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

I then tried to move the securityContext for InferenceService (set in the kserve-inference chart) to the pod level, and this was the new error:

Invalid value: "The edited file failed validation": [ValidationError(InferenceService.spec.predictor.securityContext): unknown field "allowPrivilegeEscalation" in 
               io.kserve.serving.v1beta1.InferenceService.spec.predictor.securityContext, ValidationError(InferenceService.spec.predictor.securityContext): unknown field "capabilities" in 
               io.kserve.serving.v1beta1.InferenceService.spec.predictor.securityContext

We want the following:

securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  readOnlyRootFilesystem: true
  seccompProfile:
    type: RuntimeDefault

Indeed, this is the relevant bit of the KServe CRD spec:

securityContext:
  properties:
    fsGroup:
      format: int64
      type: integer
    fsGroupChangePolicy:
      type: string
    runAsGroup:
      format: int64
      type: integer
    runAsNonRoot:
      type: boolean
    runAsUser:
      format: int64
      type: integer
    seLinuxOptions:
      properties:
        level:
          type: string
        role:
          type: string
        type:
          type: string
        user:
          type: string
      type: object
    seccompProfile:
      properties:
        localhostProfile:
          type: string
        type:
          type: string
      required:
      - type
      type: object

And after checking the pod security context spec of upstream Kubernetes, it turns out that those two fields (allowPrivilegeEscalation and capabilities) cannot be set at the pod level at all: they exist only in the container-level SecurityContext.
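
To make the split concrete, here is a minimal sketch (upstream Kubernetes fields only; names and image are placeholders) of which restricted-profile settings live in the pod-level PodSecurityContext versus the per-container SecurityContext:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  securityContext:            # pod-level PodSecurityContext
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: kserve-container
    image: example:latest
    securityContext:          # container-level SecurityContext only
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL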

I tried to set only the seccompProfile settings, but I got a validation error from the kserve webhook:

fails to reconcile predictor: fails to update knative service: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): 
                spec.template.spec.securityContext, spec.template.spec.securityContext.seccompProfile

The above is from knative-serving (confirmed by checking its webhook logs): it seems we need to explicitly allow setting securityContext values via Knative's config-features ConfigMap. The main issue is that our version, 1.7.x, doesn't allow setting seccomp :(

I see that they added it in 1.8.0.
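
For reference, the relevant knob is a feature flag in the config-features ConfigMap of the knative-serving namespace; a sketch with the upstream flag name (our actual values are managed through admin_ng):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-features
  namespace: knative-serving
data:
  # allow users to set securityContext fields in the Knative
  # Service pod spec (seccompProfile support was added in 1.8.x)
  kubernetes.podspec-securitycontext: "enabled"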

Change #1115394 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] knative: backport patch from 1.8.x release

https://gerrit.wikimedia.org/r/1115394

Change #1115394 merged by Elukey:

[operations/docker-images/production-images@master] knative: backport patches from 1.8.x release

https://gerrit.wikimedia.org/r/1115394

Change #1116826 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: set new Docker images for Knative

https://gerrit.wikimedia.org/r/1116826

Change #1116826 merged by Elukey:

[operations/deployment-charts@master] admin_ng: set new Docker images for Knative

https://gerrit.wikimedia.org/r/1116826

Change #1117164 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: allow tuning securityContext on ml-staging's knative

https://gerrit.wikimedia.org/r/1117164

Change #1117164 merged by Elukey:

[operations/deployment-charts@master] admin_ng: allow tuning securityContext on ml-staging's knative

https://gerrit.wikimedia.org/r/1117164

Change #1117207 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] knative: fix patch command and backport for patches for PSS migration

https://gerrit.wikimedia.org/r/1117207

Change #1117207 merged by Elukey:

[operations/docker-images/production-images@master] knative: fix patch command and backport for patches for PSS migration

https://gerrit.wikimedia.org/r/1117207

Change #1117230 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: bump knative serving's default image tags

https://gerrit.wikimedia.org/r/1117230

Change #1117230 merged by Elukey:

[operations/deployment-charts@master] admin_ng: bump knative serving's default image tags

https://gerrit.wikimedia.org/r/1117230

Change #1117492 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] knative: backport https://github.com/knative/serving/pull/13402

https://gerrit.wikimedia.org/r/1117492

Change #1117492 merged by Elukey:

[operations/docker-images/production-images@master] knative: backport https://github.com/knative/serving/pull/13402

https://gerrit.wikimedia.org/r/1117492

All right, finally in staging we have the knative/kserve containers passing the restricted PSS config (if applied to the namespace via label). Last containers standing:

pod or containers "istio-validation", "istio-proxy" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost"

So now it should only be a matter of checking how Istio injects configs for the sidecar containers.

Change #1117939 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kserve-inference: add seccompProfile to the pod security context

https://gerrit.wikimedia.org/r/1117939

Change #1117939 merged by jenkins-bot:

[operations/deployment-charts@master] kserve-inference: add seccompProfile to the pod security context

https://gerrit.wikimedia.org/r/1117939

I have deployed the above change to all the services in ml-staging-codfw.
The following was successfully added to the predictor:

securityContext:
  seccompProfile:
    type: RuntimeDefault

After deployment, I also see the following annotations, which didn't exist before:

container.seccomp.security.alpha.kubernetes.io/kserve-container: runtime/default
container.seccomp.security.alpha.kubernetes.io/queue-proxy: runtime/default
container.seccomp.security.alpha.kubernetes.io/storage-initializer: runtime/default
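
For the record, those annotations are just the legacy (deprecated alpha) representation of the seccomp field; a sketch of the equivalence, not the literal chart output:

metadata:
  annotations:
    container.seccomp.security.alpha.kubernetes.io/kserve-container: runtime/default

# ...is the alpha-annotation form of the GA field:

spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault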

Change #1115323 merged by Elukey:

[operations/deployment-charts@master] admin_ng: disable PSP binding for ml-staging-codfw

https://gerrit.wikimedia.org/r/1115323

The ml-staging-codfw cluster has been migrated! Reporting the IRC conversation with the ML team here as well:

12:00  <elukey> I am going to summarize staging's status for everybody:
12:00  <elukey> - We are trying to move away from the Pod Security Policy configs (PSP) because in the new k8s version they will be removed, in favor of Pod Security Standards (PSS).
12:01  <elukey> - PSS offers 3 profiles, that corresponds to various "classes" of security restrictions. For example, most of our workloads for kserve are in the "restricted" profile.
12:03  <elukey> - The migration is a little complicated in our case since a kserve pod is essentially composed of multiple containers (from various layers, istio knative etc..) and all of them need to have the same security 
                restrictions applied (for example, seccomp profile etc..)
12:03  <elukey> - Why didn't we need it before? Since our PSP config auto-injected those when the pod was created, with PSS we can't do anymore so we need to be explicit.
12:04  <elukey> - So, kserve pod: 2 istio containers, knative queue, kserve-inference, storage-initializer. 
12:05  <elukey> - ml-staging-codfw is running with a patched knative-serving control plane that automatically injects "restricted" settings when the pod are created, and we use some defaults for seccomp at the pod level as 
                well (for example, for the istio containers). 
12:06  <elukey> Lemme know if the above is not clear or missing anything..
12:06  <elukey> the idea is to let it soak in staging for a bit, you do deployments etc.. and verify that all is stable
12:06  <elukey> then when we are confident we move to prod
12:07  <elukey> to complete the staging migration I'd need to merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1115323 as well
12:07  <elukey> to disable PSP, but it should be fine
12:08  <elukey> the "what could go wrong?" part, or things to look for:
12:09  <elukey> - failed deployments and/or "events" reporting issues with security restrictions (like you deploy and the pods don't come up)
12:09  <elukey> - knative-serving acting weird in its new patched version, webhook not working and deployment failing, alerts from the k8s control plane, etc..
12:10  <isaranto> thanks for the summary! it is important to note though that at the moment any deployment we do on production will automatically get these changes
12:12  <elukey> exactly yes, but not the complicated part that is knative
12:12  <elukey> the new deployments will just auto-inject the seccomp profile annotations, that we already do via PSP
12:12  <isaranto> ack, thanks for clarifying
12:12  <elukey> so in theory even for prod it should be a no-op
12:13  <elukey> the rest needs an explicit admin_ng config

Now we wait for a bit while staging is used and checked/tested; then it is just a matter of deploying to prod :)

Change #1121591 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kserve-inference: move pod security settings for seccomp to staging only

https://gerrit.wikimedia.org/r/1121591

Change #1121591 merged by Elukey:

[operations/deployment-charts@master] kserve-inference: move pod security settings for seccomp to staging only

https://gerrit.wikimedia.org/r/1121591

Change #1121599 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: fix articletopic-outlink's settings

https://gerrit.wikimedia.org/r/1121599

Change #1121602 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: apply new Knative docker images only to ml-staging

https://gerrit.wikimedia.org/r/1121602

Change #1121599 merged by Elukey:

[operations/deployment-charts@master] ml-services: fix articletopic-outlink's settings

https://gerrit.wikimedia.org/r/1121599

Change #1121602 merged by Elukey:

[operations/deployment-charts@master] admin_ng: apply new Knative docker images only to ml-staging

https://gerrit.wikimedia.org/r/1121602

Something really weird happened today, after a deployment of a kserve isvc in production. The change forcing the default seccomp profile at the pod level seems to have caused istio-proxy to lose connectivity, basically blackholing all the traffic. It is not clear to me why this is happening, since everything works fine in staging. Maybe it is related to the knative changes? Or maybe something different?

Next steps:

  • Do a more in-depth stress test of staging, like killing random pods etc.
  • Read the istio changelogs to understand how upstream suggests injecting the value; maybe there is some obscure race condition that I am not aware of.

Change #1122129 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: fix drop capabilities

https://gerrit.wikimedia.org/r/1122129

I've killed all pods in ml-staging and I found a separate issue for knative (https://gerrit.wikimedia.org/r/1122129), that is not related to what we have seen in prod.

Change #1122129 merged by Elukey:

[operations/deployment-charts@master] knative-serving: fix drop capabilities

https://gerrit.wikimedia.org/r/1122129

Change #1122636 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kserve-inference: remove the need for the kserve container's securityContext

https://gerrit.wikimedia.org/r/1122636

Change #1122636 merged by Elukey:

[operations/deployment-charts@master] kserve-inference: remove the need for the kserve container's securityContext

https://gerrit.wikimedia.org/r/1122636

Change #1123294 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] knative-serving: backport https://github.com/knative/serving/pull/14363

https://gerrit.wikimedia.org/r/1123294

Change #1123294 merged by Elukey:

[operations/docker-images/production-images@master] knative-serving: backport https://github.com/knative/serving/pull/14363

https://gerrit.wikimedia.org/r/1123294

Change #1123412 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: upgrade knative's docker images on ml-staging-codfw

https://gerrit.wikimedia.org/r/1123412

Change #1123412 merged by Elukey:

[operations/deployment-charts@master] admin_ng: upgrade knative's docker images on ml-staging-codfw

https://gerrit.wikimedia.org/r/1123412

New knative version deployed to staging; tested the removal of the kserve container's securityContext (since it is now automatically injected by knative) and it worked.

I also found https://github.com/istio/istio/issues/35894#issuecomment-1511634924, which basically suggests setting seccomp at the pod level (in our case, in the isvc's config) to make istio-{validation,proxy} work. We tried it in staging and it works; in production it caused traffic to be blackholed by the istio sidecar.

There are some differences between staging and prod:

  1. staging is running the new knative stack, which could play a role, but it doesn't inject anything into the istio containers, so I don't see a link between the two.
  2. staging runs entirely on Bookworm, while production has a mixture of Bullseye and Bookworm. IIUC the seccomp profile being used comes from https://github.com/moby/moby/commits/master/profiles/seccomp/default.json, and the kernel version could surely play a role.

I double-checked via /proc/$pid/status that the Seccomp fields are the same for an istio/envoy container in both staging and production (namely, both list a seccomp filter in use; I haven't found yet how to check whether it is the same filter or not).
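
For reference, these are the relevant lines of /proc/$pid/status (standard kernel fields; the inline comments are mine, and Seccomp_filters requires a 5.9+ kernel):

Seccomp:	2           # 0=disabled, 1=strict, 2=filter mode
Seccomp_filters:	1   # count of attached filters (doesn't identify them)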

@klausman @isarantopoulos @achou The only thing that I can think of is the following:

  1. depool eqiad or codfw from inference.discovery.wmnet
  2. manually change an isvc in the depooled DC, and verify the problem with more time (what happens, errors, etc..)
  3. restore and repool once done

I think that's a good plan. The switch codfw->eqiad is slated for March 21st. If we waited that long, we would have the advantage that the then-backup DC (codfw) has a lower base load (since single-homed services tend to be in eqiad). But on the other hand, I don't think that is strictly necessary, and the sooner we can figure this discrepancy out, the better. Wdyt?

@klausman we can check which inference DC takes the majority of the traffic and then depool the other one for a couple of hours; it shouldn't be a big deal, capacity-wise we are able to handle all traffic from one DC.

SGTM. Judging by this graph:
https://thanos.wikimedia.org/graph?g0.expr=sum%20by%20(site)%20(rate(container_network_transmit_bytes_total%7Bjob%3D%22k8s-node-cadvisor%22%2C%20instance%3D~%22ml-serve.*%22%7D%5B5m%5D))&g0.tab=0&g0.stacked=0&g0.range_input=2d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g0.end_input=2025-03-03%2010%3A38%3A08&g0.moment_input=2025-03-03%2010%3A38%3A08

The DC currently serving more traffic is codfw, though eqiad is close and occasionally spikes harder. So I'd say we drain eqiad and do our testing there.

Difference in Docker's default seccomp profile between Bullseye (docker.io v20.10.5) and Bookworm (docker.io v20.10.24):

~/github/moby$ git diff v20.10.5 v20.10.24 -- ./profiles/seccomp/default.json
diff --git a/profiles/seccomp/default.json b/profiles/seccomp/default.json
index 4213799ddb..dbf1fa86f1 100644
--- a/profiles/seccomp/default.json
+++ b/profiles/seccomp/default.json
@@ -126,6 +126,7 @@
                                "ftruncate64",
                                "futex",
                                "futex_time64",
+                               "futex_waitv",
                                "futimesat",
                                "getcpu",
                                "getcwd",
@@ -182,6 +183,9 @@
                                "io_uring_setup",
                                "ipc",
                                "kill",
+                               "landlock_add_rule",
+                               "landlock_create_ruleset",
+                               "landlock_restrict_self",
                                "lchown",
                                "lchown32",
                                "lgetxattr",
@@ -199,6 +203,7 @@
                                "madvise",
                                "membarrier",
                                "memfd_create",
+                               "memfd_secret",
                                "mincore",
                                "mkdir",
                                "mkdirat",
@@ -246,6 +251,7 @@
                                "preadv",
                                "preadv2",
                                "prlimit64",
+                               "process_mrelease",
                                "pselect6",
                                "pselect6_time64",
                                "pwrite64",
@@ -591,6 +597,7 @@
                        "names": [
                                "bpf",
                                "clone",
+                               "clone3",
                                "fanotify_init",
                                "fsconfig",
                                "fsmount",
@@ -598,11 +605,13 @@
                                "fspick",
                                "lookup_dcookie",
                                "mount",
+                               "mount_setattr",
                                "move_mount",
                                "name_to_handle_at",
                                "open_tree",
                                "perf_event_open",
                                "quotactl",
+                               "quotactl_fd",
                                "setdomainname",
                                "sethostname",
                                "setns",
@@ -670,6 +679,21 @@
                                ]
                        }
                },
+               {
+                       "names": [
+                               "clone3"
+                       ],
+                       "action": "SCMP_ACT_ERRNO",
+                       "errnoRet": 38,
+                       "args": [],
+                       "comment": "",
+                       "includes": {},
+                       "excludes": {
+                               "caps": [
+                                       "CAP_SYS_ADMIN"
+                               ]
+                       }
+               },
                {
                        "names": [
                                "reboot"

The problem should be https://github.com/istio/istio/issues/44244. Bullseye's Docker version doesn't know about clone3 at all, so its default seccomp profile rejects the syscall with EPERM; that was reported to break the creation of Envoy's threads/processes, since glibc only falls back to plain clone when it gets ENOSYS, which is exactly what the newer profile's clone3 stanza above returns (errnoRet 38). It fits with what we are seeing, because staging is entirely on Bookworm and prod is not.

So good news, staging seems to work fine and we can keep testing it. Before proceeding to prod we'll need to upgrade to Bookworm :)

elukey changed the task status from Open to Stalled. Mar 10 2025, 8:20 AM

This task is stalled until T387854 is completed.

Change #1133315 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: add seccomp profile to editquality-reverted in codfw

https://gerrit.wikimedia.org/r/1133315

Change #1133315 merged by Elukey:

[operations/deployment-charts@master] ml-services: add seccomp profile to editquality-reverted in codfw

https://gerrit.wikimedia.org/r/1133315

I tried to deploy https://gerrit.wikimedia.org/r/1133315 to a single NS on ml-serve-codfw, and ended up with pods not getting restarted. I killed one to force a re-creation, but the pod didn't pass the Init step.

In events I found the following:

root@deploy1003:/srv/deployment-charts/helmfile.d/ml-services/revscoring-editquality-reverted# kubectl get events -n revscoring-editquality-reverted 
LAST SEEN   TYPE      REASON              OBJECT                                                                   MESSAGE
3m13s       Normal    Scheduled           pod/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5c8n697      Successfully assigned revscoring-editquality-reverted/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5c8n697 to ml-serve2008.codfw.wmnet
3m11s       Normal    Pulled              pod/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5c8n697      Container image "docker-registry.discovery.wmnet/istio/proxyv2:1.15.7-2" already present on machine
3m11s       Normal    Created             pod/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5c8n697      Created container istio-validation
3m11s       Normal    Started             pod/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5c8n697      Started container istio-validation
3m10s       Normal    Pulled              pod/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5c8n697      Container image "docker-registry.discovery.wmnet/kserve-storage-initializer:0.11.2-4" already present on machine
3m10s       Normal    Created             pod/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5c8n697      Created container storage-initializer
3m10s       Normal    Started             pod/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5c8n697      Started container storage-initializer
3m13s       Normal    SuccessfulCreate    replicaset/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5cf   Created pod: bnwiki-reverted-predictor-default-00023-deployment-9c6bd5c8n697
2m43s       Warning   Unhealthy           pod/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5cprdpv      Readiness probe failed: Get "http://10.194.22.65:15021/healthz/ready": dial tcp 10.194.22.65:15021: connect: connection refused
3m13s       Normal    Killing             pod/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5cprdpv      Stopping container kserve-container
3m13s       Normal    Killing             pod/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5cprdpv      Stopping container istio-proxy
2m41s       Normal    Killing             pod/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5cprdpv      Stopping container queue-proxy
3m9s        Warning   Unhealthy           pod/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5cprdpv      Readiness probe failed: HTTP probe failed with statuscode: 503
3m8s        Warning   FailedPreStopHook   pod/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5cprdpv      HTTP lifecycle hook (/wait-for-drain) for Container "kserve-container" in Pod "bnwiki-reverted-predictor-default-00023-deployment-9c6bd5cprdpv_revscoring-editquality-reverted(ef06a517-37f8-4331-bf49-de7be71b7b06)" failed - error: Get "http://10.194.22.65:8022/wait-for-drain": dial tcp 10.194.22.65:8022: connect: connection refused, message: ""
2m45s       Warning   Unhealthy           pod/bnwiki-reverted-predictor-default-00023-deployment-9c6bd5cprdpv      Readiness probe failed: Get "http://10.194.22.65:15020/app-health/queue-proxy/readyz": dial tcp 10.194.22.65:15020: connect: connection refused
3m12s       Warning   InternalError       revision/bnwiki-reverted-predictor-default-00023                         failed to update deployment "bnwiki-reverted-predictor-default-00023-deployment": Operation cannot be fulfilled on deployments.apps "bnwiki-reverted-predictor-default-00023-deployment": the object has been modified; please apply your changes to the latest version and try again
107s        Warning   InternalError       inferenceservice/bnwiki-reverted                                         fails to reconcile predictor: fails to update knative service: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.securityContext, spec.template.spec.securityContext.seccompProfile
106s        Warning   InternalError       inferenceservice/elwiki-reverted                                         fails to reconcile predictor: fails to update knative service: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.securityContext, spec.template.spec.securityContext.seccompProfile
104s        Warning   InternalError       inferenceservice/enwiktionary-reverted                                   fails to reconcile predictor: fails to update knative service: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.securityContext, spec.template.spec.securityContext.seccompProfile
103s        Warning   InternalError       inferenceservice/glwiki-reverted                                         fails to reconcile predictor: fails to update knative service: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.securityContext, spec.template.spec.securityContext.seccompProfile
102s        Warning   InternalError       inferenceservice/hrwiki-reverted                                         fails to reconcile predictor: fails to update knative service: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.securityContext, spec.template.spec.securityContext.seccompProfile
101s        Warning   InternalError       inferenceservice/idwiki-reverted                                         fails to reconcile predictor: fails to update knative service: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.securityContext, spec.template.spec.securityContext.seccompProfile
99s         Warning   InternalError       inferenceservice/iswiki-reverted                                         fails to reconcile predictor: fails to update knative service: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.securityContext, spec.template.spec.securityContext.seccompProfile
98s         Warning   InternalError       inferenceservice/tawiki-reverted                                         fails to reconcile predictor: fails to update knative service: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.securityContext, spec.template.spec.securityContext.seccompProfile
97s         Warning   InternalError       inferenceservice/viwiki-reverted                                         fails to reconcile predictor: fails to update knative service: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.securityContext, spec.template.spec.securityContext.seccompProfile

My understanding is that Knative is now complaining about the new security context field, so we'll need the new patched version (currently running in staging) when moving to PSS.

Change #1139865 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: Update Knative on ml-serve-codfw

https://gerrit.wikimedia.org/r/1139865

Change #1139865 merged by Elukey:

[operations/deployment-charts@master] admin_ng: Update Knative on ml-serve-codfw

https://gerrit.wikimedia.org/r/1139865

Change #1140086 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: allow to set seccomp for Knative-based pods on ml-serve-codfw

https://gerrit.wikimedia.org/r/1140086

Change #1140086 merged by Elukey:

[operations/deployment-charts@master] admin_ng: allow to set seccomp for Knative-based pods on ml-serve-codfw

https://gerrit.wikimedia.org/r/1140086

elukey@deploy1003:~$ httpbb --hosts inference.svc.codfw.wmnet --https_port 30443 /srv/deployment/httpbb-tests/liftwing/production/test_revscoring-editquality-reverted.yaml 
Sending to inference.svc.codfw.wmnet...
PASS: 9 requests sent to inference.svc.codfw.wmnet. All assertions passed.

It turns out that https://gerrit.wikimedia.org/r/1140086 was needed alongside the new knative Docker images to allow seccomp to be modified/injected.

Change #1140120 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: enable seccomp defaults for ml-serve-codfw's isvcs

https://gerrit.wikimedia.org/r/1140120

Next steps:

  • enable seccomp default settings for all ml-serve-codfw isvcs (https://gerrit.wikimedia.org/r/1140120)
  • Flip secure-pod-defaults: "enabled" on ml-serve-codfw's knative settings (see the sketch after this list), and kill/respawn all pods so they pick up the new correct security settings (all the isvcs for sure). This is probably worth doing with codfw depooled in inference (LVS).
  • Test that all isvcs are running and then repool.

Once the above is done, we'll be able to test https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/PSP_replacement and move ml-serve-codfw to PSS.
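
For reference, a sketch of the secure-pod-defaults flip (same config-features ConfigMap as before; the flag itself comes with the knative/serving#14363 backport above, and our real values live in admin_ng):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-features
  namespace: knative-serving
data:
  # have Knative inject restricted-PSS-friendly defaults
  # (seccompProfile, dropped capabilities, etc.) into created pods
  secure-pod-defaults: "enabled"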

Change #1140120 merged by Elukey:

[operations/deployment-charts@master] ml-services: enable seccomp defaults for ml-serve-codfw's isvcs

https://gerrit.wikimedia.org/r/1140120

Change #1140140 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: enable Knative's secure-pod-defaults for ml-serve-codfw

https://gerrit.wikimedia.org/r/1140140

Change #1140140 merged by Elukey:

[operations/deployment-charts@master] admin_ng: enable Knative's secure-pod-defaults for ml-serve-codfw

https://gerrit.wikimedia.org/r/1140140

Mentioned in SAL (#wikimedia-operations) [2025-05-05T09:38:26Z] <elukey> depool inference/codfw from DNS discovery to safely apply new pod/container security settings - T369493

Summary of what I've done:

  • recycled all the isvc pods:

    for namespace in `kubectl get ns | egrep -v "(NAME|default|cert-manager|external-services|istio-system|knative-serving|kserve|kube-node-lease|kube-public|kube-system)" | cut -d " " -f 1`; do
        echo $namespace
        for pod in `kubectl get pods -n ${namespace} | grep -v NAME | cut -d " " -f 1`; do
            echo $pod
            kubectl delete pod $pod -n $namespace --grace-period 5
        done
    done
  • tested all endpoints via httpbb
  • repooled codfw

The following worked fine on ml-serve-codfw:

root@deploy1003:~# kubectl get ns -l pod-security.kubernetes.io/audit=restricted -o name | while read ns; do
    kubectl label --dry-run=server --overwrite "$ns" pod-security.kubernetes.io/enforce=restricted;
done
namespace/article-descriptions labeled
namespace/article-models labeled
namespace/articletopic-outlink labeled
namespace/cert-manager labeled
namespace/experimental labeled
namespace/external-services labeled
namespace/istio-system labeled
namespace/knative-serving labeled
namespace/kserve labeled
namespace/llm labeled
namespace/logo-detection labeled
namespace/ores-legacy labeled
namespace/readability labeled
namespace/recommendation-api-ng labeled
namespace/revertrisk labeled
namespace/revision-models labeled
namespace/revscoring-articlequality labeled
namespace/revscoring-articletopic labeled
namespace/revscoring-draftquality labeled
namespace/revscoring-drafttopic labeled
namespace/revscoring-editquality-damaging labeled
namespace/revscoring-editquality-goodfaith labeled
namespace/revscoring-editquality-reverted labeled

Change #1141858 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: disable PSP mutations for ml-serve-codfw

https://gerrit.wikimedia.org/r/1141858

Change #1141858 merged by Elukey:

[operations/deployment-charts@master] admin_ng: disable PSP mutations for ml-serve-codfw

https://gerrit.wikimedia.org/r/1141858

elukey changed the task status from Stalled to Open. May 5 2025, 1:51 PM

Change #1141910 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: enforce PSS on ml-serve-codfw

https://gerrit.wikimedia.org/r/1141910

Change #1141910 merged by Elukey:

[operations/deployment-charts@master] admin_ng: enforce PSS on ml-serve-codfw

https://gerrit.wikimedia.org/r/1141910

Change #1141928 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] kubernetes: disable PSP for ml-serve-codfw and ml-staging-codfw

https://gerrit.wikimedia.org/r/1141928

High level procedure for eqiad (same steps already applied to codfw): update the Knative images, enable podspec-securitycontext and secure-pod-defaults, set the seccomp defaults in the kserve-inference chart, recycle all isvc pods, then disable PSP and enforce PSS.

Change #1141928 merged by Elukey:

[operations/puppet@production] kubernetes: disable PSP for ml-serve-codfw and ml-staging-codfw

https://gerrit.wikimedia.org/r/1141928

Change #1151596 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: Update knative-serving image versions for ml-serve-eqiad

https://gerrit.wikimedia.org/r/1151596

Change #1151597 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: enable podspec-securitycontext for all knative clusters

https://gerrit.wikimedia.org/r/1151597

Change #1151600 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kserve-inference: set seccomp defaults in the chart

https://gerrit.wikimedia.org/r/1151600

Change #1151604 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: set secure-pod-defaults to "enabled" for knative clusters

https://gerrit.wikimedia.org/r/1151604

Change #1151596 merged by Elukey:

[operations/deployment-charts@master] admin_ng: Update knative-serving image versions for ml-serve-eqiad

https://gerrit.wikimedia.org/r/1151596

Change #1151597 merged by Elukey:

[operations/deployment-charts@master] admin_ng: enable podspec-securitycontext for all knative clusters

https://gerrit.wikimedia.org/r/1151597

Change #1151600 merged by Elukey:

[operations/deployment-charts@master] kserve-inference: set seccomp defaults in the chart

https://gerrit.wikimedia.org/r/1151600

Change #1151604 merged by Elukey:

[operations/deployment-charts@master] admin_ng: set secure-pod-defaults to "enabled" for knative clusters

https://gerrit.wikimedia.org/r/1151604

@klausman since today was very quiet for ML, I took the opportunity to apply all the changes listed in T369493#10792884 (including recycling all the isvc pods in ml-serve-eqiad).

In theory ml-serve-eqiad is now ready for the steps in https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/PSP_replacement (we have already done them on ml-serve-codfw and ml-staging, so we can proceed quickly).

Change #1152190 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: disable PSP and enable PSS for ml-serve-eqiad

https://gerrit.wikimedia.org/r/1152190

Change #1152190 merged by Elukey:

[operations/deployment-charts@master] admin_ng: disable PSP and enable PSS for ml-serve-eqiad

https://gerrit.wikimedia.org/r/1152190

Change #1152194 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] kubernetes: disable PSP for ml-serve-eqiad

https://gerrit.wikimedia.org/r/1152194

Change #1152194 merged by Elukey:

[operations/puppet@production] kubernetes: disable PSP for ml-serve-eqiad

https://gerrit.wikimedia.org/r/1152194

Recycled all the pods in ml-serve-eqiad to be sure; no PSS violations registered. Migration completed!