Update all helm modules and charts to be compatible with the restricted PSS
Closed, Resolved (Public)

Description

We need to update all helm chart modules (and, of course, all charts) to be compatible with the restricted Pod Security Standards (PSS) profile.

As far as I can tell right now, this mostly means adding a proper securityContext to all containers:

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
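
When reviewing a chart's rendered output, the four settings above can be checked mechanically. The following is a minimal Python sketch, not part of the chart tooling: `restricted_violations` is a hypothetical helper that takes one container's securityContext as a plain dict and reports which of those settings are missing. It only looks at the container level; the restricted profile also accepts runAsNonRoot and seccompProfile at the pod level, and it has further requirements that this sketch does not cover.

```python
def restricted_violations(security_context):
    """Return the settings from the snippet above that a container lacks."""
    ctx = security_context or {}
    problems = []
    if ctx.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")
    dropped = (ctx.get("capabilities") or {}).get("drop") or []
    if "ALL" not in dropped:
        problems.append("capabilities.drop must include ALL")
    if ctx.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot must be true")
    seccomp = (ctx.get("seccompProfile") or {}).get("type")
    if seccomp not in ("RuntimeDefault", "Localhost"):
        problems.append("seccompProfile.type must be RuntimeDefault or Localhost")
    return problems


# The securityContext from the description, expressed as a dict.
compliant = {
    "allowPrivilegeEscalation": False,
    "capabilities": {"drop": ["ALL"]},
    "runAsNonRoot": True,
    "seccompProfile": {"type": "RuntimeDefault"},
}
print(restricted_violations(compliant))  # no problems reported
print(restricted_violations({}))         # one message per missing setting
```

A container with no securityContext at all fails every check, which is the situation most charts started from.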

There is another "update everything" task, T346638: Rename the envoy's uses_ingress option to sets_sni, which is worth cross-checking for synergy effects.

Missing charts/deployments:

  • spark-operator @BTullis || @brouberol
  • mediawiki-dev (we probably don't really need to do this, but might be wise for consistency)
  • mediawiki

Details

Repo | Branch | Lines +/-
operations/deployment-charts | master | +22 -194
operations/deployment-charts | master | +13 -0
operations/deployment-charts | master | +17 -0
operations/deployment-charts | master | +9 -0
operations/deployment-charts | master | +9 -0
operations/deployment-charts | master | +9 -0
operations/deployment-charts | master | +2 -1
operations/deployment-charts | master | +10 -3
operations/deployment-charts | master | +0 -19
operations/deployment-charts | master | +208 -59
operations/alerts | master | +3 -3
operations/deployment-charts | master | +59 -198
operations/deployment-charts | master | +198 -59
operations/deployment-charts | master | +5 -15
operations/deployment-charts | master | +15 -5
operations/deployment-charts | master | +17 -2
operations/deployment-charts | master | +22 -2
operations/deployment-charts | master | +22 -5
operations/deployment-charts | master | +2 -1
operations/deployment-charts | master | +163 -64
operations/deployment-charts | master | +75 -6
operations/deployment-charts | master | +30 -13
operations/deployment-charts | master | +51 -29
operations/deployment-charts | master | +34 -14
operations/deployment-charts | master | +12 -1
operations/deployment-charts | master | +31 -13
operations/deployment-charts | master | +250 -108
operations/deployment-charts | master | +26 -9
operations/deployment-charts | master | +262 -115
operations/deployment-charts | master | +291 -126
operations/deployment-charts | master | +266 -103
operations/deployment-charts | master | +272 -108
operations/deployment-charts | master | +175 -85
operations/deployment-charts | master | +1 K -1 K
operations/deployment-charts | master | +286 -128
operations/deployment-charts | master | +9 -1
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +141 -43
operations/deployment-charts | master | +141 -43
operations/deployment-charts | master | +162 -49
operations/deployment-charts | master | +235 -114
operations/deployment-charts | master | +48 -27
operations/deployment-charts | master | +287 -102
operations/deployment-charts | master | +29 -10
operations/deployment-charts | master | +29 -10
operations/deployment-charts | master | +144 -79
operations/deployment-charts | master | +149 -82
operations/deployment-charts | master | +277 -179
operations/deployment-charts | master | +159 -96
operations/deployment-charts | master | +201 -102
operations/deployment-charts | master | +27 -10
operations/deployment-charts | master | +270 -143
operations/deployment-charts | master | +172 -79
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +79 -172
operations/deployment-charts | master | +172 -79
operations/deployment-charts | master | +140 -66
operations/deployment-charts | master | +31 -16
operations/deployment-charts | master | +161 -96
operations/deployment-charts | master | +67 -19
operations/deployment-charts | master | +257 -139
operations/deployment-charts | master | +82 -26
operations/deployment-charts | master | +24 -9
operations/deployment-charts | master | +138 -71
operations/deployment-charts | master | +25 -9
operations/deployment-charts | master | +151 -86
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +78 -0
operations/deployment-charts | master | +24 -9
operations/deployment-charts | master | +32 -7
operations/deployment-charts | master | +42 -37
operations/deployment-charts | master | +1 K -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change #1037196 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] chromium-render: add securityContext to all containers

https://gerrit.wikimedia.org/r/1037196

Change #1037615 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox: add securityContext to all containers

https://gerrit.wikimedia.org/r/1037615

Change #1037861 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] eventstreams: add securityContext to all production containers

https://gerrit.wikimedia.org/r/1037861

Change #1030190 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: add securityContext to all containers

https://gerrit.wikimedia.org/r/1030190

Change #1031105 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: ensure all containers have securityContext

https://gerrit.wikimedia.org/r/1031105

Change #1039727 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] flink-app: Update various modules

https://gerrit.wikimedia.org/r/1039727

Change #1037194 abandoned by Scott French:

[operations/deployment-charts@master] similar-users: add securityContext to all containers

Reason:

Turndown planned in https://phabricator.wikimedia.org/T345274

https://gerrit.wikimedia.org/r/1037194

Change #1039727 merged by jenkins-bot:

[operations/deployment-charts@master] flink-app: Update various modules

https://gerrit.wikimedia.org/r/1039727

Change #1032525 merged by jenkins-bot:

[operations/deployment-charts@master] miscweb: Update various modules

https://gerrit.wikimedia.org/r/1032525

Change #1037196 merged by jenkins-bot:

[operations/deployment-charts@master] chromium-render: add securityContext to all containers

https://gerrit.wikimedia.org/r/1037196

Change #1040220 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] admin_ng: bump CPU resourcequota for proton

https://gerrit.wikimedia.org/r/1040220

Change #1040221 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] proton: drop replicas from 12 to 10

https://gerrit.wikimedia.org/r/1040221

Change #1032519 merged by jenkins-bot:

[operations/deployment-charts@master] push-notifications: add securityContext to all containers

https://gerrit.wikimedia.org/r/1032519

Change #1037163 merged by jenkins-bot:

[operations/deployment-charts@master] function-orchestrator: ensure all containers have securityContext

https://gerrit.wikimedia.org/r/1037163

Change #1037162 merged by jenkins-bot:

[operations/deployment-charts@master] function-evaluator: ensure all containers have securityContext

https://gerrit.wikimedia.org/r/1037162

Change #1041076 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] calculator-service: add securityContext to all containers

https://gerrit.wikimedia.org/r/1041076

Change #1041070 merged by Brouberol:

[operations/deployment-charts@master] spark-history: add securityContext to all containers

https://gerrit.wikimedia.org/r/1041070

Change #1041071 merged by Brouberol:

[operations/deployment-charts@master] echoserver: add securityContext to all containers

https://gerrit.wikimedia.org/r/1041071

Change #1041119 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] datasets-config: add securityContext to all containers

https://gerrit.wikimedia.org/r/1041119

Change #1041120 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] mpic: add securityContext to all containers

https://gerrit.wikimedia.org/r/1041120

Change #1041120 merged by Brouberol:

[operations/deployment-charts@master] mpic: add securityContext to all containers

https://gerrit.wikimedia.org/r/1041120

Change #1041119 merged by Brouberol:

[operations/deployment-charts@master] datasets-config: add securityContext to all containers

https://gerrit.wikimedia.org/r/1041119

Change #1040221 merged by jenkins-bot:

[operations/deployment-charts@master] proton: drop replicas from 12 to 10

https://gerrit.wikimedia.org/r/1040221

Change #1040220 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: bump CPU resourcequota for proton

https://gerrit.wikimedia.org/r/1040220

Change #1041161 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] flink-operator: add securityContext

https://gerrit.wikimedia.org/r/1041161

Change #1037165 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: ensure all containers have securityContext

https://gerrit.wikimedia.org/r/1037165

Change #1041049 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: add securityContext to all containers

https://gerrit.wikimedia.org/r/1041049

Change #1041039 merged by jenkins-bot:

[operations/deployment-charts@master] developer-portal: add securityContext to all containers

https://gerrit.wikimedia.org/r/1041039

Change #1041055 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: add securityContext to all containers

https://gerrit.wikimedia.org/r/1041055

Change #1041072 merged by jenkins-bot:

[operations/deployment-charts@master] python-webapp: add securityContext to all containers

https://gerrit.wikimedia.org/r/1041072

Change #1037164 merged by jenkins-bot:

[operations/deployment-charts@master] wikifeeds: ensure all containers have securityContext

https://gerrit.wikimedia.org/r/1037164

Change #1041076 merged by jenkins-bot:

[operations/deployment-charts@master] calculator-service: add securityContext to all containers

https://gerrit.wikimedia.org/r/1041076

Change #1037193 merged by jenkins-bot:

[operations/deployment-charts@master] termbox: add securityContext to all containers

https://gerrit.wikimedia.org/r/1037193

Change #1041161 merged by jenkins-bot:

[operations/deployment-charts@master] flink-operator: add securityContext

https://gerrit.wikimedia.org/r/1041161

Change #1037615 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: add securityContext to all containers

https://gerrit.wikimedia.org/r/1037615

Change #1038859 merged by jenkins-bot:

[operations/deployment-charts@master] mcrouter: Bump chart modules

https://gerrit.wikimedia.org/r/1038859

Change #1037861 merged by jenkins-bot:

[operations/deployment-charts@master] eventstreams: add securityContext to all production containers

https://gerrit.wikimedia.org/r/1037861

Change #1037195 merged by jenkins-bot:

[operations/deployment-charts@master] kask: add securityContext to all containers

https://gerrit.wikimedia.org/r/1037195

Change #1042256 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] toolhub: Add missing securityContext to CronJob

https://gerrit.wikimedia.org/r/1042256

Change #1032714 merged by jenkins-bot:

[operations/deployment-charts@master] Global update of test-service-checker template

https://gerrit.wikimedia.org/r/1032714

Change #1042256 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Add missing securityContext to CronJob

https://gerrit.wikimedia.org/r/1042256

Change #1042440 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mediawiki: add securityContext to all containers

https://gerrit.wikimedia.org/r/1042440

Change #1042838 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] spark-operator: add securityContext to all containers

https://gerrit.wikimedia.org/r/1042838

Change #1042838 merged by Brouberol:

[operations/deployment-charts@master] spark-operator: add securityContext to all containers

https://gerrit.wikimedia.org/r/1042838

Change #1043846 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mediawiki-dev: add securityContext to all containers

https://gerrit.wikimedia.org/r/1043846

Change #1043846 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki-dev: add securityContext to all containers

https://gerrit.wikimedia.org/r/1043846

Change #1046692 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mediawiki: enable securityContext in all canaries

https://gerrit.wikimedia.org/r/1046692

Change #1046693 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mediawiki: enable securityContext everywhere

https://gerrit.wikimedia.org/r/1046693

Change #1026954 merged by jenkins-bot:

[operations/deployment-charts@master] kserve-inference: add securityContext explicit config

https://gerrit.wikimedia.org/r/1026954

Change #1042440 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: add securityContext to all containers

https://gerrit.wikimedia.org/r/1042440

Change #1046692 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: enable securityContext in all canaries

https://gerrit.wikimedia.org/r/1046692

Change #1049246 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] Revert "mediawiki: enable securityContext in all canaries"

https://gerrit.wikimedia.org/r/1049246

Change #1049246 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "mediawiki: enable securityContext in all canaries"

https://gerrit.wikimedia.org/r/1049246

Change #1049248 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] Revert "mediawiki: add securityContext to all containers"

https://gerrit.wikimedia.org/r/1049248

Change #1049248 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "mediawiki: add securityContext to all containers"

https://gerrit.wikimedia.org/r/1049248

Alright, first the good news: I was able to deploy the mediawiki changes to mw-debug and canary releases for one service (mw-api-int), and confirmed that (1) slow logs still work and (2) no obvious increase in error rates etc.

Now the less-good news: We appear to use the previously hard-coded "local_service" upstream cluster name in a number of dashboards and alerts, and alas that changes with newer versions of mesh.configuration.

Before moving forward, I'll need to audit / fix those.

Change #1049260 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/alerts@master] mw-on-k8s: extend envoy_cluster_name to new format

https://gerrit.wikimedia.org/r/1049260

I've manually updated the Prometheus queries that previously limited envoy_cluster_name to "local_service" so that they are compatible with the new naming scheme on the following MW-related dashboards:

  • SRE Service Operations > mw-on-k8s Overview
  • Service > MediaWiki on k8s
  • Service > mw-api-ext
  • Service > mw-api-int
  • Service > mw-jobrunner
  • Service > mw-parsoid
  • Service > mw-web

That's everything that comes to mind from a quick scan of "published" (non-draft) dashboards. @JMeybohm can you think of anything else that might need updated?

Not right away. You could use grafana-wtf (https://wikitech.wikimedia.org/wiki/Grafana#Search/audit_metrics_usage_across_dashboards) to search for remaining occurrences of this across all dashboards.
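
If the dashboards are available as exported JSON, a similar audit can be approximated offline. The sketch below is not grafana-wtf and does not reproduce its behaviour: `find_matching_exprs` is a hypothetical helper that assumes queries are stored under `expr` keys (as in Grafana's Prometheus panel targets) and walks the structure recursively to collect the ones that mention a given string. The dashboard fragment and its metric name are made up for illustration.

```python
def find_matching_exprs(node, needle):
    """Recursively collect every string under an "expr" key that contains needle."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "expr" and isinstance(value, str) and needle in value:
                hits.append(value)
            else:
                hits.extend(find_matching_exprs(value, needle))
    elif isinstance(node, list):
        for item in node:
            hits.extend(find_matching_exprs(item, needle))
    return hits


# Made-up dashboard fragment; the second panel is nested inside a row.
dashboard = {
    "title": "example",
    "panels": [
        {"targets": [{"expr": 'sum(rate(some_envoy_metric{envoy_cluster_name="local_service"}[5m]))'}]},
        {"panels": [{"targets": [{"expr": "up"}]}]},
    ],
}

for expr in find_matching_exprs(dashboard, 'envoy_cluster_name="local_service"'):
    print(expr)
```

Searching for the full label matcher rather than just "local_service" keeps the results limited to queries that pin the cluster name.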

Change #1049260 merged by jenkins-bot:

[operations/alerts@master] mw-on-k8s: extend envoy_cluster_name to new format

https://gerrit.wikimedia.org/r/1049260

Ah, perfect! Thank you @JMeybohm - search-grafana-dashboards.js uncovered one more dashboard to migrate, and an older one that could be deleted (apple-search). I also decided to go ahead and update "mw on k8s - WIP ServiceOps" since it's easy to stumble upon by accident.

So, setting aside two personal / experimental dashboards (I'll ping the owners directly), I believe that should be everything.

Change #1049607 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mediawiki: add securityContext to all containers (attempt 2)

https://gerrit.wikimedia.org/r/1049607

Change #1049607 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: add securityContext to all containers (attempt 2)

https://gerrit.wikimedia.org/r/1049607

Mentioned in SAL (#wikimedia-operations) [2024-06-27T17:33:26Z] <swfrench-wmf> canary deployments are healthy, slow-logs still produced, continuing with main deployments for T362978

Change #1046693 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: enable securityContext everywhere

https://gerrit.wikimedia.org/r/1046693

Mentioned in SAL (#wikimedia-operations) [2024-06-27T17:41:30Z] <swfrench@deploy1002> Started scap: Deploying securityContext changes for T362978 to main release

Mentioned in SAL (#wikimedia-operations) [2024-06-27T17:45:39Z] <swfrench@deploy1002> Finished scap: Deploying securityContext changes for T362978 to main release (duration: 04m 09s)

mw-debug and canaries were updated around 17:23 UTC (alas, I hit enter before adding the message on that scap invocation) and main releases around 17:45. About 40m on, things continue to look good - general service health looks fine, slow-logs work, and dashboards picked up the envoy metrics with the change in cluster name.

Great news! I'd say that concludes this task. Thanks for all of the help and patience getting this over the finish line!

Change #1051111 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] cfssl-issuer: Add container securityContext

https://gerrit.wikimedia.org/r/1051111

Change #1051111 merged by jenkins-bot:

[operations/deployment-charts@master] cfssl-issuer: Add container securityContext

https://gerrit.wikimedia.org/r/1051111

Hi, I'm having issues with a flink job running in staging that fails to deploy with this error:
>>> Status | Error | DEPLOYED | {"type":"org.apache.flink.kubernetes.operator.exception.DeploymentFailedException","message":"pods \"flink-app-consumer-search-784bc9fd87-9n862\" is forbidden: violates PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (container \"flink-main-container\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"flink-main-container\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or container \"flink-main-container\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"flink-main-container\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")","additionalMetadata":{"reason":"FailedCreate"},"throwableList":[]}

I wonder if it could be related to this task? Thanks!

Yes, kind of. What deploy command did you run (for me to reproduce)?

Change #1051140 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] flink-app: Add securityContext to the flink container

https://gerrit.wikimedia.org/r/1051140

Change #1051140 merged by jenkins-bot:

[operations/deployment-charts@master] flink-app: Add securityContext to the flink container

https://gerrit.wikimedia.org/r/1051140

FTR: It was the cirrus-streaming-updater deployment in staging that failed. Looks like we missed adding the securityContext to the actual flink application container (we had only done the envoy sidecar).
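
A gap like this one (sidecar covered, main container missed) can be caught by looking at every container in the rendered pod spec rather than at individual templates. A minimal Python sketch, assuming the pod spec is already loaded as a dict: `containers_without_security_context` is a hypothetical helper, and the images and the sidecar name in the example are illustrative. Only flink-main-container is taken from the error reported earlier.

```python
def containers_without_security_context(pod_spec):
    """Return names of containers (init and regular) that set no securityContext."""
    missing = []
    for field in ("initContainers", "containers"):
        for container in pod_spec.get(field) or []:
            if not container.get("securityContext"):
                missing.append(container.get("name", "<unnamed>"))
    return missing


pod_spec = {
    "containers": [
        {"name": "flink-main-container", "image": "example/flink"},
        {
            "name": "envoy-sidecar",  # illustrative name
            "image": "example/envoy",
            "securityContext": {"allowPrivilegeEscalation": False},
        },
    ],
}
print(containers_without_security_context(pod_spec))
```

Because allowPrivilegeEscalation and capabilities can only be set per container, a pod-level securityContext alone is not enough, so every container listed here would need its own.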

Change #1051685 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add securityContext to istio components

https://gerrit.wikimedia.org/r/1051685

Change #1051685 merged by jenkins-bot:

[operations/deployment-charts@master] Add securityContext to istio components

https://gerrit.wikimedia.org/r/1051685

Change #1051690 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add securityContext to opentelemetry pods

https://gerrit.wikimedia.org/r/1051690

Mentioned in SAL (#wikimedia-operations) [2024-07-03T08:53:02Z] <jayme> deployed istio (adding securityContext) to wikikube clusters - T362978

Change #1051690 merged by jenkins-bot:

[operations/deployment-charts@master] Add securityContext to opentelemetry pods

https://gerrit.wikimedia.org/r/1051690

Change #1052700 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] aux: Add securityContext to istio components

https://gerrit.wikimedia.org/r/1052700

Change #1052701 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] dse: Add securityContext to istio components

https://gerrit.wikimedia.org/r/1052701

Change #1052702 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] ml: Add securityContext to istio components

https://gerrit.wikimedia.org/r/1052702

Change #1052700 merged by Elukey:

[operations/deployment-charts@master] aux: Add securityContext to istio components

https://gerrit.wikimedia.org/r/1052700

Change #1052702 merged by jenkins-bot:

[operations/deployment-charts@master] ml: Add securityContext to istio components

https://gerrit.wikimedia.org/r/1052702

Change #1052701 merged by jenkins-bot:

[operations/deployment-charts@master] dse: Add securityContext to istio components

https://gerrit.wikimedia.org/r/1052701

Mentioned in SAL (#wikimedia-analytics) [2024-07-22T08:17:06Z] <brouberol> deploy istio (adding securityContext) to dse-k8s-eqiad cluster - T362978

Change #1068869 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] k8s-controller-sidecars: adopt securityContext

https://gerrit.wikimedia.org/r/1068869

Change #1068869 merged by jenkins-bot:

[operations/deployment-charts@master] k8s-controller-sidecars: adopt securityContext

https://gerrit.wikimedia.org/r/1068869