
Allow members of restricted to run maintenance scripts
Closed, Resolved · Public

Description

A point surfaced in T341553#10268045: members of the restricted group can run classic mwscript, but they can't run mwscript-k8s, since they can neither log into the deployment hosts nor read the mw-script{,-deploy} kube configs. This affects on the order of a dozen people who actually log into the mwmaint hosts.

Some options:

  1. Decide that members of restricted can no longer run maintenance scripts when we turn off the mwmaint hosts. (This is probably not desired, but it's the default outcome if we do nothing.)
  2. Add all members of restricted to deployment. (This grants more privileges than necessary, and is probably not desired.)
  3. Let members of restricted run mwscript-k8s on the deploy host.
    • That means letting them log into the deploy host, which means we should check the existing sudo rules and make sure we're okay with carrying them over.
    • It also means changing the owner of the kube configs (presently mwdeploy:deployment). I'm not sure whether we'll need some new group mw-script-deployers, containing by reference the members of both restricted and deployment, to make this work.
  4. Let members of restricted run mwscript-k8s on some new VM instead. (That would be a pain to set up, and a bigger pain to maintain, but it would avoid messing with permissions on the deployment hosts.)

Details

Other Assignee
RLazarus
Related Changes in Gerrit:
Repo                          Branch      Lines +/-
operations/deployment-charts  master      +1 -0
operations/deployment-charts  master      +44 -32
operations/deployment-charts  master      +30 -25
operations/puppet             production  +27 -2
operations/deployment-charts  master      +33 -23
operations/deployment-charts  master      +27 -17
operations/puppet             production  +1 -1
operations/puppet             production  +14 -3
operations/puppet             production  +2 -4
operations/deployment-charts  master      +69 -29
operations/deployment-charts  master      +9 -9
operations/puppet             production  +3 -6
operations/puppet             production  +2 -1
operations/puppet             production  +4 -2
operations/puppet             production  +8 -5
operations/puppet             production  +12 -16
operations/puppet             production  +66 -43
operations/puppet             production  +9 -0

Event Timeline

Restricted Application added a subscriber: Aklapper.

Ranked in order of preference:

  • 3 seems like the "cleanest". A spin on this could be to deploy a couple of config files specific to mw-script, running with the new group, so that we contain the permission change to this particular use case. This permission set could also be used for mw-cron (T341555), so that people responsible for periodic maintenance jobs can act on them without necessarily needing deployment permissions.
  • 2 is the easiest, but we need to assess the impact of widening this permission set since it is quite powerful.
  • 4 is basically recreating mwmaint as a VM. Thinking about it, it could be interesting for separation of privileges, but as you said, setup and maintenance would be a burden.
  • 1 is a non-starter in my opinion: we don't want this migration to take away people's ability to do work, but to enable new ways to do it.

I'm listed as approver for restricted and deployment.

When I evaluate requests for additions to those groups, I aim to grant the least privilege. That is, folks in deployment have a superset of the needs of the folks in restricted.

Given that mental model, ordered ranking:

  • Option 4 – maintains the mental model I described, and the status quo: we know it preserves permissions and gives folks the access they need. But I gather it's not trivial, so I'm not clinging to this option.
  • Option 3 – a good trade-off of risk vs. effort. The risks seem like they could be mitigated.
  • Option 2 – has risks; it's unclear whether they could be mitigated, but it may be acceptable.
  • Option 1 – folks in the group need to run maintenance scripts. That said, restricted could be re-audited; maybe only one or two members still need to run maintenance scripts, which might make option 2 more appealing.

Risks for options 2 & 3 (that I can think of):

  • Sudoer rule carryover – when the group moves to a new machine, the same sudo rules from data.yaml carry over; this needs checking
  • Accidents:
    • revealing secrets – e.g., via copy-paste errors that put something meant to be secret into a public space; hopefully solvable by auditing permissions on deployment/secret-related repos (e.g., /etc/kubernetes, /srv/mediawiki-staging, /srv/patches, /srv/deployment, others?)
    • mistyped commands – never completely solvable, but the same permission auditing as for secrets helps
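The permission auditing mentioned above could start with a simple sweep for group-readable files. A minimal sketch, assuming the group and directory names from this task; the helper name is made up:

```shell
# Hypothetical helper (name made up): list regular files under the given
# directories that members of a particular group could read via the group bit.
# This is only a partial check; a full audit would also look for world-readable
# files (-perm -o=r) and group-readable files owned by other groups.
audit_group_readable() {
  group="$1"; shift
  find "$@" -xdev -type f -group "$group" -perm -g=r 2>/dev/null
}

# Illustrative invocation on a deploy host, with the directories named above:
# audit_group_readable restricted /etc/kubernetes /srv/mediawiki-staging /srv/patches /srv/deployment
```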

Just in case it's helpful for more use cases: Trust and Safety routinely uses maintenance scripts for various support stuff but also for removing illegal content when legally required. So finding a way for that to continue would be much appreciated!

RLazarus added a subscriber: Raine.

Assigning as discussed in the serviceops meeting. Thanks!

Change #1114963 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add restricted users to deployment_server

https://gerrit.wikimedia.org/r/1114963

We decided to go with option 3, as it seems the most intuitive and doesn't impose an additional maintenance burden. There's a new sudoers rule coming in that allows restricted users to run commands as www-data, which I don't think is particularly worrisome in the context of the deploy hosts.
A basic check of the obvious directories revealed nothing concerning either.

We should also keep in mind that users in the restricted group are highly trusted individuals anyway.

Change #1114963 merged by JMeybohm:

[operations/puppet@production] Add restricted users to deployment_server

https://gerrit.wikimedia.org/r/1114963

Kubeconfig files (and certificates) have been created:

-rw-r----- 1 mwdeploy restricted  490 Jan 30 10:23 /etc/kubernetes/mw-cron-restricted-codfw.config
-rw-r----- 1 mwdeploy restricted  518 Jan 30 10:23 /etc/kubernetes/mw-cron-restricted-deploy-codfw.config
-rw-r----- 1 mwdeploy restricted  518 Jan 30 10:23 /etc/kubernetes/mw-cron-restricted-deploy-eqiad.config
-rw-r----- 1 mwdeploy restricted  539 Jan 30 10:23 /etc/kubernetes/mw-cron-restricted-deploy-staging-codfw.config
-rw-r----- 1 mwdeploy restricted  539 Jan 30 10:23 /etc/kubernetes/mw-cron-restricted-deploy-staging.config
-rw-r----- 1 mwdeploy restricted  539 Jan 30 10:23 /etc/kubernetes/mw-cron-restricted-deploy-staging-eqiad.config
-rw-r----- 1 mwdeploy restricted  490 Jan 30 10:23 /etc/kubernetes/mw-cron-restricted-eqiad.config
-rw-r----- 1 mwdeploy restricted  511 Jan 30 10:23 /etc/kubernetes/mw-cron-restricted-staging-codfw.config
-rw-r----- 1 mwdeploy restricted  511 Jan 30 10:23 /etc/kubernetes/mw-cron-restricted-staging.config
-rw-r----- 1 mwdeploy restricted  511 Jan 30 10:23 /etc/kubernetes/mw-cron-restricted-staging-eqiad.config
-rw-r----- 1 mwdeploy restricted  500 Jan 30 10:23 /etc/kubernetes/mw-script-restricted-codfw.config
-rw-r----- 1 mwdeploy restricted  528 Jan 30 10:23 /etc/kubernetes/mw-script-restricted-deploy-codfw.config
-rw-r----- 1 mwdeploy restricted  528 Jan 30 10:23 /etc/kubernetes/mw-script-restricted-deploy-eqiad.config
-rw-r----- 1 mwdeploy restricted  549 Jan 30 10:23 /etc/kubernetes/mw-script-restricted-deploy-staging-codfw.config
-rw-r----- 1 mwdeploy restricted  549 Jan 30 10:23 /etc/kubernetes/mw-script-restricted-deploy-staging.config
-rw-r----- 1 mwdeploy restricted  549 Jan 30 10:23 /etc/kubernetes/mw-script-restricted-deploy-staging-eqiad.config
-rw-r----- 1 mwdeploy restricted  500 Jan 30 10:23 /etc/kubernetes/mw-script-restricted-eqiad.config
-rw-r----- 1 mwdeploy restricted  521 Jan 30 10:23 /etc/kubernetes/mw-script-restricted-staging-codfw.config
-rw-r----- 1 mwdeploy restricted  521 Jan 30 10:23 /etc/kubernetes/mw-script-restricted-staging.config
-rw-r----- 1 mwdeploy restricted  521 Jan 30 10:23 /etc/kubernetes/mw-script-restricted-staging-eqiad.config

Assigning back to you, @RLazarus, for tool-side implementation.

Thanks for this!

It looks like mw-script-restricted-deploy has privileges that match mw-script-restricted, not mw-script-deploy.

root@deploy2002:~# kube_env mw-script-restricted-deploy codfw
root@deploy2002:~# kubectl auth can-i create pods/attach
no
root@deploy2002:~# kubectl auth can-i delete job
no

root@deploy2002:~# kube_env mw-script-deploy codfw
root@deploy2002:~# kubectl auth can-i create pods/attach
yes
root@deploy2002:~# kubectl auth can-i delete job
yes

We're doing the right thing with the hieradata entries you added (at deployment_server.pp:87 we check for whether the username ends in -deploy, which it does, so the cert gets { 'organisation' => 'deploy' } correctly).

But helmfile_namespaces.yaml:67 only binds the deployer role to a single user, ${namespace}-deploy. We could override it with deployUser in the values, but we can't currently have two. The same applies at line 95 for the PSP.

What's the best way to go here? We could arrange for that template to take an array deployUsers instead of a single string, but I don't know whether that would violate an assumption elsewhere that we only have one. (Or we might be able to back out and let members of restricted use the original mw-script{,-deploy} config files; that would mean modifying the permissions on those original configs instead.)

Change #1120462 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add second pair of kubeconfig files for restricted users

https://gerrit.wikimedia.org/r/1120462

Change #1120464 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] pki::get_cert: Allow to get the same cert twice

https://gerrit.wikimedia.org/r/1120464

Change #1121443 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Support multiple Kubernetes configs in mwscript-k8s

https://gerrit.wikimedia.org/r/1121443

Change #1121455 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Read mwscript-k8s MW image from values, not kube API

https://gerrit.wikimedia.org/r/1121455

Change #1121443 merged by RLazarus:

[operations/puppet@production] deployment_server: Support multiple Kubernetes configs in mwscript-k8s

https://gerrit.wikimedia.org/r/1121443

Change #1121455 merged by RLazarus:

[operations/puppet@production] deployment_server: Read mwscript-k8s MW image from values, not kube API

https://gerrit.wikimedia.org/r/1121455

Change #1120464 merged by JMeybohm:

[operations/puppet@production] pki::get_cert: Allow to get the same cert twice

https://gerrit.wikimedia.org/r/1120464

Sorry, my bad. There is an option to choose a different name for the generated kubeconfig files, but it unfortunately doesn't currently support re-using the same user (certificate). I've added a patch to profile::pki::get_cert() that would solve that (by allowing the same certificate to be requested twice), but that might be a bit invasive, and I'd say it requires a rubberstamp from the PKI maintainers. (And I totally forgot to submit this comment last week.)

Change #1120462 merged by JMeybohm:

[operations/puppet@production] Add second pair of kubeconfig files for restricted users

https://gerrit.wikimedia.org/r/1120462

Change #1122259 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Add mw-script-restricted config to mwscript-k8s

https://gerrit.wikimedia.org/r/1122259

Change #1122259 merged by RLazarus:

[operations/puppet@production] deployment_server: Add mw-script-restricted config to mwscript-k8s

https://gerrit.wikimedia.org/r/1122259

Change #1122666 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Pass kubeConfig in helmfile state values

https://gerrit.wikimedia.org/r/1122666

Change #1122666 merged by RLazarus:

[operations/puppet@production] deployment_server: Pass kubeConfig in helmfile state values

https://gerrit.wikimedia.org/r/1122666

Hmm, @jrbs, who's helping me test (as a member of restricted) reports getting this:

[...]
  Error: Kubernetes cluster unreachable: invalid configuration: [unable to read client-cert /etc/kubernetes/pki/wikikube__mw-script-deploy.pem for mw-script-deploy due to open /etc/kubernetes/pki/wikikube__mw-script-deploy.pem: permission denied, unable to read client-key /etc/kubernetes/pki/wikikube__mw-script-deploy-key.pem for mw-script-deploy due to open /etc/kubernetes/pki/wikikube__mw-script-deploy-key.pem: permission denied]
  Error: plugin "diff" exited with error

☠️️ Command failed with status 1: /usr/bin/helmfile --quiet --file /srv/deployment-charts/helmfile.d/services/mw-script/helmfile.yaml --state-values-set kubeConfig=/etc/kubernetes/mw-script-restricted-deploy-codfw.config --environment codfw --selector name=2d5dvah8 apply --values /tmp/tmpaaa4yqbg --suppress-diff

So /etc/kubernetes/mw-script-restricted-deploy-codfw.config (which is readable to restricted) points at /etc/kubernetes/pki/wikikube__mw-script-deploy-key.pem (which isn't). Catting the config shows the same paths under client-certificate/client-key. There is, however, a wikikube__mw-script-restricted-deploy-key.pem that is readable.
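One quick way to confirm this kind of mismatch is to pull the cert/key paths out of the kubeconfig and test readability directly. A debugging sketch; the helper name is made up, and it assumes the kubeconfig references certs by path (as here) rather than inline data:

```shell
# Debugging sketch: print the client-certificate/client-key paths referenced
# by a kubeconfig, and whether the current user can read each of them.
check_kubeconfig_readable() {
  cfg="$1"
  grep -Eo '(client-certificate|client-key): .*' "$cfg" | awk '{print $2}' |
  while read -r pem; do
    if [ -r "$pem" ]; then echo "OK   $pem"; else echo "DENY $pem"; fi
  done
}

# Illustrative run for the failing case described above:
# check_kubeconfig_readable /etc/kubernetes/mw-script-restricted-deploy-codfw.config
```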

@JMeybohm Normally I'd guess we should pass $user['kubeconfig'] rather than $user['name'] to profile::pki::get_cert() (deployment_server.pp:92), but I know you went out of your way not to -- what do you think?

Indeed, we can't use the same cert for both kubeconfig files, given the different permissions required. Unfortunately, the second parameter to get_cert is the cert's common name, which is the "user" in Kubernetes' means of authentication. So changing that would bring us back to square one, where we don't have appropriate RBAC rules in place.
(The wikikube__mw-script-restricted-deploy-key.pem is a leftover from the initial attempt that has not been made absent in Puppet.)

I think we have two options:

  1. Add functionality to profile::pki::get_cert() that allows us to specify a different title (e.g. file name) for the certificate files. Currently the names are derived from the CA and common name of the certificate (e.g. we can't get two certificates with different permissions for the same CA and common name right now).
  2. Dump profile::kubernetes::deployment_server::services to a YAML file on the deployment servers (like in global_config.pp) and feed it into admin_ng/helmfile_namespaces.yaml, in order to create proper RBAC rules for all requested users.

The second option seems a bit more correct to me, but probably also requires a bit more work; we also need to take the CI snowflake into account, or remove it (T288629: Automated validation of mediawiki-multiversion images).

Note this makes it challenging to post the "stream logs" command on tasks (see what I had to do in T385780#10590355). Unfortunately, we can't just print the restricted path, as deployment doesn't have access to that file. (Deployers can read it by sudoing through mwdeploy, with something like K8S_CLUSTER=codfw KUBECONFIG=/etc/kubernetes/mw-script-restricted-codfw.config sudo -u mwdeploy -E kubectl logs -f job/mw-script.codfw.tid05s4u mediawiki-tid05s4u-app, but that is far from ideal.) Would it be possible to also create a single command for streaming logs that members of both groups could use?
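A single log-streaming entry point could simply fall back to whichever kubeconfig the caller can read. A minimal sketch under that assumption; the helper name, the fallback order, and the non-restricted config path are made up:

```shell
# Sketch: pick the first readable kubeconfig from a list, so that one command
# works for members of both deployment and restricted.
pick_kubeconfig() {
  for cfg in "$@"; do
    if [ -r "$cfg" ]; then echo "$cfg"; return 0; fi
  done
  return 1
}

# Illustrative usage (job name from the example above; config paths assumed):
# cfg=$(pick_kubeconfig /etc/kubernetes/mw-script-deploy-codfw.config \
#                       /etc/kubernetes/mw-script-restricted-codfw.config)
# KUBECONFIG="$cfg" kubectl logs -f job/mw-script.codfw.tid05s4u mediawiki-tid05s4u-app
```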

Change #1126940 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] global_config: Add kubernetesVersion for each environment/cluster

https://gerrit.wikimedia.org/r/1126940

Change #1126965 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] helmfile: Dump data about each service (users, namespace etc.) to yaml

https://gerrit.wikimedia.org/r/1126965

Change #1127064 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Revert "Add second pair of kubeconfig files for restricted users"

https://gerrit.wikimedia.org/r/1127064

Change #1127085 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] helmfile_namespaces.yaml: Replace deprecated .Environment.Values

https://gerrit.wikimedia.org/r/1127085

Change #1127086 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] helmfile_namespaces: Merge hiera services with admin_ng namespaces

https://gerrit.wikimedia.org/r/1127086

Change #1127087 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Update admin_ng fixtures to reflect puppet changes

https://gerrit.wikimedia.org/r/1127087

I tried to do the right thing and integrated the Kubernetes 'services' data from hiera into the helmfile deployment. While this works "in production", I spent a good chunk of the day trying to get CI to work with it, and failed.
I've included two new files for each cluster/environment in helmfile.d/admin_ng/helmfile.yaml (/etc/helmfile-defaults/{general,services}-*.yaml), which are obviously not available in CI. There is a bunch of code that rewrites (sed-style) helmfiles to replace those /etc/ files with .fixture equivalents, but as of now I have not found a way to do that properly for the "main" helmfile of admin_ng.
Maybe you see what I could not, or you manage to lure the actual Ruby pro into this ;)
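The fixture substitution described here amounts to rewriting the absolute /etc/helmfile-defaults paths before CI evaluates the helmfile. A rough sketch of what such a sed pass does; the function name and the repo-local fixture directory are illustrative:

```shell
# Sketch: rewrite absolute /etc/helmfile-defaults/ references in a helmfile
# to repo-local fixture equivalents, roughly what the CI tooling does for the
# non-"main" helmfiles.
substitute_fixtures() {
  sed -E 's#/etc/helmfile-defaults/#.fixtures/#g' "$1"
}
```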

Change #1127085 merged by jenkins-bot:

[operations/deployment-charts@master] helmfile_namespaces.yaml: Replace deprecated .Environment.Values

https://gerrit.wikimedia.org/r/1127085

Change #1127086 merged by jenkins-bot:

[operations/deployment-charts@master] helmfile_namespaces: Merge hiera services with admin_ng namespaces

https://gerrit.wikimedia.org/r/1127086

Change #1127064 merged by RLazarus:

[operations/puppet@production] Revert "Add second pair of kubeconfig files for restricted users"

https://gerrit.wikimedia.org/r/1127064

Change #1126965 merged by RLazarus:

[operations/puppet@production] helmfile: Dump data about each service (users, namespace etc.) to yaml

https://gerrit.wikimedia.org/r/1126965

Change #1137339 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Fix a type mismatch

https://gerrit.wikimedia.org/r/1137339

Change #1137339 merged by RLazarus:

[operations/puppet@production] deployment_server: Fix a type mismatch

https://gerrit.wikimedia.org/r/1137339

Change #1138494 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] admin_ng: Read RoleBinding usernames from services hiera

https://gerrit.wikimedia.org/r/1138494

Change #1138494 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Read RoleBinding usernames from services hiera

https://gerrit.wikimedia.org/r/1138494

Change #1138938 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] admin_ng: Read RoleBinding usernames from services hiera, correctly

https://gerrit.wikimedia.org/r/1138938

Change #1138938 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Read RoleBinding usernames from services hiera, correctly

https://gerrit.wikimedia.org/r/1138938

Change #1138951 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Use ~/.cache/helm if /var/cache/helm isn't writable

https://gerrit.wikimedia.org/r/1138951

Change #1138951 merged by RLazarus:

[operations/puppet@production] deployment_server: Use ~/.cache/helm if /var/cache/helm isn't writable

https://gerrit.wikimedia.org/r/1138951

Change #1127087 merged by jenkins-bot:

[operations/deployment-charts@master] Update admin_ng fixtures to reflect puppet changes

https://gerrit.wikimedia.org/r/1127087

JMeybohm reopened this task as Open. · Edited May 13 2025, 3:54 PM

I've reverted https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127087 because it triggers a race condition in CI (revert at: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1145248)

Reopening this so we don't forget.

08:38:34 rake aborted!
08:38:34 NoMethodError: undefined method `filter!' for true:TrueClass
08:38:34 /src/.rake_modules/tester/asset.rb:729:in `filter_fixtures'
08:38:34 /src/.rake_modules/tester/asset.rb:723:in `collect_fixtures'
08:38:34 /src/.rake_modules/tester/asset.rb:49:in `initialize'
08:38:34 /src/.rake_modules/tester/asset.rb:398:in `initialize'
08:38:34 /src/.rake_modules/tester/asset.rb:715:in `initialize'
08:38:34 /src/Rakefile:713:in `new'
08:38:34 /src/Rakefile:713:in `tasklist_from_changes'
08:38:34 /src/Rakefile:623:in `block in <top (required)>'
08:38:34 /usr/share/rubygems-integration/all/gems/rake-13.0.3/exe/rake:27:in `<top (required)>'
08:38:34 Tasks: TOP => default => check_change
08:38:34 (See full trace by running task with --trace)
08:38:34 Build step 'Execute shell' marked build as failure

Change #1145857 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Reapply "Update admin_ng fixtures to reflect puppet changes"

https://gerrit.wikimedia.org/r/1145857

Change #1145857 merged by jenkins-bot:

[operations/deployment-charts@master] Reapply "Update admin_ng fixtures to reflect puppet changes"

https://gerrit.wikimedia.org/r/1145857

Resolving this again, since the updated patch, for some reason, no longer triggers the race.

Change #1149609 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Update admin_ng fixtures to reflect puppet changes

https://gerrit.wikimedia.org/r/1149609

Change #1149609 merged by jenkins-bot:

[operations/deployment-charts@master] Update admin_ng fixtures to reflect puppet changes

https://gerrit.wikimedia.org/r/1149609