Page MenuHomePhabricator

Kubernetes credentials on deployment servers should be available to deployers, not all users
Closed, ResolvedPublic

Description

Currently the kubeconfig credentials are available to the wikidev group, which includes anyone who has a login on that system.
That hardly makes sense as it should probably be restricted to the deployment group.

This group is basically thought as "anyone who can deploy code, minus ops", so my proposal would be to:

  • Add ops_members to the deployment group
  • Switch the kubeconfig files to be in the deployment group.

This will also unbreak the issue we currently have with the deploy-to-mwdebug service not being able to run as mwbuilder.

If we do this, we can revert change attached to this task.

Event Timeline

Change 778568 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mwdebug_deploy: switch back to using the root user

https://gerrit.wikimedia.org/r/778568

Change 778568 merged by Giuseppe Lavagetto:

[operations/puppet@production] mwdebug_deploy: switch back to using the root user

https://gerrit.wikimedia.org/r/778568

JMeybohm added a subscriber: thcipriani.

I'll prepare the needed patches.

/CC @thcipriani as this will alter the deployment group where you're listed as approver

Change 779047 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add all members of the ops group to the deployment group

https://gerrit.wikimedia.org/r/779047

Change 779048 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Switch default group for Kubernetes credentials files to deployer

https://gerrit.wikimedia.org/r/779048

I'll prepare the needed patches.

/CC @thcipriani as this will alter the deployment group where you're listed as approver

👍 restricting to deployment seems sensible, and I like the proposal—approved as approver.

Change 779047 merged by JMeybohm:

[operations/puppet@production] Add all members of the ops group to the deployment group

https://gerrit.wikimedia.org/r/779047

Mentioned in SAL (#wikimedia-operations) [2022-04-14T08:36:46Z] <jayme> added ops members to deplotment group - T305729

Change 779048 merged by JMeybohm:

[operations/puppet@production] Switch default group for Kubernetes credentials files to deployment

https://gerrit.wikimedia.org/r/779048

Change 780629 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Revert "mwdebug_deploy: switch back to using the root user"

https://gerrit.wikimedia.org/r/780629

Change 780629 merged by RLazarus:

[operations/puppet@production] Revert "mwdebug_deploy: switch back to using the root user"

https://gerrit.wikimedia.org/r/780629

Change 780632 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] Revert "Revert "mwdebug_deploy: switch back to using the root user""

https://gerrit.wikimedia.org/r/780632

Change 780632 merged by RLazarus:

[operations/puppet@production] Revert "Revert "mwdebug_deploy: switch back to using the root user""

https://gerrit.wikimedia.org/r/780632

Problems detected today:

  • WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mwdebug-deploy-eqiad.config is display by helmfile. There are times when it seems like it is upgrading the warning to an error, but it's unclear .
  • helmfile apply wants to write to var/cache/helm/repository/ which is owned by helm:wikidev and is group-writable. At the time of testing we're either running as mwbuilder or mwdeploy, neither of which is in the wikidev group.
  • WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mwdebug-deploy-eqiad.config is display by helmfile. There are times when it seems like it is upgrading the warning to an error, but it's unclear .

I don't think this ever gets elevated to en error (but I do get that it is annoying).

  • helmfile apply wants to write to var/cache/helm/repository/ which is owned by helm:wikidev and is group-writable. At the time of testing we're either running as mwbuilder or mwdeploy, neither of which is in the wikidev group.

Indeed /etc/helm (HELM_CONFIG_HOME), /usr/share/helm (HELM_DATA_HOME) and /var/cache/helm (HELM_CACHE_HOME) are owned by helm:wikidev and 0775 currently. I don't think the first two need that broad permissions (will change to helm:helm, 0755), HELM_CACHE_HOME I'll change to be owned by deployment group.

Change 786269 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Fix permissions/ownership of helm directories

https://gerrit.wikimedia.org/r/786269

Change 786269 merged by JMeybohm:

[operations/puppet@production] Fix permissions/ownership of helm directories

https://gerrit.wikimedia.org/r/786269

Change 787504 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Make /etc/helmfile-defaults/private readable by deployment group

https://gerrit.wikimedia.org/r/787504

Change 787504 merged by JMeybohm:

[operations/puppet@production] Make /etc/helmfile-defaults/private readable by deployment group

https://gerrit.wikimedia.org/r/787504

Thanks for the adjustments. Everything seems to be working ok now.

Hi folks! In T307927 I am trying to figure out why ml-team deployers (not in the deployment group) are not able to use helmfile/helm when there are charts changes, is it possible that this is due to the HELM_CACHE_HOME perm changes? The ml-team can opt-in the deployment group without issues, we didn't need it since we only deploy to the ML team's k8s clusters.

Hi folks! In T307927 I am trying to figure out why ml-team deployers (not in the deployment group) are not able to use helmfile/helm when there are charts changes, is it possible that this is due to the HELM_CACHE_HOME perm changes? The ml-team can opt-in the deployment group without issues, we didn't need it since we only deploy to the ML team's k8s clusters.

Yeah, that's probably the reason as helm tries to update the cache. So basically everyone that needs to run helm needs to write to the cache dir. Alternatively we can probably stop declaring HELM_CACHE_HOME for all users and live with a small performance penalty.

Hi folks! In T307927 I am trying to figure out why ml-team deployers (not in the deployment group) are not able to use helmfile/helm when there are charts changes, is it possible that this is due to the HELM_CACHE_HOME perm changes? The ml-team can opt-in the deployment group without issues, we didn't need it since we only deploy to the ML team's k8s clusters.

I guess it could make sense to have separate HELM_CACHE_HOME directories for different groups then?

I confirm from Aiko's tests that HELM_CACHE_HOME is the problem, so we can try to set it differently for various groups. From what I can see in puppet, it could be as simple as:

  1. creating a new dir via puppet called /var/cache/ml/helm or similar with proper permissions
  2. set it conditionally in kube-env.sh if the user is in the right group (in this case, deploy-ml-services).

The default would be /var/cache/helm of course.

Just to note a semantic thing here:

I am trying to figure out why ml-team deployers (not in the deployment group)

So, somehow some deployers are not in a group called deployment. I hate to be pedantic, but this just sounds confusing (to say the least). Aside from the technical solution, whatever that may be, either the deployment group should be renamed to more correctly reflect the purpose it has, or the aforementioned deployers should to be added to the group.

Just to note a semantic thing here:

I am trying to figure out why ml-team deployers (not in the deployment group)

So, somehow some deployers are not in a group called deployment. I hate to be pedantic, but this just sounds confusing (to say the least). Aside from the technical solution, whatever that may be, either the deployment group should be renamed to more correctly reflect the purpose it has, or the aforementioned deployers should to be added to the group.

Yeah, I think we have a chance to start introducing multitenancy into our deployment process. Right now anyone with deployment rights can deploy anything. Case in point, people who don't have +2 on operations/mediawiki-config can currently deploy mediawiki. Same, I don't think anyone maintaining restbase should be able to deploy termbox.

This has been kind-of implemented in scap3 using different keyholders tied to different deployment groups. We might want to do something similar for hemlfile as well.

There's a lot of stuff to figure out before that's doable, but separating HELM_CACHE_HOME might be enough not to make the situation worse, and at least keep the group for ML deployers separate from the group that is supposed to deploy on the wikikube cluster.

Puppet fails on the contint* hosts cause /var/cache/helm went from being owned by wikidev to deployment, a group which does not exist on those hosts. Filed as T307740

Reopening since it seems that more discussion is needed :)

Change 791565 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] C:helm: make the group permissions on helm_cache configurable

https://gerrit.wikimedia.org/r/791565

Do we really think we need a global/shared cache directory? AIUI it is used to:

  1. Cache helm charts pulled from the registry (I doubt there is a real benefit compared to pulling them from the chart repo)
  2. Cache the helm chart repo index (which we auto-update using a timer)

The latter is probably where we're most interested in and that does not even need to be writable for users of helm (but only for the user running the repo update systemd timer). Fortunately with helm3 there is an env variable dedicated to that (HELM_REPOSITORY_CACHE).

So my suggestion would be:

  • Don't set HELM_CACHE_HOME at all (it will default to XDG_CACHE_HOME/helm, so each user will have their own
  • Set HELM_REPOSITORY_CACHE globally (just being writable by the helm user running the systemd timer)

helm 2 did have a different structure regarding these things, it's definitely fine to revisit and re-evaluate the approach.

The main benefit I see with the shared cache (I doubt there is a noticeable perf improvement), is not having to figure out bugs that arise from X different caches for X different users. That being said, not sure those bugs are fewer then having a shared cache. Judging from this, maybe not?

Another use case mentioned in T307927#7921020 is that, IIUC, the /etc/helmfile/private config files changed group ownership as well, impacting ml deployers. For example, Aiko was able to set a home-local HELM_REPOSITORY_CACHE successfully, but then helmfile diff prompted the removal of a Secret due to some private files not readable anymore.

I opened T308308 to add Aiko and Kevin to the deployment group, but as Giuseppe mentioned above it seemed to me not relevant to the ml use case (namely only deploying to the ml-serve clusters via helmfile). It could be a quick solution to unblock us (ML) to deploy, while we think about a more complete solution. Let me know :)

Change 791565 merged by Jbond:

[operations/puppet@production] C:helm: make the group permissions on helm_cache configurable

https://gerrit.wikimedia.org/r/791565

AIUI this is done.