Migrate to helm v3
Open, Medium, Public

Description

Meta-task tracking things needed for helm 3 migration

Event Timeline

Change 592967 had a related patch set uploaded (by Hashar; owner: JMeybohm):
[operations/debs/helm3@master] Add debian directory and .gitreview

https://gerrit.wikimedia.org/r/592967

Change 592967 merged by JMeybohm:
[operations/debs/helm3@master] Add debian directory and .gitreview

https://gerrit.wikimedia.org/r/592967

colewhite triaged this task as Medium priority. May 4 2020, 11:13 PM

Change 594564 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Make debian-glue voting for helm3

https://gerrit.wikimedia.org/r/594564

Change 594564 merged by jenkins-bot:
[integration/config@master] Make debian-glue voting for helm3

https://gerrit.wikimedia.org/r/594564

Since we don't use helm3 in production yet, I'm just adding it to the task. There's a new vulnerability specific to 3.x, so let's upgrade to 3.2.4 before we roll this out:
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-4053

Change 642367 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/debs/helm3@master] New upstream version v3.4.1

https://gerrit.wikimedia.org/r/642367

Change 642367 merged by JMeybohm:
[operations/debs/helm3@master] New upstream version v3.4.1

https://gerrit.wikimedia.org/r/642367

Change 643243 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/debs/helmfile@master] Import new helmfile upstream version 0.135.0

https://gerrit.wikimedia.org/r/643243

Change 643243 merged by JMeybohm:
[operations/debs/helmfile@master] Import new helmfile upstream version 0.135.0

https://gerrit.wikimedia.org/r/643243

Change 643500 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] helm: Add helm3 support to helm module

https://gerrit.wikimedia.org/r/643500

Change 643501 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/debs/helm-diff@master] New upstream version 3.1.3 plus helm3 support

https://gerrit.wikimedia.org/r/643501

Change 643501 merged by JMeybohm:
[operations/debs/helm-diff@master] New upstream version 3.1.3 plus helm3 support

https://gerrit.wikimedia.org/r/643501

Change 643500 merged by JMeybohm:
[operations/puppet@production] helm: Add helm3 support to helm module

https://gerrit.wikimedia.org/r/643500

Change 643745 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] helm: Fix environment syntax in exec

https://gerrit.wikimedia.org/r/643745

Change 643745 merged by JMeybohm:
[operations/puppet@production] helm: Fix environment syntax in exec

https://gerrit.wikimedia.org/r/643745

Need to update/fix helm (2) to satisfy helmfile's "helm version" parser (it panics again)

Change 643884 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/debs/helm@master] New upstream version 2.17.0

https://gerrit.wikimedia.org/r/643884

Change 643884 merged by JMeybohm:
[operations/debs/helm@master] New upstream version 2.17.0

https://gerrit.wikimedia.org/r/643884

Change 643907 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[integration/config@master] helm-linter: Update helm and components, add helm3

https://gerrit.wikimedia.org/r/643907

Change 643908 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[integration/config@master] jjb: update job to releng/helm-linter:0.2.11

https://gerrit.wikimedia.org/r/643908

Change 643907 merged by jenkins-bot:
[integration/config@master] helm-linter: Update helm and components, add helm3

https://gerrit.wikimedia.org/r/643907

The CI Docker image failed to build:

Creating /tmp/helm/repository 
Creating /tmp/helm/repository/cache (image.py:179)
Creating /tmp/helm/repository/local 
Creating /tmp/helm/plugins (image.py:179)
Creating /tmp/helm/starters (image.py:179)
Creating /tmp/helm/cache/archive (image.py:179)
Creating /tmp/helm/repository/repositories.yaml 
Adding stable repo with URL: https://charts.helm.sh/stable (image.py:179)
Adding local repo with URL: http://127.0.0.1:8879/charts (image.py:179)
$HELM_HOME has been configured at /tmp/helm.
Not installing Tiller due to 'client-only' flag having been set (image.py:179)
"wmf-stable" has been added to your repositories (image.py:179)
"incubator" has been added to your repositories (image.py:179)
"wmf-stable" has been added to your repositories (image.py:179)
Error: repo "https://kubernetes-charts-incubator.storage.googleapis.com" is no longer available; try "https://charts.helm.sh/incubator" instead (image.py:179)

I guess the helm3 Debian package is obsolete?

Thanks @hashar
I'll take a look at whether what we use from the old incubator has made it to the new one.

I guess the helm3 Debian package is obsolete?

No, it actually contains helm version 3 (available as helm3), whereas the package helm contains version 2 (available as helm/helm2), so that we can install/use both versions in parallel.
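
For illustration, a quick way to check that both are available side by side (a sketch, assuming the binary names described above):

$ helm2 version --client --short   # from the "helm" package (v2.x)
$ helm3 version --short            # from the "helm3" package (v3.x)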

Mentioned in SAL (#wikimedia-releng) [2020-11-27T16:08:29Z] <hashar> Successfully tagged docker-registry.discovery.wmnet/releng/helm-linter:0.2.11 for jayme / T251305

Change 643908 merged by jenkins-bot:
[integration/config@master] jjb: update job to releng/helm-linter:0.2.11

https://gerrit.wikimedia.org/r/643908

https://github.com/helm/helm/issues/8271 says that --recreate-pods won't work in helm3; we need to find an alternative.

helm test annotations changed a bit:

Note that until Helm v3, the job definition needed to contain one of these helm test hook annotations: helm.sh/hook: test-success or helm.sh/hook: test-failure. helm.sh/hook: test-success is still accepted as a backwards-compatible alternative to helm.sh/hook: test.

We might also want to add "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded as of https://helm.sh/docs/topics/charts_hooks/#hook-deletion-policies
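
For illustration, a test pod with the updated annotations could look roughly like this (a sketch; the pod name and the hook-delete-policy line are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: "{{ .Release.Name }}-test"
  annotations:
    # Helm 2 style, still accepted by Helm 3 for backwards compatibility:
    # "helm.sh/hook": test-success
    "helm.sh/hook": test
    # optional: remove the test pod before re-creation and after success
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded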

JMeybohm updated the task description.

As a data point: the rolling restart shouldn't be an issue. I just tested the mechanism I created for the mwdebug deployment in a simplified helmfile using helm3, and the command

helmfile -b helm3 --state-values-set roll_restart=1 sync

does effectively restart all pods even when we're using helm3.
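
(For reference, one way such a roll-restart mechanism can work is by injecting a changing pod annotation when roll_restart is set; this is a sketch under that assumption, not necessarily the exact mwdebug implementation:)

spec:
  template:
    metadata:
      annotations:
        {{- if .Values.roll_restart }}
        # a value that changes on every sync forces Kubernetes to roll all pods
        roll-restart/forced-at: {{ now | unixEpoch | quote }}
        {{- end }}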

FTR: It seems everybody read the announcement/docs wrong, or the helm folks rolled it back. It currently says[1]:

NOTE: In the past we recommended using the --recreate-pods flag as another option. This flag has been marked as deprecated in Helm 3 in favor of the more declarative method above.

So we can probably get away with continuing to use it with helm3.

[1] https://v3.helm.sh/docs/howto/charts_tips_and_tricks/#automatically-roll-deployments
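
The declarative method from the linked page boils down to hashing the config into a pod annotation, so pods roll whenever the config changes, roughly:

kind: Deployment
spec:
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}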

elukey closed subtask Restricted Task as Resolved.Aug 5 2021, 5:44 PM

The helm binary in helmfile can be set using the --helm-binary option or by setting helmBinary in the helmfile.yaml.
It can be set globally (as in admin-ng), but it is also possible to change the helm binary depending on the environment.

Here is an example of how to set the binary for miscweb:

helmDefaults:
  tillerNamespace: miscweb
  verify: false
  atomic: true
  timeout: 600
  force: false
  args:
    - "--kubeconfig=/etc/kubernetes/miscweb-{{ .Environment.Name }}.config"
    - "--helm-binary={{ .Environment.Values.helmBinary }}"

In the environments section in the helmfile.yaml we define the binary for each environment:

environments:
  staging:
    values:
      - releases: [main]
      - helmBinary: helm3
  eqiad:
    values:
      - releases: [main]
      - helmBinary: helm2
  codfw:
    values:
      - releases: [main]
      - helmBinary: helm2

With this approach it would be possible to change the helm binary gradually and migrate one service per environment independently. However, this would mean a lot of changes in the helmfiles.

Another approach would be to introduce environment default values (not to be confused with the template environment defaults that already exist in /etc/helmfile-defaults/general-{{ .Environment.Name }}.yaml and are used for the actual helm templates, not for helmfile). Environment default values could be defined in a central file, making it possible to change the helm binary centrally.

environments:
  staging:
    values:
      - releases: [main]
      - "/etc/helmfile-defaults/helmfile-staging.yaml" # helmfile environment default values, controlled by SRE 
  eqiad:
    values:
      - releases: [main]
      - "/etc/helmfile-defaults/helmfile-eqiad.yaml" # helmfile environment default values, controlled by SRE 
  codfw:
    values:
      - releases: [main]
      - "/etc/helmfile-defaults/helmfile-codfw.yaml" # helmfile environment default values, controlled by SRE

With this approach we could migrate a whole environment at a time: depool a cluster, change the binary and redeploy everything using helm3.
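
The central, SRE-controlled file could then be as small as (hypothetical contents):

# /etc/helmfile-defaults/helmfile-staging.yaml
helmBinary: helm3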

The helm binary in helmfile can be set using the --helm-binary option or by setting helmBinary in the helmfile.yaml.
[...]

That is pretty cool, thanks! Did you actually deploy something using helm3 with tillerNamespace: still set? Is it just ignored in that case (helmfile can be really picky :))?

Yes, tillerNamespace was set and seems to be ignored with helm3. I did a deploy in a local Kubernetes environment. Removing tillerNamespace seems to have no effect with helm3 either.

However, my example with --helm-binary is not correct. I tested some noops yesterday without any error, but if you try to deploy a change it will fail: helmDefaults.args are passed to helm, not to helmfile, so it is not possible to set the helm binary using this option. But we can just use helmBinary and it works as expected:

helmBinary: "{{ .Environment.Values.helmBinary }}"
helmDefaults:
  tillerNamespace: miscweb
  verify: false
  atomic: true
  timeout: 600
  force: false
  args:
    - "--kubeconfig=/etc/kubernetes/miscweb-{{ .Environment.Name }}.config"
#    - "--helm-binary={{ .Environment.Values.helmBinary }}" # not working, helm has no option --helm-binary

And some more input regarding RBAC and the replacement of Tiller service account:

Currently two different users exist for a service deployment. One is the less privileged viewer user, which is named after the service and has mostly view permissions (plus portforward). The second is the more privileged tiller service account (see helmfile_rbac.yaml). With the migration to helm3, Tiller will be gone, so a new RBAC solution has to be found.

I see the following RBAC options:

  1. extend permissions of viewer service user to also do helmfile deployments
  2. introduce a second user for helmfile deploys only
  3. add a Tiller-like component to the cluster with separate service account (like helm-controller)

1. extend permissions of viewer user

This would be simple to implement but would also introduce additional security risks, so I won't go into more detail.

2. introduce a second user for helmfile

This option would keep the less privileged viewer user for every service, and introduce a more privileged user for helmfile deployments. This user would be named <service_name>-helmfile, with namespaced permissions similar to those of the former Tiller serviceaccount. The serviceaccount can be used via the --kubeconfig parameter in helmfile.yaml. With this approach, day-to-day access with kubectl and kube_env would be limited to read-only access; deployments would happen with the more privileged service account. However, this solution would mean that you could escalate your privileges in every service namespace, not by accident but on purpose, by using the corresponding kube.config file. So use outside of helmfile would be strongly discouraged.

3. add a Tiller-like component

Another approach would be to add some kind of gate or layer between helmfile and the actual deployment and restrict access to the privileged user. This would mean building some Tiller-like component or using something like helm-controller. But adding another layer of abstraction would increase complexity even more, and using some kind of helm operator would require quite a bit of refactoring of the current workflow.

final thoughts

In the new setup I would like to use two users, for enhanced security and to prevent accidental write access. So my proposal would be option 2. I would like to hear your thoughts, especially regarding the security of this proposal. Introducing new privileged users for every namespace that are accessible by non-root users may be a security risk. However, compared to the current helm setup I don't see a new attack vector, as the same can be achieved with helm access as well.

Thanks for writing this up. As discussed already, I'm in for option 2 as well, as it keeps things "mostly" as they are. As you said, someone with access to tiller (e.g. every deployer) can currently escalate to tiller privileges anyway (by just telling tiller what to do). It just gets a bit easier to do so if we provide a second kubeconfig.

I'd propose naming the new user <service-name>-deploy or <service-name>-rw though, so as not to have a helmfile reference in there (who knows what we use in N months ;-) ).

There is already a ClusterRole named deploy for the aggregation of view and pods/portForward permissions, so I would prefer using the names <service-name> and <service-name>-rw for the two users. If we use deploy in the name of the new user, it may be confused with the read-only user and the deploy ClusterRole.

In general I don't think it's a problem if API objects of different kinds share the same name (they are ultimately easy to distinguish by kind).
But you absolutely have a point, as the naming becomes pretty weird here:
if we do as proposed, we end up with users called <service-name> which are bound via a RoleBinding called deploy to the ClusterRole view, and which are ultimately unable to deploy as they only have read-only access.
Plus a bunch of users called <service-name>-rw which are actually used to deploy stuff. This does not feel right.

What about:

  • We rename the aggregating ClusterRole to: view-and-portforward (so it's named after what it does, cosmetics only)
  • We invent a new ClusterRole: deploy with the permissions of the current tiller ClusterRole (but we need to have more than one, see tiller-flink)
  • We call the users:
    • <service-name>: Bound to ClusterRole view via RoleBinding called view (this RoleBinding is currently called deploy)
    • <service-name>-deploy or <service-name>-rw: Bound to ClusterRole deploy via RoleBinding called deploy

(Turns out I actually like calling different but connected API objects the same ;-) )
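
A sketch of what that naming scheme could look like for a single namespace ("miscweb" is just an illustrative service name; the ClusterRoles view and deploy are assumed to exist as described above):

# read-only access: user <service-name> bound to ClusterRole "view"
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: view
  namespace: miscweb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: miscweb
---
# deploy access: user <service-name>-deploy bound to ClusterRole "deploy"
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deploy
  namespace: miscweb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: deploy
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: miscweb-deploy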

Change 715266 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/deployment-charts@master] helmfile.d admin rename view rbac resources

https://gerrit.wikimedia.org/r/715266

Change 715266 merged by jenkins-bot:

[operations/deployment-charts@master] helmfile.d admin rename view rbac resources

https://gerrit.wikimedia.org/r/715266

Change 715498 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/deployment-charts@master] helmfile.d admin add dedicated deploy user

https://gerrit.wikimedia.org/r/715498

Thanks for this writeup.

Let me start by saying that of the 3 solutions, the basic idea of the 3rd one should be the one we aim for in the long run (but not now). Deployments where a well-written, reviewed and, most importantly, auditing agent performs all the potentially destructive actions, while the human operator does not have those privileges, are a good endeavor. However, that's also pretty far down the road, and we have other itches to scratch right now. So out of 1 and 2, I'd love 2, but it comes with a big caveat: we'll have to create the set of 2 users manually each time (and that process is entirely manual right now). The only way this is going to be acceptable, even if just medium term, is if we attack that front next and try to automate it. If we can't do so (and I'd prefer we do), I'd much rather go for option 1.

Before I dive into the naming bikeshedding I have one question regarding the following 2 sentences

In the new setup I would like to use two users, for enhanced security and to prevent accidental write access. So my proposal would be option 2. I would like to hear your thoughts, especially regarding the security of this proposal. Introducing new privileged users for every namespace that are accessible by non-root users may be a security risk. However, compared to the current helm setup I don't see a new attack vector, as the same can be achieved with helm access as well.
As you said, someone with access to tiller (e.g. every deployer) can currently escalate to tiller privileges anyway (by just telling tiller what to do). It just gets a bit easier to do so if we provide a second kubeconfig.

The attack path being that one would send tiller a malicious helm release and tiller would happily oblige (up to its RBAC capabilities of course, e.g. no new users)? That's surely the case, but there is one extra function tiller performs there: it keeps the history of the releases, so we have some auditing capabilities. Yes, I know delete --purge would remove that audit history from tiller; this isn't so much about fighting a truly serious, knowledgeable and patient attacker, but rather about catching mistakes. From my understanding, with helm3 we would keep this functionality, as the history is now stored as a secret in the Kubernetes API. Unfortunately that's a bit easier to mess with.

Anyway, on the naming side, view-and-portforward sounds ok to me, and naming the actual deploying role (the one used by tiller now) deploy is also fine. As for the per-service accounts, <service-name> and <service-name>-deploy, to map more easily to the roles.

Adding a comment in here since I am trying to figure out a similar thing (although I have way less context) for what we'll probably call ml-services dir under helmfile.d (see T286791). I would love to start with helm3 straight away, and we could think about using the ml-serve-eqiad cluster as testbed in case it fits (we are testing an MVP, nothing in production yet). Let me know!

Change 715498 merged by jenkins-bot:

[operations/deployment-charts@master] helmfile.d admin add dedicated deploy user

https://gerrit.wikimedia.org/r/715498

Adding a comment in here since I am trying to figure out a similar thing (although I have way less context) for what we'll probably call ml-services dir under helmfile.d (see T286791). I would love to start with helm3 straight away, and we could think about using the ml-serve-eqiad cluster as testbed in case it fits (we are testing an MVP, nothing in production yet). Let me know!

Thanks for the support! I would recommend taking a look at helmfile.d/admin_ng, which deploys all the base components to our Kubernetes clusters. admin_ng is already using helm3 for deployment, so if you are starting with a fresh cluster this should be quite easy. Just make sure to use the right binary (helmBinary: helm3) and have a proper RBAC configuration in place.

One open topic is the additional annotations for helm test, but I think we will figure that out soon as well.

If you have helmfiles or RBAC configuration to review, feel free to add me.

helm test annotations changed a bit:

Note that until Helm v3, the job definition needed to contain one of these helm test hook annotations: helm.sh/hook: test-success or helm.sh/hook: test-failure. helm.sh/hook: test-success is still accepted as a backwards-compatible alternative to helm.sh/hook: test.

We might also want to add "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded as of https://helm.sh/docs/topics/charts_hooks/#hook-deletion-policies

I checked all charts for deprecated and removed helm annotations. We don't use "helm.sh/hook": test-failure; this annotation is removed in helm3.

We use "helm.sh/hook": test-success in most charts. This annotations is deprecated in helm3. But running helmfile test works fine also with deprecated annotations.

helm3 also removed the "helm.sh/hook": crd-install annotation. We don't use that specific annotation, but in charts/knative-serving-crds we have some CRDs with knative.dev/crd-install: "true" annotations. I have to check whether these CRDs get created properly with helm3.

@JMeybohm You also mentioned "helm.sh/hook-delete-policy". My understanding is that the delete policy did not change from helm2 to helm3 (see v2_v3_migration).

So I don't think we have to change any annotations before migrating to helm3. "helm.sh/hook": test-success is compatible with helm3 and can be changed after the migration. "helm.sh/hook-delete-policy" is more of a general discussion about how we want to treat test objects, which I see as unrelated to helm3. Do you want to change any of the mentioned annotations before/during the helm3 migration? I would like to change as little as possible during the migration and remove the deprecated annotations later.

I checked all charts for deprecated and removed helm annotations. We don't use "helm.sh/hook": test-failure; this annotation is removed in helm3.

We use "helm.sh/hook": test-success in most charts. This annotations is deprecated in helm3. But running helmfile test works fine also with deprecated annotations.

That's pretty nice!

helm3 also removed the "helm.sh/hook": crd-install annotation. We don't use that specific annotation, but in charts/knative-serving-crds we have some CRDs with knative.dev/crd-install: "true" annotations. I have to check whether these CRDs get created properly with helm3.

Don't worry about knative annotations. knative is installed via helm3 already.

@JMeybohm You also mentioned "helm.sh/hook-delete-policy". My understanding is that the delete policy did not change from helm2 to helm3 (see v2_v3_migration).

So I don't think we have to change any annotations before migrating to helm3. "helm.sh/hook": test-success is compatible with helm3 and can be changed after the migration. "helm.sh/hook-delete-policy" is more of a general discussion about how we want to treat test objects, which I see as unrelated to helm3. Do you want to change any of the mentioned annotations before/during the helm3 migration? I would like to change as little as possible during the migration and remove the deprecated annotations later.

I agree. We can have the migration (and potential addition of hook-delete-policy) as follow up then. Great.

Some additional RBAC requirements:

On releases1002 and releases2002 helm is used as well, so when migrating we have to make sure that the user there lines up with the user we define.

We also have to make sure to update the helm client there as well:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/releases/mediawiki.pp#23

Change 720342 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/deployment-charts@master] helmfile.d/admin make tiller components configurable per environment

https://gerrit.wikimedia.org/r/720342

Change 720342 merged by jenkins-bot:

[operations/deployment-charts@master] helmfile.d/admin make tiller components configurable per environment

https://gerrit.wikimedia.org/r/720342

Change 721301 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/deployment-charts@master] services: deploy services with helm3

https://gerrit.wikimedia.org/r/721301

Change 721373 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] hier::common::deployment_server add environment helmfile-defaults

https://gerrit.wikimedia.org/r/721373

My current plan for the cluster-by-cluster migration and deploying services with helm3 is:

  • make sure cluster is depooled
  • delete helm releases for all services
  • remove tiller components (feature added in change 720342, set value helmVersion to helm3 in admin_ng for given cluster)
  • switch helm binary to helm3 for all service deployments (feature proposed in change 721301, set value helmBinary to helm3 in /etc/helmfile-defaults for given cluster)
  • redeploy all services with helmfile
  • make sure everything works
  • repeat for other clusters
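
In helmfile terms, the per-service part of this could look roughly as follows (commands illustrative):

# per service, from its helmfile.d directory on the deployment server:
helmfile -e codfw destroy   # delete the helm2-managed release
# ...flip helmVersion/helmBinary to helm3 for the cluster, then:
helmfile -e codfw apply     # redeploy the service with helm3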

For staging-codfw I proposed a change introducing .Environment.Values.kubeConfig. With this value we can test everything in staging-codfw as well by running:

helmfile --state-values-set kubeConfig=/etc/kubernetes/<service>-staging-codfw.config -e staging apply

I think this feature could also be useful after the migration, for deploying to the depooled staging cluster for testing. However, it introduces the risk of deploying to a different environment than what is logged in SAL: you could pretend to deploy to staging and actually deploy to production.
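
(A hypothetical helmfile.yaml snippet consuming such a value; the environment value provides the default and --state-values-set overrides it:)

environments:
  staging:
    values:
      - kubeConfig: /etc/kubernetes/<service>-staging.config

helmDefaults:
  args:
    - "--kubeconfig={{ .Environment.Values.kubeConfig }}"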

To further test this process, the -deploy users are needed (see the comment above). @akosiaris you mentioned that you would like to refactor the user "management". Should I wait for a change from your side, or do you need support? It would also be possible for me to add the new users just like the existing ones (at least in staging-codfw).

Jelto updated the task description.