Run helm test after deploy
Open, MediumPublic
Actions

Assigned To

None

Authored By

	JMeybohm
	Mar 9 2021, 3:45 PM

Description

We have a bunch of charts that define helm tests (running service checker). Unfortunately those tests are only run if triggered manually and not during a standard deploy.

We could change that by adding a postsync hook to the helmfiles running the actual tests, like:

diff --git a/helmfile.d/services/termbox/helmfile.yaml b/helmfile.d/services/termbox/helmfile.yaml
index 0fc240f..812cfdb 100644
--- a/helmfile.d/services/termbox/helmfile.yaml
+++ b/helmfile.d/services/termbox/helmfile.yaml
@@ -32,6 +32,10 @@ templates:
             "{{`{{.HelmfileCommand}}`}}",
             "[{{ .Environment.Name }}] Ran '{{`{{.HelmfileCommand}}`}}' command on namespace '{{`{{.Release.Namespace}}`}}' for release '{{`{{.Release.Name}}`}}' .",
           ]
+      - name: "test"
+        events: ["postsync"]
+        command: "helmfile"
+        args: ["-e", "{{ .Environment.Name }}", "-l", "name={{`{{.Release.Name}}`}}", "test", "--cleanup"]
 
 # TODO: convert to the new release naming scheme
 releases:

Of course this does not come without some helm-ish salt:

If the test fails it leaves the pod behind (cool!) but you can't run the test again because it left the pod behind (not cool!)
When you run "helmfile test --cleanup" and the test fails - guess what - it deletes the pod so you can't look into why it failed.

I added --cleanup nevertheless for now. Maybe the logs can be grabbed from logstash then.

Details

	Subject	Repo	Branch	Lines +/-
	Workaround `helm3 test --logs` bug by omitting logs	integration/pipelinelib	master	+3 -3
	charts: remove depricated helm test annotation, fix hook-delete-policy	operations/deployment-charts	master	+81 -57

Customize query in gerrit

Related Objects

Mentioned In: T251305: Migrate to helm v3
T276550: Missing alerts for Termbox staging and test services
Mentioned Here: T251305: Migrate to helm v3

Event Timeline

JMeybohm triaged this task as Medium priority.Mar 9 2021, 3:45 PM

JMeybohm created this task.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 9 2021, 3:45 PM

JMeybohm mentioned this in T276550: Missing alerts for Termbox staging and test services.Mar 9 2021, 4:15 PM

Just gonna leave a note here that the command in the task description is seemingly outdated, and you shouldn’t try to run it manually (like I did):

lucaswerkmeister-wmde@deploy1002 /srv/deployment-charts/helmfile.d/services/termbox (master $ u=) $ helmfile -e staging -l name=staging test --cleanup                                                                                                                                                                                                     
warn: requested cleanup will not be applied. To clean up test resources with Helm 3, you have to remove them manually or set helm.sh/hook-delete-policy

(The test also ultimately failed, but it’s not clear to me if that’s related to the Helm 3 migration as well, or something else. But at least the --cleanup no longer seems to do what it once did.)

JMeybohm mentioned this in T251305: Migrate to helm v3.Dec 7 2021, 3:09 PM

I think the behavior changed quite a bit between helm2 and helm3. So the described behavior above is a bit outdated.

Deletion of test resources in helm3 can be controlled using the "helm.sh/hook-delete-policy" annotation.

By default helm3 uses the "helm.sh/hook-delete-policy": before-hook-creation. Which is described as (docs):

Delete the previous resource before a new hook is launched (default)

So with default settings, test resources stay in the cluster until a new test is executed. Test resources from old tests don't block new tests, because this resources will get removed before new tests are executed.

That's not too bad, but it would be nice if test resources are removed if the test was successful (so the cluster is not cluttered with test resources). Luckily helm3 also has a annotation for that: "helm.sh/hook-delete-policy": hook-succeeded. Which is described as (docs)

Delete the resource after the hook (test) is successfully executed

So I would propose to add the following two annotations to every test we have:

annotations:
  "helm.sh/hook": test # test-success annotation is depricated
  "helm.sh/hook-delete-policy": hook-succeeded # cleanup test resources if test was successful

And then use the following hook in helmfiles:

- name: "test"
  events: ["postsync"]
  command: "helmfile"
  args: ["-e", "{{ .Environment.Name }}", "-l", "name={{`{{.Release.Name}}`}}", "test"]

Change 757877 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/deployment-charts@master] charts: remove depricated helm test annotation, fix hook-delete-policy

https://gerrit.wikimedia.org/r/757877

gerritbot added a project: Patch-For-Review.Jan 28 2022, 12:50 PM

@JMeybohm mentioned to me that it is possible to combine multiple "helm.sh/hook-delete-policy".

So the proposed annotations in T276949#7655743 can be enhanced by also using the before-hook-creation policy. So the complete annotation is:

annotations:
  "helm.sh/hook": test
  "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded

By combining before-hook-creation and hook-succeeded successful tests are cleaned up after the test execution. Failed test pods will remain in the cluster for further investigation. However failed test pods don't block the execution of new tests with a duplicate resource error, because they are cleaned up before the next test execution.

I implemented this annotation change while removing deprecated helm2 annotations in https://gerrit.wikimedia.org/r/757877. This change does not add automatic testing and hooks to helmfiles, which is somewhat independent from the helm3 migration task T251305.

Change 757877 merged by jenkins-bot:

[operations/deployment-charts@master] charts: remove depricated helm test annotation, fix hook-delete-policy

https://gerrit.wikimedia.org/r/757877

Maintenance_bot removed a project: Patch-For-Review.Jan 28 2022, 2:11 PM

In T276949#7659528, @Jelto wrote:

I implemented this annotation change while removing deprecated helm2 annotations in https://gerrit.wikimedia.org/r/757877. This change does not add automatic testing and hooks to helmfiles, which is somewhat independent from the helm3 migration task T251305.

It seems this may have caused a regression in our PipelineLib based service pipelines. See a recent build of blubber-pipeline-rehearse Jenkins build for the error "Error: unable to get pod logs for blubberoid-qkhgqbib-service-checker: pods "blubberoid-qkhgqbib-service-checker" not found". There's also seemingly a relevant bug in Helm 3.

The workaround for now may be to remove the --logs option from our invocation of helm test. I'm not sure yet if there's a better solution or workaround. We'll have to investigate further. (cc @jeena)

bd808 subscribed.Jan 31 2022, 10:36 PM

Change 758571 had a related patch set uploaded (by Dduvall; author: Dduvall):

[integration/pipelinelib@master] Workaround `helm3 test --logs` bug by omitting logs

https://gerrit.wikimedia.org/r/758571

gerritbot added a project: Patch-For-Review.Jan 31 2022, 10:41 PM

Change 758571 merged by jenkins-bot:

[integration/pipelinelib@master] Workaround `helm3 test --logs` bug by omitting logs

https://gerrit.wikimedia.org/r/758571

Maintenance_bot removed a project: Patch-For-Review.Jan 31 2022, 11:10 PM

dancy updated the task description. (Show Details)Jun 15 2022, 2:51 PM

jijiki moved this task from Incoming 🐫 to 🙈🙉🙊Backlog on the serviceops board.Sep 28 2022, 2:23 PM