
toolforge: kubernetes fails to handle some pods that are being mutated by our admission controllers
Closed, Resolved · Public

Description

During the Toolsbeta kubernetes upgrade from 1.24 to 1.25 we found a pod stuck in the Terminating state. The pod could not be killed manually either.

This Pod was created by the functional test suite of toolforge-deploy, as a one-off job:

toolsbeta.test@toolsbeta-bastion-6:~$ toolforge jobs list
+------------+-----------+---------+
| Job name:  | Job type: | Status: |
+------------+-----------+---------+
| test-24344 |  one-off  | Unknown |
+------------+-----------+---------+
toolsbeta.test@toolsbeta-bastion-6:~$ toolforge jobs show test-24344
+---------------+----------------------------------------------------------------------+
| Job name:     | test-24344                                                           |
+---------------+----------------------------------------------------------------------+
| Command:      | echo 'test-24344'                                                    |
+---------------+----------------------------------------------------------------------+
| Job type:     | one-off                                                              |
+---------------+----------------------------------------------------------------------+
| Image:        | python3.11                                                           |
+---------------+----------------------------------------------------------------------+
| Port:         | none                                                                 |
+---------------+----------------------------------------------------------------------+
| File log:     | yes                                                                  |
+---------------+----------------------------------------------------------------------+
| Output log:   | /data/project/test/test-24344.out                                    |
+---------------+----------------------------------------------------------------------+
| Error log:    | /data/project/test/test-24344.err                                    |
+---------------+----------------------------------------------------------------------+
| Emails:       | none                                                                 |
+---------------+----------------------------------------------------------------------+
| Resources:    | default                                                              |
+---------------+----------------------------------------------------------------------+
| Mounts:       | all                                                                  |
+---------------+----------------------------------------------------------------------+
| Retry:        | no                                                                   |
+---------------+----------------------------------------------------------------------+
| Health check: | none                                                                 |
+---------------+----------------------------------------------------------------------+
| Status:       | Unknown                                                              |
+---------------+----------------------------------------------------------------------+
| Hints:        | Last run at 2024-07-12T09:51:52Z. Pod in 'Succeeded' phase. State    |
|               | 'terminated'. Reason 'Completed'. Started at '2024-07-12T09:52:00Z'. |
|               | Finished at '2024-07-12T09:52:00Z'. Exit code '0'.                   |
+---------------+----------------------------------------------------------------------+

Inspecting the system showed some errors:

aborrero@toolsbeta-test-k8s-control-8:~$ sudo -i kubectl -n kube-system logs kube-controller-manager-toolsbeta-test-k8s-control-8
[..]
E0712 10:33:43.963305       1 garbagecollector.go:720] orphanDependents for [batch/v1/Job, namespace: tool-test, name: test-24344, uid: 5b25feb4-fe18-4788-92f8-af3198f2a9b8] failed with failed to orphan dependents of owner [batch/v1/Job, namespace: tool-test, name: test-24344, uid: 5b25feb4-fe18-4788-92f8-af3198f2a9b8], got errors: orphaning [v1/Pod, namespace: tool-test, name: test-24344-snm8n, uid: 4ce279c6-b4d8-44c3-aefe-386c1f84ad8b] failed, Pod "test-24344-snm8n" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
  core.PodSpec{
  	Volumes:        {{Name: "kube-api-access-29dw5", VolumeSource: {Projected: &{Sources: {{ServiceAccountToken: &{ExpirationSeconds: 3607, Path: "token"}}, {ConfigMap: &{LocalObjectReference: {Name: "kube-root-ca.crt"}, Items: {{Key: "ca.crt", Path: "ca.crt"}}}}, {DownwardAPI: &{Items: {{Path: "namespace", FieldRef: &{APIVersion: "v1", FieldPath: "metadata.namespace"}}}}}}, DefaultMode: &420}}}, {Name: "dumps", VolumeSource: {HostPath: &{Path: "/public/dumps", Type: &"Directory"}}}, {Name: "dumpsrc-clouddumps1001", VolumeSource: {HostPath: &{Path: "/mnt/nfs/dumps-clouddumps1001.wikimedia.org", Type: &"Directory"}}}, {Name: "dumpsrc-clouddumps1002", VolumeSource: {HostPath: &{Path: "/mnt/nfs/dumps-clouddumps1002.wikimedia.org", Type: &"Directory"}}}, ...},
  	InitContainers: nil,
  	Containers: []core.Container{
  		{
  			... // 5 identical fields
  			Ports:   nil,
  			EnvFrom: nil,
  			Env: []core.EnvVar{
  				{Name: "HOME", Value: "/data/project/test"},
  				{Name: "TOOL_DATA_DIR", Value: "/data/project/test"},
+ 				{
+ 					Name: "RAYMOND_ENVVAR",
+ 					ValueFrom: &core.EnvVarSource{
+ 						SecretKeyRef: &core.SecretKeySelector{LocalObjectReference: core.LocalObjectReference{...}, Key: "RAYMOND_ENVVAR"},
+ 					},
+ 				},
+ 				{
+ 					Name: "SOMEVAR",
+ 					ValueFrom: &core.EnvVarSource{
+ 						SecretKeyRef: &core.SecretKeySelector{LocalObjectReference: core.LocalObjectReference{...}, Key: "SOMEVAR"},
+ 					},
+ 				},
+ 				{
+ 					Name: "TEST3",
+ 					ValueFrom: &core.EnvVarSource{
+ 						SecretKeyRef: &core.SecretKeySelector{LocalObjectReference: core.LocalObjectReference{...}, Key: "TEST3"},
+ 					},
+ 				},
+ 				{
+ 					Name: "TEST4",
+ 					ValueFrom: &core.EnvVarSource{
+ 						SecretKeyRef: &core.SecretKeySelector{LocalObjectReference: core.LocalObjectReference{...}, Key: "TEST4"},
+ 					},
+ 				},
  			},
  			Resources:    {Limits: {s"cpu": {i: {...}, s: "500m", Format: "DecimalSI"}, s"memory": {i: {...}, Format: "BinarySI"}}, Requests: {s"cpu": {i: {...}, s: "250m", Format: "DecimalSI"}, s"memory": {i: {...}, Format: "BinarySI"}}},
  			VolumeMounts: {{Name: "kube-api-access-29dw5", ReadOnly: true, MountPath: "/var/run/secrets/kubernetes.io/serviceaccount"}, {Name: "dumps", ReadOnly: true, MountPath: "/public/dumps"}, {Name: "dumpsrc-clouddumps1001", ReadOnly: true, MountPath: "/mnt/nfs/dumps-clouddumps1001.wikimedia.org"}, {Name: "dumpsrc-clouddumps1002", ReadOnly: true, MountPath: "/mnt/nfs/dumps-clouddumps1002.wikimedia.org"}, ...},
  			... // 12 identical fields
  		},
  	},
  	EphemeralContainers: nil,
  	RestartPolicy:       "Never",
  	... // 26 identical fields
  }
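
For reference, the rejection quoted above comes from the API server itself: once a pod exists, its spec is immutable except for the handful of fields listed in the error. A quick way to see the same Forbidden error by hand (just a sketch, reusing the pod name from the log) is to try appending an env var to the running pod:

# Try to append an env var to the first container of the running pod; the API
# server should refuse this with the same "Forbidden" message as above.
kubectl -n tool-test patch pod test-24344-snm8n \
    --type=json \
    -p '[{"op": "add", "path": "/spec/containers/0/env/-", "value": {"name": "FOO", "value": "bar"}}]'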

There were a few envvars present:

toolsbeta.test@toolsbeta-bastion-6:~$ toolforge envvars list
name            value
RAYMOND_ENVVAR  raymond_envvar
SOMEVAR         mynameisvar
TEST3           something new
TEST4           osnteu

When the envvars were deleted, the system was finally able to clean up the pod, which was an orphan in the sense that the corresponding Job no longer existed (it had already been deleted).
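
For the record, the cleanup amounted to removing the tool's envvars with the CLI; roughly the following (assuming the usual delete subcommand):

# Drop each injected envvar so the admission controller has nothing left to add
# to the pod on update:
toolforge envvars delete RAYMOND_ENVVAR
toolforge envvars delete SOMEVAR
toolforge envvars delete TEST3
toolforge envvars delete TEST4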

I believe there is a problem with the envvars-admission controller: it tries to modify immutable Pod spec fields when pods are updated.
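
If that is the case, the webhook should probably only mutate pods on CREATE (or explicitly skip pods that are merely being updated or deleted). A quick way to check which of our mutating webhooks are registered for UPDATE operations, just as a sketch:

# List every mutating webhook together with the operations it intercepts; any
# entry that matches pods on UPDATE will also run on the garbage collector's
# orphaning update.
kubectl get mutatingwebhookconfigurations \
    -o custom-columns='NAME:.metadata.name,OPERATIONS:.webhooks[*].rules[*].operations'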

Event Timeline

aborrero changed the task status from Open to In Progress. Jul 12 2024, 11:08 AM
aborrero triaged this task as High priority.

project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/414

volume-admission: bump to 0.0.51-20240715075554-8f9d4061

There's still the question of what is suddenly updating pods?

  • If it's kyverno:
    • How come we suddenly generate pods that need updating?
    • Which component is generating those malformed pods? (it seems that pod came from a jobs-api Job)
  • If it's something else:
    • How come this only happened now?

I don't think Kyverno is involved at all here.

My current theory is this:

  • we detected this during the kubernetes upgrade from 1.24 to 1.25
  • per the error entry in the log (garbagecollector.go:720] orphanDependents for [batch/v1/Job ...), this happened because we deleted a jobs-api one-off Job definition and the system (k8s) then tried to orphan the dependent pod associated with it
  • the newer kubernetes version somehow has additional steps when running the garbage collector, which includes some UPDATE operation on orphaned pods
  • this results in the UPDATE operation being evaluated by the admission controller, something that did not happen on the previous k8s version (see the sketch below)
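
The orphaning step is just a pod UPDATE that drops the ownerReferences from metadata, so it can be reproduced by hand. A rough equivalent (the garbage collector's real patch is more surgical and only removes the owning Job's reference) would be:

# Manually orphan the pod the way the garbage collector does: clear its
# ownerReferences. The patch only touches metadata, but it still goes through
# the mutating webhooks, and if one of them adds env vars the API server
# rejects the whole update with the Forbidden error shown above.
kubectl -n tool-test patch pod test-24344-snm8n \
    --type=merge -p '{"metadata":{"ownerReferences":null}}'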

the newer kubernetes version somehow has additional steps when running the garbage collector, which includes some UPDATE operation on orphaned pods

If it's a new kubernetes GC step, why was this not found on lima-kilo?
Or on other toolsbeta functional test runs?

In lima-kilo we don't upgrade in place using kubeadm. We rebuild with the new version. There is no transition of workloads.

On the other hand, when we did the toolsbeta upgrade, I had the functional tests running in a loop while kubeadm was running.

The theory of new Job/Pod cleanup interactions is supported by at least these 2 upstream changes, included in the 1.25 version:

Reference: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.25.md (search for orphan keyword)

In lima-kilo we don't upgrade in place using kubeadm. We rebuild with the new version. There is no transition of workloads.

I don't think the transition of workloads has anything to do with it; the changes seem to apply to all workloads, not only during an upgrade.

I think this change holds the key:

Introduction of the `DisruptionTarget` pod condition type. Its `reason` field indicates the reason for pod termination:
- PreemptionByKubeScheduler (Pod preempted by kube-scheduler)
- DeletionByTaintManager (Pod deleted by taint manager due to NoExecute taint)
- EvictionByEvictionAPI (Pod evicted by Eviction API)
- DeletionByPodGC (an orphaned Pod deleted by PodGC)

It would be nice to verify whether it happens on manual deletion too (maybe only when deleting the deployment but not the pod? Was that part of the upgrade at any point?). It definitely seems to happen on eviction from a node, so the moment a pod needs to be moved to another node this will be triggered, and for now we only do those on toolsbeta, not on lima-kilo.
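
One way to verify that theory would be to look for the new condition on an affected pod right after its Job is deleted or its node is drained; something along these lines (just a sketch):

# Show the DisruptionTarget condition (if any) on the orphaned pod; a reason of
# DeletionByPodGC would confirm that PodGC is the component updating the pod.
kubectl -n tool-test get pod test-24344-snm8n \
    -o jsonpath='{.status.conditions[?(@.type=="DisruptionTarget")]}'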

project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/419

envvars-admission: bump to 0.0.14-20240716084546-0b645f15

aborrero claimed this task.