During the Toolsbeta Kubernetes upgrade from 1.24 to 1.25 we found a pod stuck in the Terminating state. The pod could not be killed manually either.
This pod was created by the functional test suite of toolforge-deploy, as a one-off job:
toolsbeta.test@toolsbeta-bastion-6:~$ toolforge jobs list
+------------+-----------+---------+
| Job name:  | Job type: | Status: |
+------------+-----------+---------+
| test-24344 | one-off   | Unknown |
+------------+-----------+---------+
toolsbeta.test@toolsbeta-bastion-6:~$ toolforge jobs show test-24344
+---------------+----------------------------------------------------------------------+
| Job name:     | test-24344                                                           |
+---------------+----------------------------------------------------------------------+
| Command:      | echo 'test-24344'                                                    |
+---------------+----------------------------------------------------------------------+
| Job type:     | one-off                                                              |
+---------------+----------------------------------------------------------------------+
| Image:        | python3.11                                                           |
+---------------+----------------------------------------------------------------------+
| Port:         | none                                                                 |
+---------------+----------------------------------------------------------------------+
| File log:     | yes                                                                  |
+---------------+----------------------------------------------------------------------+
| Output log:   | /data/project/test/test-24344.out                                    |
+---------------+----------------------------------------------------------------------+
| Error log:    | /data/project/test/test-24344.err                                    |
+---------------+----------------------------------------------------------------------+
| Emails:       | none                                                                 |
+---------------+----------------------------------------------------------------------+
| Resources:    | default                                                              |
+---------------+----------------------------------------------------------------------+
| Mounts:       | all                                                                  |
+---------------+----------------------------------------------------------------------+
| Retry:        | no                                                                   |
+---------------+----------------------------------------------------------------------+
| Health check: | none                                                                 |
+---------------+----------------------------------------------------------------------+
| Status:       | Unknown                                                              |
+---------------+----------------------------------------------------------------------+
| Hints:        | Last run at 2024-07-12T09:51:52Z. Pod in 'Succeeded' phase. State    |
|               | 'terminated'. Reason 'Completed'. Started at '2024-07-12T09:52:00Z'. |
|               | Finished at '2024-07-12T09:52:00Z'. Exit code '0'.                   |
+---------------+----------------------------------------------------------------------+
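For reference, the same state can also be checked from the Kubernetes side. The following is a sketch of that kind of inspection (standard kubectl; the pod name is the one that appears in the controller-manager error further down), not a verbatim transcript:

# Sketch (not a verbatim transcript): inspect the stuck pod directly.
# The pod name comes from the controller-manager error shown below.
aborrero@toolsbeta-test-k8s-control-8:~$ sudo -i kubectl -n tool-test get pods
aborrero@toolsbeta-test-k8s-control-8:~$ sudo -i kubectl -n tool-test get pod test-24344-snm8n \
    -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}{.metadata.ownerReferences}{"\n"}'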
Inspecting the system showed some errors:
aborrero@toolsbeta-test-k8s-control-8:~$ sudo -i kubectl -n kube-system logs kube-controller-manager-toolsbeta-test-k8s-control-8
[..]
E0712 10:33:43.963305       1 garbagecollector.go:720] orphanDependents for [batch/v1/Job, namespace: tool-test, name: test-24344, uid: 5b25feb4-fe18-4788-92f8-af3198f2a9b8] failed with failed to orphan dependents of owner [batch/v1/Job, namespace: tool-test, name: test-24344, uid: 5b25feb4-fe18-4788-92f8-af3198f2a9b8], got errors: orphaning [v1/Pod, namespace: tool-test, name: test-24344-snm8n, uid: 4ce279c6-b4d8-44c3-aefe-386c1f84ad8b] failed, Pod "test-24344-snm8n" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
  core.PodSpec{
    Volumes: {{Name: "kube-api-access-29dw5", VolumeSource: {Projected: &{Sources: {{ServiceAccountToken: &{ExpirationSeconds: 3607, Path: "token"}}, {ConfigMap: &{LocalObjectReference: {Name: "kube-root-ca.crt"}, Items: {{Key: "ca.crt", Path: "ca.crt"}}}}, {DownwardAPI: &{Items: {{Path: "namespace", FieldRef: &{APIVersion: "v1", FieldPath: "metadata.namespace"}}}}}}, DefaultMode: &420}}}, {Name: "dumps", VolumeSource: {HostPath: &{Path: "/public/dumps", Type: &"Directory"}}}, {Name: "dumpsrc-clouddumps1001", VolumeSource: {HostPath: &{Path: "/mnt/nfs/dumps-clouddumps1001.wikimedia.org", Type: &"Directory"}}}, {Name: "dumpsrc-clouddumps1002", VolumeSource: {HostPath: &{Path: "/mnt/nfs/dumps-clouddumps1002.wikimedia.org", Type: &"Directory"}}}, ...},
    InitContainers: nil,
    Containers: []core.Container{
      {
        ... // 5 identical fields
        Ports:   nil,
        EnvFrom: nil,
        Env: []core.EnvVar{
          {Name: "HOME", Value: "/data/project/test"},
          {Name: "TOOL_DATA_DIR", Value: "/data/project/test"},
+         {
+           Name: "RAYMOND_ENVVAR",
+           ValueFrom: &core.EnvVarSource{
+             SecretKeyRef: &core.SecretKeySelector{LocalObjectReference: core.LocalObjectReference{...}, Key: "RAYMOND_ENVVAR"},
+           },
+         },
+         {
+           Name: "SOMEVAR",
+           ValueFrom: &core.EnvVarSource{
+             SecretKeyRef: &core.SecretKeySelector{LocalObjectReference: core.LocalObjectReference{...}, Key: "SOMEVAR"},
+           },
+         },
+         {
+           Name: "TEST3",
+           ValueFrom: &core.EnvVarSource{
+             SecretKeyRef: &core.SecretKeySelector{LocalObjectReference: core.LocalObjectReference{...}, Key: "TEST3"},
+           },
+         },
+         {
+           Name: "TEST4",
+           ValueFrom: &core.EnvVarSource{
+             SecretKeyRef: &core.SecretKeySelector{LocalObjectReference: core.LocalObjectReference{...}, Key: "TEST4"},
+           },
+         },
        },
        Resources: {Limits: {s"cpu": {i: {...}, s: "500m", Format: "DecimalSI"}, s"memory": {i: {...}, Format: "BinarySI"}}, Requests: {s"cpu": {i: {...}, s: "250m", Format: "DecimalSI"}, s"memory": {i: {...}, Format: "BinarySI"}}},
        VolumeMounts: {{Name: "kube-api-access-29dw5", ReadOnly: true, MountPath: "/var/run/secrets/kubernetes.io/serviceaccount"}, {Name: "dumps", ReadOnly: true, MountPath: "/public/dumps"}, {Name: "dumpsrc-clouddumps1001", ReadOnly: true, MountPath: "/mnt/nfs/dumps-clouddumps1001.wikimedia.org"}, {Name: "dumpsrc-clouddumps1002", ReadOnly: true, MountPath: "/mnt/nfs/dumps-clouddumps1002.wikimedia.org"}, ...},
        ... // 12 identical fields
      },
    },
    EphemeralContainers: nil,
    RestartPolicy:       "Never",
    ... // 26 identical fields
  }
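The diff explains the mechanism: to orphan the pod, the garbage collector issues a Pod update that only drops metadata.ownerReferences, which is legal on its own, but the update reaches the API server with extra Env entries (the `+` lines above), so it also touches an immutable part of the spec and gets rejected. A mutating admission webhook that injects envvars on UPDATE as well as CREATE would produce exactly this. A hedged way to check which mutating webhooks intercept pod operations (webhook names are deployment-specific, so this just lists them all):

# Sketch: list mutating webhooks together with the operations they intercept,
# to see whether the envvars webhook also fires on UPDATE.
aborrero@toolsbeta-test-k8s-control-8:~$ sudo -i kubectl get mutatingwebhookconfigurations \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.webhooks[*].rules[*].operations}{"\n"}{end}'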
There were a few envvars present:
toolsbeta.test@toolsbeta-bastion-6:~$ toolforge envvars list
name            value
RAYMOND_ENVVAR  raymond_envvar
SOMEVAR         mynameisvar
TEST3           something new
TEST4           osnteu
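The Pod diff above shows these variables being injected as SecretKeyRef references (the name of the backing Secret is elided as `{...}` in the log). If needed, the Secret objects in the tool namespace can be listed directly; a sketch, without assuming a particular Secret name:

# Sketch: list the Secrets in the tool namespace; one of them backs the envvars.
aborrero@toolsbeta-test-k8s-control-8:~$ sudo -i kubectl -n tool-test get secrets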
When the envvars were deleted, the system was finally able to clean up the pod, which was orphaned in the sense that the corresponding Job no longer existed (it had already been deleted).
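For reference, the cleanup on the tool side amounts to something like the following (assuming the `toolforge envvars delete` subcommand; a sketch, not the exact transcript):

toolsbeta.test@toolsbeta-bastion-6:~$ toolforge envvars delete RAYMOND_ENVVAR
toolsbeta.test@toolsbeta-bastion-6:~$ toolforge envvars delete SOMEVAR
toolsbeta.test@toolsbeta-bastion-6:~$ toolforge envvars delete TEST3
toolsbeta.test@toolsbeta-bastion-6:~$ toolforge envvars delete TEST4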
I believe there is a problem with the envvars-admission controller, which tries to modify immutable Pod fields (the containers' env list) when an existing Pod is updated.
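If that hypothesis is correct, the failure should be reproducible with any write to a running pod in a namespace that still has envvars defined: the webhook would inject the env entries into the update and the API server would reject it with the same "pod updates may not change fields other than ..." error. A sketch of such a check (namespace and pod name are hypothetical placeholders):

# Sketch: a metadata-only change on a running pod is normally allowed; if the
# envvars webhook mutates the spec on UPDATE, this should fail with the same error.
aborrero@toolsbeta-test-k8s-control-8:~$ sudo -i kubectl -n tool-sometool annotate pod some-running-pod envvars-debug=1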