Page MenuHomePhabricator

Issue creating pods after migration away from PodPresets
Closed, ResolvedPublic

Description

Following the migration of Toolforge away from PodPresets, I’ve had an issue in the notwikilambda tool where the function-orchestrator deployment couldn’t create new pods (or new replicasets?). The error message was:

8m59s Warning FailedCreate replicaset/function-orchestrator-6f886d59d Error creating: Internal error occurred: add operation does not apply: doc is missing path: "/spec/volumes/-": missing value

You can see the deployment.yaml at the time on GitHub; I worked around it by removing “toolforge: tool” from it and instead mounting the function-orchestrator source manually (SAL).

Event Timeline

Note that I’m still using toolforge: tool in other Kubernetes objects, where it seems to work; in particular, the restart cronjob runs twice a day, successfully (kubectl get deployment function-evaluator -o json | jq -r '.spec.template.metadata.annotations["kubectl.kubernetes.io/restartedAt"]' shows that the function-evaluator deployment was actually restarted), and to do so it must have the home directory mounted (otherwise kubectl would be missing its credentials).

(The other toolforge: tool object I have is the update deployment, but that one has been running since before PodPresets were phased out, if I’m not mistaken, so that doesn’t necessarily prove anything.)

On both pods for the controller I see on of these on different dates: 2021/09/30 17:22:52 http: TLS handshake error from 192.168.48.128:39842: EOF, but it seems to be functioning for the most part, so I'm not sure what that's about. That's not currently the pod's IP address, and I don't even see that IP in the current environment, so I presume that's just some old stuff.

I was able to mount things when I ran a quick "webservice shell" to simulate using the preset. The "missing path" error sounds like it is trying to patch the wrong object almost, but the definition doesn't look wrong. I don't see the error just yet. I'll poke a little more.

I see your problem with the new setup:

automountServiceAccountToken: false

By not mounting the SA token, there's no "volumes" item in your pod. If there's no volumes item, you cannot do an "add" operation, which is what the mutating webhook controller does.

For now, I am willing to bet, you can just remove that line, and your tool will work via the label again. We also should probably upgrade the web hook so that it can function without a pre-existing volumes list as well.

Very similar to what people experienced here https://github.com/evanphx/json-patch/issues/138. The json representation of the structure of the pod object is not fixed in k8s, so if there is no volume at all on the pod, this mutator fails because it is using json-patch as it's strategy. Since the default service account mounts its own creds in most pods, this isn't a problem unless you create your own object that disables that feature.

I see your problem with the new setup:

automountServiceAccountToken: false

By not mounting the SA token, there's no "volumes" item in your pod. If there's no volumes item, you cannot do an "add" operation, which is what the mutating webhook controller does.

https://github.com/wikimedia/cloud-toolforge-volume-admission-controller/blob/0a2d76cf09541767df7d675cf7d7e8a88c885e6b/server/admission.go#L148
I was curious how the admission controller was adding volumes to a non-existent array. So the SA automount is the answer.

I've got a patch set that I'm going to test in Minikube in a few.

Change 726997 had a related patch set uploaded (by Bstorm; author: Bstorm):

[cloud/toolforge/volume-admission-controller@main] json-patch: create empty arrays where needed

https://gerrit.wikimedia.org/r/726997

Bstorm changed the task status from Open to In Progress.Oct 7 2021, 12:47 AM
Bstorm claimed this task.
Bstorm triaged this task as Medium priority.
Bstorm moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Change 726997 merged by jenkins-bot:

[cloud/toolforge/volume-admission-controller@main] json-patch: create empty arrays where needed

https://gerrit.wikimedia.org/r/726997

Fix tested in production.