
jobs failing with exit code 127
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Run git checkout main in the jobs-framework-api repo.
  • Run git pull origin main to make sure the repo is up to date.
  • Submit a job, for example toolforge-jobs --debug run anotherjob --command ./sleep40.sh --image tf-python39 (the jobs-framework-cli should already be installed).

What happens?:
After submitting the job, running toolforge-jobs show anotherjob shows that the job failed with exit status 127. Job logs are also not created.
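For context, exit status 127 is what a POSIX shell reports when it cannot find the command it was asked to run. Assuming the job command is launched through a shell, this would be consistent with ./sleep40.sh not being visible inside the container, although the report does not confirm the exact mechanism. A minimal sketch that reproduces the status locally (exit_status_of and the script name are made up for the illustration):

```shell
#!/bin/sh
# exit_status_of is a hypothetical helper for this illustration;
# it is not part of toolforge-jobs.
exit_status_of() {
    rc=0
    # Run the command in a child shell and capture its exit status.
    sh -c "$1" >/dev/null 2>&1 || rc=$?
    echo "$rc"
}

# A script path that does not exist: the shell cannot find it.
missing_status=$(exit_status_of './definitely-missing-script.sh')
# A command that exists and succeeds, for contrast.
ok_status=$(exit_status_of 'true')

echo "missing script: exit status $missing_status"
echo "working command: exit status $ok_status"
```

If the same status shows up for a script that does exist in the tool's home directory, the home directory itself is probably not mounted in the pod.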

What should have happened instead?:
The job should succeed and logs should be created.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

A bit of investigation points to the bug being introduced by this patch: https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/783439. It would be good if someone else could confirm this.

Event Timeline

The patch https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/783439 makes 100% sense. We don't have that admission controller in our development cluster :-(

One option is to declare that the volume admission controller is a dependency of the devel setup and include instructions on how to deploy it locally. It shouldn't be very difficult: mostly just another git clone + deploy.sh.

https://gerrit.wikimedia.org/r/admin/repos/cloud/toolforge/volume-admission-controller

But I just checked and it doesn't even have a deploy.sh script. So we should fix that before anything else.

Change 857599 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/volume-admission-controller@main] volume-admission-controller: add deploy.sh script

https://gerrit.wikimedia.org/r/857599

Change 857599 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/volume-admission-controller@main] volume-admission-controller: add deploy.sh script

https://gerrit.wikimedia.org/r/857599

Update on this, @aborrero: it seems that just installing the volume-admission-controller doesn't solve this. When I tried it, I noticed other problems.

  • First, the job pods were stuck in Pending state, with these events:
LAST SEEN   TYPE      REASON             OBJECT                MESSAGE
51m         Warning   FailedScheduling   pod/normaljob-bm7cw   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.
50m         Warning   FailedScheduling   pod/normaljob-bm7cw   skip schedule deleting pod: tool-test/normaljob-bm7cw
4m21s       Warning   FailedScheduling   pod/normaljob-xp5cc   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.
54m         Normal    SuccessfulCreate   job/normaljob         Created pod: normaljob-bm7cw
45m         Normal    SuccessfulCreate   job/normaljob         Created pod: normaljob-xp5cc
  • @taavi suggested labelling the node with kubectl label nodes kind-control-plane kubernetes.wmcloud.org/nfs-mounted=true. After that, the problem moved on to the job pod getting stuck in ContainerCreating state, with these events:
LAST SEEN   TYPE      REASON             OBJECT                MESSAGE
12m         Normal    Scheduled          pod/normaljob-l2psx   Successfully assigned tool-test/normaljob-l2psx to kind-control-plane
12m         Warning   FailedMount        pod/normaljob-l2psx   MountVolume.SetUp failed for volume "dumps" : hostPath type check failed: /public/dumps is not a directory
2m11s       Warning   FailedMount        pod/normaljob-l2psx   MountVolume.SetUp failed for volume "etcldap-yaml" : hostPath type check failed: /etc/ldap.yaml is not a file
12m         Warning   FailedMount        pod/normaljob-l2psx   MountVolume.SetUp failed for volume "scratch" : hostPath type check failed: /data/scratch is not a directory
12m         Warning   FailedMount        pod/normaljob-l2psx   MountVolume.SetUp failed for volume "etcldap-conf" : hostPath type check failed: /etc/ldap.conf is not a file
12m         Warning   FailedMount        pod/normaljob-l2psx   MountVolume.SetUp failed for volume "etcnovaobserver-yaml" : hostPath type check failed: /etc/novaobserver.yaml is not a file
12m         Warning   FailedMount        pod/normaljob-l2psx   MountVolume.SetUp failed for volume "dumpsrc-clouddumps1001" : hostPath type check failed: /mnt/nfs/dumps-clouddumps1001.wikimedia.org is not a directory
12m         Warning   FailedMount        pod/normaljob-l2psx   MountVolume.SetUp failed for volume "dumpsrc-clouddumps1002" : hostPath type check failed: /mnt/nfs/dumps-clouddumps1002.wikimedia.org is not a directory
17m         Normal    Scheduled          pod/normaljob-pzh2r   Successfully assigned tool-test/normaljob-pzh2r to kind-control-plane
17m         Warning   FailedMount        pod/normaljob-pzh2r   MountVolume.SetUp failed for volume "dumpsrc-clouddumps1001" : hostPath type check failed: /mnt/nfs/dumps-clouddumps1001.wikimedia.org is not a directory
17m         Warning   FailedMount        pod/normaljob-pzh2r   MountVolume.SetUp failed for volume "scratch" : hostPath type check failed: /data/scratch is not a directory
17m         Warning   FailedMount        pod/normaljob-pzh2r   MountVolume.SetUp failed for volume "dumpsrc-clouddumps1002" : hostPath type check failed: /mnt/nfs/dumps-clouddumps1002.wikimedia.org is not a directory
17m         Warning   FailedMount        pod/normaljob-pzh2r   MountVolume.SetUp failed for volume "etcldap-conf" : hostPath type check failed: /etc/ldap.conf is not a file
17m         Warning   FailedMount        pod/normaljob-pzh2r   MountVolume.SetUp failed for volume "etcnovaobserver-yaml" : hostPath type check failed: /etc/novaobserver.yaml is not a file
17m         Warning   FailedMount        pod/normaljob-pzh2r   MountVolume.SetUp failed for volume "dumps" : hostPath type check failed: /public/dumps is not a directory
17m         Warning   FailedMount        pod/normaljob-pzh2r   MountVolume.SetUp failed for volume "etcldap-yaml" : hostPath type check failed: /etc/ldap.yaml is not a file
20m         Warning   FailedScheduling   pod/normaljob-xp5cc   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.
18m         Warning   FailedMount        pod/normaljob-xp5cc   MountVolume.SetUp failed for volume "etcldap-conf" : hostPath type check failed: /etc/ldap.conf is not a file
18m         Warning   FailedMount        pod/normaljob-xp5cc   MountVolume.SetUp failed for volume "etcnovaobserver-yaml" : hostPath type check failed: /etc/novaobserver.yaml is not a file
18m         Warning   FailedMount        pod/normaljob-xp5cc   MountVolume.SetUp failed for volume "dumps" : hostPath type check failed: /public/dumps is not a directory
18m         Warning   FailedMount        pod/normaljob-xp5cc   MountVolume.SetUp failed for volume "scratch" : hostPath type check failed: /data/scratch is not a directory
18m         Warning   FailedMount        pod/normaljob-xp5cc   MountVolume.SetUp failed for volume "etcldap-yaml" : hostPath type check failed: /etc/ldap.yaml is not a file
18m         Warning   FailedMount        pod/normaljob-xp5cc   MountVolume.SetUp failed for volume "dumpsrc-clouddumps1001" : hostPath type check failed: /mnt/nfs/dumps-clouddumps1001.wikimedia.org is not a directory
18m         Warning   FailedMount        pod/normaljob-xp5cc   MountVolume.SetUp failed for volume "dumpsrc-clouddumps1002" : hostPath type check failed: /mnt/nfs/dumps-clouddumps1002.wikimedia.org is not a directory
17m         Normal    SuccessfulCreate   job/normaljob         Created pod: normaljob-pzh2r
12m         Normal    SuccessfulCreate   job/normaljob         Created pod: normaljob-l2psx
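The FailedMount events above all come from hostPath type checks: each volume expects a directory or a regular file at a fixed path on the node. A rough sketch for checking those paths in one go, meant to be run on the node (or against a staging directory passed as a prefix). The function names and the optional root prefix are assumptions for illustration; the path list is taken from the events above.

```shell
#!/bin/sh
# check_hostpath KIND PATH [ROOT]: succeed if ROOT+PATH exists with the
# expected type ("dir" or "file"). Hypothetical helper, not from the repo.
check_hostpath() {
    case $1 in
        dir)  [ -d "${3:-}$2" ] ;;
        file) [ -f "${3:-}$2" ] ;;
        *)    return 2 ;;
    esac
}

# report_hostpaths [ROOT]: print one line per path and return non-zero
# if any path is missing or has the wrong type.
report_hostpaths() {
    root=${1:-}
    missing=0
    while read -r kind path; do
        [ -n "$kind" ] || continue
        if check_hostpath "$kind" "$path" "$root"; then
            echo "ok       $kind  $path"
        else
            echo "missing  $kind  $path"
            missing=$((missing + 1))
        fi
    done <<'EOF'
dir /public/dumps
dir /data/scratch
dir /mnt/nfs/dumps-clouddumps1001.wikimedia.org
dir /mnt/nfs/dumps-clouddumps1002.wikimedia.org
file /etc/ldap.yaml
file /etc/ldap.conf
file /etc/novaobserver.yaml
EOF
    [ "$missing" -eq 0 ]
}
```

Calling report_hostpaths with no argument checks the real paths on the current machine. With kind, that would mean running it inside the node container, since that is where the kubelet resolves hostPath volumes.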

Change 861903 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] devel/: refresh instructions and setup.sh script

https://gerrit.wikimedia.org/r/861903

> update on this @aborrero , it seems like just installing the volume-admission-controller doesn't solve this. When I tried I noticed other problems.

Thanks for the update!

We may need to create more directories by hand (and configure them in kind/minikube as host mounts). Check devel/setup.sh in this patch:

https://gerrit.wikimedia.org/r/861903
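To illustrate the idea (this is not the content of devel/setup.sh in the patch, just a sketch under assumptions): the paths reported in the FailedMount events could be created as empty placeholders in a staging directory on the workstation and exposed to the kind node through extraMounts. The function names and the staging directory are hypothetical. Empty placeholder files should satisfy the hostPath type check, but anything that actually reads them may still fail.

```shell
#!/bin/sh
# Paths taken from the FailedMount events earlier in this task.
DIRS="/public/dumps /data/scratch /mnt/nfs/dumps-clouddumps1001.wikimedia.org /mnt/nfs/dumps-clouddumps1002.wikimedia.org"
FILES="/etc/ldap.yaml /etc/ldap.conf /etc/novaobserver.yaml"

# stage_mount_sources STAGE: create empty placeholder directories and
# files under STAGE (a hypothetical staging directory on the workstation).
stage_mount_sources() {
    for d in $DIRS; do
        mkdir -p "$1$d"
    done
    for f in $FILES; do
        mkdir -p "$(dirname "$1$f")"
        [ -e "$1$f" ] || : > "$1$f"
    done
}

# print_kind_config STAGE: print a kind cluster config that bind-mounts
# each staged path into the node at the path the pods expect.
print_kind_config() {
    cat <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
EOF
    for p in $DIRS $FILES; do
        printf '      - hostPath: %s%s\n        containerPath: %s\n' "$1" "$p" "$p"
    done
}
```

A possible way to use it: call stage_mount_sources with an absolute staging directory, redirect the output of print_kind_config for that same directory into kind-config.yaml, then create the cluster with kind create cluster --config kind-config.yaml. The node label suggested by @taavi would still have to be applied afterwards.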

Change 861903 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] devel/: refresh instructions and setup.sh script

https://gerrit.wikimedia.org/r/861903