Steps to replicate the issue (include links if applicable):
- clone the jobs-framework-api repository (https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api) and follow the instructions to set it up on your local machine.
- submit a job job1.sh that echoes "delaying by 1hr" to stdout and then sleeps for 3600 seconds (ensure that filelog is enabled)
- submit a job job_does_not_exist.sh where the given script file doesn't exist.
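
For reference, a minimal sketch of how the job1.sh script described above could be created. The script body follows the reproduction step; the helper name `write_job_script` and the use of a temporary directory are assumptions made for illustration and are not part of the jobs framework.

```python
import stat
import tempfile
from pathlib import Path

# Script body taken from the reproduction step: echo a marker, then sleep for one hour.
JOB1_SCRIPT = """#!/bin/sh
echo "delaying by 1hr"
sleep 3600
"""


def write_job_script(directory: Path, name: str = "job1.sh") -> Path:
    """Write the reproduction script into `directory` and make it executable.

    Hypothetical helper for illustration; not part of jobs-framework-api.
    """
    path = directory / name
    path.write_text(JOB1_SCRIPT)
    # Add the executable bits for user, group and others on top of the current mode.
    path.chmod(path.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path


if __name__ == "__main__":
    # Illustration only: a temporary directory stands in for the directory the job runs from.
    with tempfile.TemporaryDirectory() as tmp:
        script = write_job_script(Path(tmp))
        print(script.read_text())
```

No file needs to be created for job_does_not_exist.sh, since the point of that step is that the script is missing.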
What happens?:
- the job1.sh job produces no log, yet its status is reported as "Completed". This indicates that the script was never executed: had it run, it would have written the expected string to the .out file and waited for 1hr (if k8s allows) before terminating. Instead, the job terminates almost immediately.
- job_does_not_exist.sh is also reported as "Completed" even though this job should clearly fail because the script doesn't exist.
What should have happened instead?:
- the job1.sh job should log to the .out file, then wait for up to 1hr before the job pod is marked as "Completed" and terminated.
- the show command for the job_does_not_exist.sh job should report a status indicating that the job encountered an error and failed.
Why was this not detected by the tests?:
Because we weren't testing for this. Currently the tests submit jobs, then read the k8s object and inspect it. This approach is not reliable: the k8s object eventually reflects the correct state, but it can report incorrect information when it is inspected soon after the job is created.
Solution
- revert or improve patch https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/868076
- fix the tests so that the submitted jobs log to stdout, and verify that each job was successful by reading its stdout log, rather than only inspecting the k8s object as the tests do now.
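
The log-based check proposed above could look roughly like the following. This is a sketch under stated assumptions, not the project's actual test code: the helper name `wait_for_log_line`, the log file name `job1.out`, and the timeout values are hypothetical. It polls the log file because the file may not exist yet immediately after the job is submitted.

```python
import tempfile
import time
from pathlib import Path


def wait_for_log_line(
    log_path: Path,
    expected: str,
    timeout: float = 30.0,
    interval: float = 0.5,
) -> bool:
    """Return True once `expected` appears in the log file, False if the timeout expires first.

    Hypothetical test helper; the name and defaults are assumptions.
    """
    deadline = time.monotonic() + timeout
    while True:
        # The file may not exist yet right after the job is submitted.
        if log_path.exists() and expected in log_path.read_text():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Self-contained demonstration: a temporary file stands in for the job's .out log.
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "job1.out"  # assumed log file name
        log.write_text("delaying by 1hr\n")
        print(wait_for_log_line(log, "delaying by 1hr", timeout=1.0))
```

A test for job1.sh would then assert that `wait_for_log_line` returns True for the string "delaying by 1hr". A job whose script never ran would fail this check even if the k8s object reports "Completed".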