I was asked to document the issue here by Arturo on Telegram.
The problems started after a very long-running swepub extraction job got killed.
List of steps to reproduce (step by step, including full links if applicable):
- log in to the itemsubjector tool
- try starting any job in a tf-python39 container
What happens?:
- the job immediately gets the status Failed in the output of $toolforge-jobs list
About 24 hours after the problem started it went away, and on Jan 12 at 10:00 UTC I started 2 jobs successfully that are still running as I write this:
tools.itemsubjector@tools-sgebastion-11:~$ toolforge-jobs list
Job name:        Command:                                                                           Job type:  Container:   File log:  Emails:  Resources:  Status:
job77            ~/setup.sh && python3 ~/itemsubjector/itemsubjector.py -r                          normal     tf-python39  yes        none     default     Running
swepub-url-job5  ~/setup-swepub.sh && python3 ~/WikidataMLSuggester/extract-doi-url-from-swepub.py  normal     tf-python39  yes        none     default     Running
What should have happened instead?:
- the jobs should have started, or Kubernetes should at least have given me some output about what went wrong, as usual
$kubectl get events output is here:
tools.itemsubjector@tools-sgebastion-11:~$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
2m36s Normal Scheduled pod/job76-fvdrs Successfully assigned tool-itemsubjector/job76-fvdrs to tools-k8s-worker-77
2m36s Normal Pulling pod/job76-fvdrs Pulling image "docker-registry.tools.wmflabs.org/toolforge-python39-sssd-base:latest"
2m36s Normal Pulled pod/job76-fvdrs Successfully pulled image "docker-registry.tools.wmflabs.org/toolforge-python39-sssd-base:latest" in 99.701796ms
2m36s Normal Created pod/job76-fvdrs Created container job76
2m35s Normal Started pod/job76-fvdrs Started container job76
2m29s Normal Scheduled pod/job76-xqgpt Successfully assigned tool-itemsubjector/job76-xqgpt to tools-k8s-worker-77
2m29s Normal Pulling pod/job76-xqgpt Pulling image "docker-registry.tools.wmflabs.org/toolforge-python39-sssd-base:latest"
2m29s Normal Pulled pod/job76-xqgpt Successfully pulled image "docker-registry.tools.wmflabs.org/toolforge-python39-sssd-base:latest" in 89.576101ms
2m29s Normal Created pod/job76-xqgpt Created container job76
2m28s Normal Started pod/job76-xqgpt Started container job76
2m37s Normal SuccessfulCreate job/job76 Created pod: job76-fvdrs
2m30s Normal SuccessfulCreate job/job76 Created pod: job76-xqgpt
2m20s Warning BackoffLimitExceeded job/job76 Job has reached the specified backoff limit
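
For anyone triaging this, here is a minimal Python sketch (my own illustration, not part of Toolforge or kubectl) that filters pasted events output down to the Warning rows. The function name warning_events is hypothetical, and it assumes the whitespace-separated layout shown above, where each LAST SEEN value is a single token. In the output above the only such row is the BackoffLimitExceeded event on job/job76; every pod-level event is Normal.

```python
def warning_events(events_text):
    """Return (object, reason, message) for each Warning row in pasted
    `kubectl get events` output (assumed whitespace-separated columns)."""
    rows = []
    for line in events_text.splitlines():
        # Columns: LAST SEEN, TYPE, REASON, OBJECT, MESSAGE (message may contain spaces)
        parts = line.split(None, 4)
        if len(parts) == 5 and parts[1] == "Warning":
            _, _, reason, obj, message = parts
            rows.append((obj, reason, message))
    return rows


# Two rows copied from the events output in this report, plus the header line.
sample = """LAST SEEN TYPE REASON OBJECT MESSAGE
2m35s Normal Started pod/job76-fvdrs Started container job76
2m20s Warning BackoffLimitExceeded job/job76 Job has reached the specified backoff limit"""

for obj, reason, message in warning_events(sample):
    print(f"{obj}: {reason}: {message}")
```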