All started jobs failed on Kubernetes during 24h with no visible error or output
Closed, Resolved | Public | BUG REPORT

Description

arturo asked me on Telegram to document the issue here.

The problems started after a very long-running swepub extraction job got killed.

bild.png (495×732 px, 235 KB)

List of steps to reproduce (step by step, including full links if applicable):

  • log in to the itemsubjector tool
  • try starting any job in a tf-python39 container

What happens?:

  • the job immediately gets the status Failed in the output of $ toolforge-jobs list

About 24 hours after the problem started, it went away. On Jan 12 at 10:00 UTC I started 2 jobs successfully, and they are still running as I write this:

tools.itemsubjector@tools-sgebastion-11:~$ toolforge-jobs list
Job name:        Command:                                                                           Job type:  Container:   File log:  Emails:  Resources:  Status:
job77            ~/setup.sh && python3 ~/itemsubjector/itemsubjector.py -r                          normal     tf-python39  yes        none     default     Running
swepub-url-job5  ~/setup-swepub.sh && python3 ~/WikidataMLSuggester/extract-doi-url-from-swepub.py  normal     tf-python39  yes        none     default     Running

What should have happened instead?:

  • the jobs should have started, or Kubernetes should at least have given me some output about what went wrong, as it usually does

The output of $ kubectl get events is here:

tools.itemsubjector@tools-sgebastion-11:~$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
2m36s Normal Scheduled pod/job76-fvdrs Successfully assigned tool-itemsubjector/job76-fvdrs to tools-k8s-worker-77
2m36s Normal Pulling pod/job76-fvdrs Pulling image "docker-registry.tools.wmflabs.org/toolforge-python39-sssd-base:latest"
2m36s Normal Pulled pod/job76-fvdrs Successfully pulled image "docker-registry.tools.wmflabs.org/toolforge-python39-sssd-base:latest" in 99.701796ms
2m36s Normal Created pod/job76-fvdrs Created container job76
2m35s Normal Started pod/job76-fvdrs Started container job76
2m29s Normal Scheduled pod/job76-xqgpt Successfully assigned tool-itemsubjector/job76-xqgpt to tools-k8s-worker-77
2m29s Normal Pulling pod/job76-xqgpt Pulling image "docker-registry.tools.wmflabs.org/toolforge-python39-sssd-base:latest"
2m29s Normal Pulled pod/job76-xqgpt Successfully pulled image "docker-registry.tools.wmflabs.org/toolforge-python39-sssd-base:latest" in 89.576101ms
2m29s Normal Created pod/job76-xqgpt Created container job76
2m28s Normal Started pod/job76-xqgpt Started container job76
2m37s Normal SuccessfulCreate job/job76 Created pod: job76-fvdrs
2m30s Normal SuccessfulCreate job/job76 Created pod: job76-xqgpt
2m20s Warning BackoffLimitExceeded job/job76 Job has reached the specified backoff limit
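
Almost every event above is of type Normal; the only hint of failure is the final Warning line. As a minimal sketch, the warnings can be picked out of a pasted or piped event listing as shown below. The function name warnings_only is hypothetical, and the filter assumes the default column layout shown above, where the second whitespace-separated field is TYPE.

```shell
# Print only the Warning rows of `kubectl get events` output read from stdin.
# Assumes the default layout: LAST SEEN, TYPE, REASON, OBJECT, MESSAGE.
# (warnings_only is a hypothetical helper name, not part of kubectl.)
warnings_only() {
    awk '$2 == "Warning"'
}

# Example using three lines copied from the listing above.
warnings_only <<'EOF'
LAST SEEN TYPE REASON OBJECT MESSAGE
2m28s Normal Started pod/job76-xqgpt Started container job76
2m20s Warning BackoffLimitExceeded job/job76 Job has reached the specified backoff limit
EOF
```

On the bastion the same filter could be fed live data with kubectl get events | warnings_only. A BackoffLimitExceeded warning only says that the job's pods kept failing, not why, so the job's own log files are still the place to look for the cause.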

Event Timeline

could you please paste here the concrete toolforge-jobs command line you are using to create the job?

I run a wrapper script that creates a new job and lets me inspect the logs immediately afterwards.
The script is here: https://github.com/dpriskorn/ItemSubjector/blob/master/create_kubernettes_job_and_watch_the_log.sh. It first invokes a setup script (which sets up the Python environment correctly), then the itemsubjector script itself, and then shows the logs using watch.

All the code I run is in there, so you can clone and test it yourself (see https://github.com/dpriskorn/ItemSubjector/blob/master/Kubernetes_HOWTO.md for how I set it up).

As a first test, I would try wrapping the whole command:

From: --command "~/setup.sh && python3 ~/itemsubjector/itemsubjector.py -r"
To: --command "~/mycommand.sh", where mycommand.sh contains: ~/setup.sh && python3 ~/itemsubjector/itemsubjector.py -r

If you don't want to create the wrapper, try something like:

--command "/bin/sh -c -- '~/setup.sh && python3 ~/itemsubjector/itemsubjector.py -r'"

Let's see if that makes any difference.
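
For illustration, the wrapper suggested above could be created roughly like this. This is only a sketch: the helper name write_wrapper is hypothetical, and the use of set -e instead of && is a choice made here, not something taken from the tool's repository.

```shell
# Write a small wrapper script to the given path and make it executable.
# (write_wrapper is a hypothetical helper name used for this sketch.)
write_wrapper() {
    target="$1"
    # The quoted 'EOF' keeps $HOME literal, so it is expanded when the
    # wrapper runs inside the job container, not when the file is written.
    cat > "$target" <<'EOF'
#!/bin/sh
# Stop at the first failing step, like `~/setup.sh && python3 ...` would.
set -e
"$HOME/setup.sh"
python3 "$HOME/itemsubjector/itemsubjector.py" -r
EOF
    chmod +x "$target"
}
```

Running write_wrapper "$HOME/mycommand.sh" once on the bastion would create the file, after which the job can be started with --command "~/mycommand.sh" as described above.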

aborrero triaged this task as Medium priority. (Jan 13 2022, 11:25 AM)
So9q claimed this task.

It now happens again. I tested your command and got this output:

tools.itemsubjector@tools-sgebastion-11:~/itemsubjector$ less ../job116.*
  Running command git clone -q git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-e8o5ih20/wikibaseintegrator_728b3c0d1e3b474b9f15e676bf978aca
  fatal: remote error:
    The unauthenticated git protocol on port 9418 is no longer supported.
  Please see https://github.blog/2021-09-01-improving-git-protocol-security-github/ for more information.
WARNING: Discarding git+git://github.com/LeMyst/WikibaseIntegrator@v0.12.0.dev5#egg=wikibaseintegrator. Command errored out with exit status 128: git clone -q git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-e8o5ih20/wikibaseintegrator_728b3c0d1e3b474b9f15e676bf978aca Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement wikibaseintegrator (unavailable)
ERROR: No matching distribution found for wikibaseintegrator (unavailable)
  Running command git clone -q git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-dctoaxdd/wikibaseintegrator_82363d8264d0473388254ca8bf6399e6
  fatal: remote error:
    The unauthenticated git protocol on port 9418 is no longer supported.
  Please see https://github.blog/2021-09-01-improving-git-protocol-security-github/ for more information.
WARNING: Discarding git+git://github.com/LeMyst/WikibaseIntegrator@v0.12.0.dev5#egg=wikibaseintegrator. Command errored out with exit status 128: git clone -q git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-dctoaxdd/wikibaseintegrator_82363d8264d0473388254ca8bf6399e6 Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement wikibaseintegrator (unavailable)
ERROR: No matching distribution found for wikibaseintegrator (unavailable)

Now I know exactly what the bug is this time. Thanks!
I'm closing this as resolved as I cannot reproduce the error above.
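
For anyone hitting the same log output: GitHub no longer serves the unauthenticated git:// protocol, while pip also accepts git+https:// URLs. A minimal sketch of the rewrite follows. The function name use_https_for_github is hypothetical, and where the requirement actually lives (a requirements file, the setup script, or elsewhere) is not stated in this task, so the example only transforms text on stdin.

```shell
# Rewrite pip requirement lines on stdin so that GitHub URLs using the
# unauthenticated git:// protocol use HTTPS instead.
# (use_https_for_github is a hypothetical helper name.)
use_https_for_github() {
    sed 's#git+git://github\.com/#git+https://github.com/#g'
}

# Example with the requirement taken from the log above.
printf '%s\n' 'git+git://github.com/LeMyst/WikibaseIntegrator@v0.12.0.dev5#egg=wikibaseintegrator' \
    | use_https_for_github
```

The example prints the same requirement with git+https://github.com/ as its prefix; everything after the host name, including the version tag and the #egg= fragment, is left untouched.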