chie-bot: Jobs hang on toolforge
Closed, Resolved · Public · BUG REPORT

Description

My jobs have started to hang occasionally on Toolforge. I've created an empty test job to see whether it's my tool's fault, and it looks like a general Toolforge issue.

My tool's name: chie-bot. I've created a test job job-test with the following configuration:

- name: job-test
  command: ./job-test
  image: bookworm
  schedule: "*/5 * * * *"

the contents of ./job-test:

date

So the only thing the job does is output the current datetime and terminate. This worked fine until Mon Nov 4 05:30:17 PM UTC 2024. Now toolforge jobs list is telling me this:

+-----------------------+------------------------+------------------------------------------+
|       Job name:       |       Job type:        |                 Status:                  |
+-----------------------+------------------------+------------------------------------------+
|       job-test        | schedule: */5 * * * *  |            Running for 1d8h2m            |
+-----------------------+------------------------+------------------------------------------+

The job-test.out file contains timestamps as expected up until 05:30:17 PM. job-test.err is empty. The Running for 1d8h2m status suggests that the last run never got as far as outputting the date.
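The check described above (timestamps every five minutes until they stop) can be sketched in Python. This is only an illustration, not part of the original report: the function names are hypothetical, and the line format is assumed to match the timestamp quoted above (`Mon Nov 4 05:30:17 PM UTC 2024`), so a different locale would need a different format string.

```python
from datetime import datetime, timedelta


def parse_date_line(line):
    """Parse one line of `date` output, ignoring the timezone token.

    Assumed format (taken from the timestamp quoted in the report):
        Mon Nov  4 05:30:17 PM UTC 2024
    """
    tokens = line.split()
    if len(tokens) != 7:
        raise ValueError(f"unexpected line: {line!r}")
    # Drop the timezone name (6th token) so strptime does not have to parse it.
    text = " ".join(tokens[:5] + tokens[6:])
    return datetime.strptime(text, "%a %b %d %I:%M:%S %p %Y")


def find_gaps(lines, expected=timedelta(minutes=5), slack=timedelta(minutes=1)):
    """Return (previous, current) pairs spaced further apart than expected + slack."""
    stamps = [parse_date_line(line) for line in lines if line.strip()]
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > expected + slack]


def time_since_last(lines, now):
    """Return how long ago the last timestamp was written, relative to `now`."""
    stamps = [parse_date_line(line) for line in lines if line.strip()]
    return now - stamps[-1]
```

Passing the lines of job-test.out to `find_gaps` would list any skipped runs, while `time_since_last` with the current UTC time covers the case in this report, where the output simply stops.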

T377420 is probably related

Event Timeline

Restricted Application added a subscriber: Aklapper.

This also coincides with the upgrade to 1.28. The job is still stuck, which will help us troubleshoot the root issue. Looking.

Looking a bit, the job is running on tools-k8s-worker-nfs-24, which does not seem to be reporting stuck processes:

image.png (343×842 px, 56 KB)

But I'm timing out trying to ssh to it, so it is definitely having issues.

According to k8s everything is OK, though. Let me play a bit with the healthchecks to see if that would help in this case.

Hmm... I do see a bunch of stuck processes (in D status), so there might be a reporting issue too :/
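For reference, processes in D status (uninterruptible sleep) can be listed on a Linux host by reading the state field of `/proc/<pid>/stat`. The sketch below shows that general technique only. It is not the tooling used in this investigation, the function names are hypothetical, and it would have to run on the affected worker itself to see that node's processes.

```python
import os


def parse_state(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_state(f.read())
        except OSError:
            continue  # process exited in the meantime, or is not readable
        if state == "D":
            found.append((pid, comm))
    return found
```

A process can pass through state D briefly during normal I/O, so a single snapshot is only a hint. Processes that stay in the list across repeated calls are the ones worth looking at.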

this started happening 3-4 weeks ago, if that helps

@Leloiandudu can you check that this is fixed for you?
This should have been temporarily fixed: the worker that was having issues was restarted, so the jobs got unblocked. A more long-term solution is still needed, but that can be handled in one of the specific tasks for it.

I'm going to continue monitoring, thank you

dcaro triaged this task as High priority. · Nov 7 2024, 4:42 PM
dcaro moved this task from Backlog to Ready to be worked on on the Toolforge board.

I'm removing the cloud/toolforge tags, as this task is mostly for you, @Leloiandudu, to track your project (there's no chie-bot Phabricator project, otherwise I would have added the task to it).
Added the tasks that would solve this permanently as subtasks.
Feel free to close this one (or keep it open for you to track). Will be keeping track of the features in the subtasks. Thanks for the report! (and the reproducer, really useful when debugging).

dcaro renamed this task from Jobs hang on toolforge to chie-bot: Jobs hang on toolforge. · Nov 7 2024, 5:21 PM

I haven't seen any hanging jobs since November. We can consider this fixed.

Leloiandudu claimed this task.