In T379039#10294894, @bd808 wrote:
> In T379039#10294507, @Pppery wrote:
>> https://gitlab.wikimedia.org/rlopez-wmf still shows as blocked. I wonder if the account approval bot is down.
> (I was waiting for the account to actually be approved before closing)
```
$ become gitlab-account-approval
$ toolforge jobs list
+-----------+-----------------------+------------------------------------------+
| Job name: | Job type:             | Status:                                  |
+-----------+-----------------------+------------------------------------------+
| approve   | schedule: */3 * * * * | Running for 1d6h34m                      |
| logrotate | schedule: @daily      | Last schedule time: 2024-11-05T17:12:00Z |
+-----------+-----------------------+------------------------------------------+
```

The job got stuck. I killed the stuck job, and the next run caught up.
Description
Related Objects
Mentioned In:
- T379139: [infra,k8s] node tools-k8s-worker-nfs-24 stopped reporting processes in D state
- T377781: [jobs-api,jobs-cli] Add support for replacing a running scheduled job when an overlapping schedule fires (`concurrencyPolicy: Replace`)
- T379132: chie-bot: Jobs hang on toolforge

Mentioned Here:
- T306391: [jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime
- T377781: [jobs-api,jobs-cli] Add support for replacing a running scheduled job when an overlapping schedule fires (`concurrencyPolicy: Replace`)
- T379039: Requesting GitLab account activation for [Ramon Lopez]
Event Timeline
Comment Actions
T377781: [jobs-api,jobs-cli] Add support for replacing a running scheduled job when an overlapping schedule fires (`concurrencyPolicy: Replace`) would be a potential solution for this situation, but there are other things that can be done without platform support for timeouts or replacement.
Comment Actions
> I killed the stuck job
Can you elaborate a bit on how you killed the stuck job?
(that might help pinpoint the underlying issue, and help trim down the list of potential solutions)
Comment Actions
:/ Adding a trivial livenessProbe like 'echo "I'm alive"' does not help: the container is actually able to execute that command without issues even while NFS is stuck. For the health probe to be effective, it has to be a bit smarter than that and actually test NFS, or whatever else might be making the process stuck.
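For reference, a trivial exec probe of that kind maps to something like the following Kubernetes container fragment (illustrative only; the probe command and timings are assumptions, not the tool's actual configuration):

```yaml
# A trivial exec liveness probe: it succeeds as long as the container can
# fork a new process, so it does NOT detect an existing process blocked in
# uninterruptible sleep (D state) on a stuck NFS mount.
livenessProbe:
  exec:
    command: ["sh", "-c", "echo 'I am alive'"]
  periodSeconds: 30      # illustrative values
  failureThreshold: 3
```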
Let me try the replace strategy and wrapping the run with timeout, and see if either is able to kill the stuck pod (I'm guessing that timeout will not work).
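The timeout wrapper idea, sketched with GNU coreutils `timeout` (which exits with status 124 when it kills the command):

```shell
# GNU coreutils `timeout`: send SIGTERM after N seconds and report exit
# status 124. This works for ordinary processes, but signals cannot
# interrupt a process blocked in uninterruptible sleep (D state), which
# is exactly the NFS-stuck case discussed here.
timeout 1 sleep 3
echo "exit status: $?"   # 124 when the command timed out
```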
Comment Actions
Results are:
- concurrencyPolicy: Replace seems to be the only option able to get NFS-stuck jobs to retrigger (potentially on a different worker).
- Wrapping your command with timeout <N> <command> does not work, as timeout is unable to kill/stop processes in D state (it is also tricky to put it in the right place in the script).
- Adding a liveness probe gives mixed results:
  - A simple liveness probe (like trying to run something in the pod) will not catch the NFS issue.
  - Something a bit more complex, like reading the logfile, does not catch it either (reading works; the issue seems to be when writing).
  - You'd need something along the lines of checking the age of the logfile or its last entry, or having the process create a "heartbeat" file whose existence the liveness probe checks before deleting it, or similar.
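A heartbeat-style check could be sketched like this (a hypothetical sketch, not the tool's real setup; the heartbeat path is an assumption). The job touches the file after each unit of work, and the probe fails when the file is stale, which also catches a process wedged in D state since it stops touching the file:

```shell
#!/bin/sh
# Hypothetical heartbeat liveness check. HEARTBEAT path is a placeholder.
HEARTBEAT="${HEARTBEAT:-/tmp/approve.heartbeat}"

# In the job's main loop, after each processed item:
touch "$HEARTBEAT"

# As the liveness probe command: succeed only if the heartbeat file
# exists and is newer than 5 minutes.
if [ -n "$(find "$HEARTBEAT" -mmin -5 2>/dev/null)" ]; then
    echo "heartbeat fresh"
else
    echo "heartbeat stale or missing" >&2
    exit 1
fi
```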
So yep, besides preventing this from happening in the first place (the NFS stuff, a long-running issue), I think that the change in concurrencyPolicy will help people get unstuck and be able to run their jobs again, with the caveat that if their job is meant to run for longer than one schedule period (as some are), it will never finish.
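The Replace behavior corresponds to the Kubernetes CronJob field of the same name; roughly (an illustrative fragment, not the actual Toolforge-generated manifest, with placeholder name/image/command):

```yaml
# Kubernetes CronJob fragment. With concurrencyPolicy: Replace, when the
# next schedule fires while a previous run is still active, k8s deletes
# the old job's pods and starts a fresh one (possibly on another node).
# The default is Allow; Forbid would skip the new run instead.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: approve                  # hypothetical name
spec:
  schedule: "*/3 * * * *"
  concurrencyPolicy: Replace
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: approve
              image: example/image   # placeholder
              command: ["./run.sh"]  # placeholder
```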
Comment Actions
I thought about `kubectl delete pod $POD` but went with `toolforge jobs delete approve; toolforge jobs load jobs.yaml`.
The job is configured to log to disk, and there had been no activity in the log since 2024-11-04T17:34:04Z. Log snippet for timestamps:
```
2024-11-04T17:33:56Z glaab.utils INFO: Checking mgagat
2024-11-04T17:34:02Z glaab.utils INFO: Checking okeamah
2024-11-04T17:34:03Z glaab.utils INFO: Checking xuhao61
2024-11-04T17:34:04Z glaab.utils INFO: Checking cybel
2024-11-06T00:18:08Z glaab.utils INFO: Checking edriiic
2024-11-06T00:18:09Z glaab.utils INFO: Checking geppy
2024-11-06T00:18:10Z glaab.utils INFO: Checking funa-enpitu
2024-11-06T00:18:10Z glaab.utils INFO: Checking nfontes
```
I did not attempt to capture where the job was running when it got stuck unfortunately, so there probably is not a lot to learn here other than that the issue happened.
Comment Actions
> I did not attempt to capture where the job was running when it got stuck unfortunately, so there probably is not a lot to learn here other than that the issue happened.
Thanks for elaborating. This does sound like being stuck on NFS, yep. A simple kill (e.g. via a timeout wrapper) would probably not have helped; it would have to be handled by k8s moving the pod to a different node. That makes me wonder: do liveness probe failures move the pod to a different node?
Comment Actions
Yep, NFS/the kernel/not sure what has changed its behavior since the first issues, and now k8s is able to delete the pods that are stuck by it and start them elsewhere. I'm reprioritizing this; I will probably try to get one of the options done next round.
Comment Actions
dcaro triaged this task as High priority.
Setting this to the normal "medium" priority. Note this task is specific to the Tool-gitlab-account-approval tool and not a general "make things better in Toolforge" feature request/bug report.
Comment Actions
> Setting this to the normal "medium" priority. Note this task is specific to the Tool-gitlab-account-approval tool and not a general "make things better in Toolforge" feature request/bug report.
oh, yep, sorry about that
Comment Actions
This seems to have been fixed by T306391: [jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime and timeout: 150 in the job specification.
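In jobs.yaml terms, the fix looks roughly like this (a sketch; name, command, and image are placeholders, and the unit of `timeout` is assumed to be seconds based on the 3-minute schedule):

```yaml
# Hypothetical Toolforge jobs.yaml entry using the maximum-runtime
# feature from T306391: kill the run if it exceeds the timeout, so a
# stuck run no longer blocks subsequent schedules.
- name: approve                # placeholder name
  command: ./approve.sh        # placeholder command
  image: python3.11            # placeholder image
  schedule: "*/3 * * * *"
  timeout: 150                 # presumably seconds; below the 180s period
```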