
Approval job can get stuck and prevent subsequent jobs from firing
Closed, Resolved · Public · BUG REPORT

Description

https://gitlab.wikimedia.org/rlopez-wmf still shows as blocked. I wonder if the account approval bot is down.

(I was waiting for the account to actually be approved before closing)

$ become gitlab-account-approval
$ toolforge jobs list
+-----------+-----------------------+------------------------------------------+
| Job name: |       Job type:       |                 Status:                  |
+-----------+-----------------------+------------------------------------------+
|  approve  | schedule: */3 * * * * |           Running for 1d6h34m            |
| logrotate |   schedule: @daily    | Last schedule time: 2024-11-05T17:12:00Z |
+-----------+-----------------------+------------------------------------------+

The job got stuck. I killed the stuck job, and the next run caught up.

Event Timeline

T377781: [jobs-api,jobs-cli] Add support for replacing a running scheduled job when an overlapping schedule fires (`concurrencyPolicy: Replace`) would be a potential solution for this situation, but there are other things that can be done without platform support for timeouts or replacement.
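For reference, at the plain-Kubernetes level that request maps to the CronJob `concurrencyPolicy` field. A minimal sketch (this is a raw Kubernetes manifest, not the Toolforge jobs.yaml format; the image and command are placeholders):

```yaml
# With concurrencyPolicy: Replace, a firing schedule deletes the
# still-running job and starts a fresh one, potentially on a
# different worker node -- which is what unsticks NFS-wedged pods.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: approve
spec:
  schedule: "*/3 * * * *"
  concurrencyPolicy: Replace
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: approve
              image: example-image          # placeholder
              command: ["./run-approve.sh"] # hypothetical
          restartPolicy: Never
```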

I killed the stuck job

Can you elaborate a bit on how you killed the stuck job?
(that might help pinpoint the underlying issue, and help trim down the list of potential solutions)

:/, adding a silly livenessProbe like 'echo "I'm alive"' does not help, as the container is actually able to execute that without issues (in this case, where NFS is stuck). That means that for the health probe to be effective, it has to be a bit smarter than that and test NFS, or whatever it might be that makes the process stuck.

Let me try using the replace strategy and wrapping the run with timeout, and see if those are able to kill the stuck pod (I'm guessing that timeout might not work).
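The timeout wrapper being tested would look something like this (the duration and the script name are placeholders, not the tool's real configuration):

```shell
# Wrap the scheduled command with timeout: send SIGTERM after 300s,
# then SIGKILL 30s later if the process is still alive.
# A process blocked in uninterruptible sleep (D state, e.g. stuck on
# NFS I/O) ignores even SIGKILL until the kernel operation returns,
# which is why this approach does not help with NFS-stuck jobs.
timeout --kill-after=30 300 ./approve-run.sh  # ./approve-run.sh is hypothetical
```

When timeout does fire, it exits with status 124, which makes the failure visible in job logs even when the kill itself cannot succeed.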

Results are:

  • concurrencyPolicy: Replace seems to be the only one able to get NFS stuck jobs to retrigger (potentially in a different worker).
  • wrapping your command with timeout <N> <command> does not, as timeout is unable to kill/stop processes in D state (it's also tricky to put it in the right place in the script)
  • adding a liveness probe gives mixed results
    • A simple liveness probe (like trying to run something in the pod) will not catch the NFS issue
    • Something a bit more complex like reading the logfile does not catch the NFS issue (as it can read it, the issue seems to be when writing)
    • You'd need something along the lines of checking the age of the logfile, or the last entry, or having the process create a "heartbeat" file and checking its existence + deleting it from the liveness probe, or similar

So yep, besides preventing this from happening in the first place (the NFS stuff, a long-running issue), I think that the change in concurrencyPolicy will help people get unstuck and be able to run their jobs sometimes, with the caveat that if their job is meant to run for longer than one schedule period (as some are), it will never finish.

I killed the stuck job

Can you elaborate a bit on how you killed the stuck job?
(that might help pinpoint the underlying issue, and help trim down the list of potential solutions)

I thought about kubectl delete pod $POD but went with toolforge jobs delete approve; toolforge jobs load jobs.yaml.

The job is configured to log to disk, and the log showed no activity since 2024-11-04T17:34:04Z. Log snippet for timestamps:

2024-11-04T17:33:56Z glaab.utils INFO: Checking mgagat
2024-11-04T17:34:02Z glaab.utils INFO: Checking okeamah
2024-11-04T17:34:03Z glaab.utils INFO: Checking xuhao61
2024-11-04T17:34:04Z glaab.utils INFO: Checking cybel
2024-11-06T00:18:08Z glaab.utils INFO: Checking edriiic
2024-11-06T00:18:09Z glaab.utils INFO: Checking geppy
2024-11-06T00:18:10Z glaab.utils INFO: Checking funa-enpitu
2024-11-06T00:18:10Z glaab.utils INFO: Checking nfontes

I did not attempt to capture where the job was running when it got stuck, unfortunately, so there is probably not a lot to learn here other than that the issue happened.


Thanks for elaborating. This sounds like being stuck on NFS, yep; a simple kill would probably not have helped (e.g. via a timeout wrapper), it would have to be handled by k8s moving the pod to a different node. That makes me wonder: do liveness probe failures move the pod to a different node?

dcaro triaged this task as High priority. Nov 7 2024, 4:45 PM

This has all happened before: T306391#9436882

Yep, NFS/the kernel/not sure what has changed its behavior since the first issues, and now k8s is unable to delete the pods that are stuck by it and start them in another place. I'm reprioritizing this; I will probably try to get one of the options done next round.

That makes me wonder, do liveness probe failures move the pod to a different node?

Nope :/, so livenessProbes would not help NFS-stuck jobs get unstuck.

bd808 lowered the priority of this task from High to Medium. Nov 7 2024, 5:38 PM


Setting this to the normal "medium" priority. Note this task is specific to the Tool-gitlab-account-approval tool and not a general "make things better in Toolforge" feature request/bug report.


oh, yep, sorry about that

bd808 claimed this task.

This seems to have been fixed by T306391: [jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime and timeout: 150 in the job specification.
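The `timeout: 150` setting lives in the jobs.yaml that gets loaded with `toolforge jobs load`. A hypothetical sketch, where `timeout: 150` and the schedule come from this task and everything else (command, image) is a placeholder:

```yaml
# jobs.yaml sketch -- not the tool's real configuration.
- name: approve
  command: ./run-approve.sh  # hypothetical
  image: python3.11          # placeholder
  schedule: "*/3 * * * *"
  timeout: 150               # maximum runtime, per T306391
```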