I maintain the urbanecmbot tool on Toolforge. Occasionally one of my jobs stalls: the pod becomes unresponsive and the job stops doing anything useful. The framework nevertheless still considers the job to be running, so it never gets rescheduled. This continues until I notice (or am notified) that the job is down and restart it manually.
An example from today:
tools.urbanecmbot@tools-bastion-13 ~ $ toolforge-jobs show afd-announcer
+---------------+-----------------------------------------------------------------------------------------+
| Job name:     | afd-announcer                                                                           |
+---------------+-----------------------------------------------------------------------------------------+
| Command:      | ~/bin/oznamovatelbot /data/project/urbanecmbot/11bots/cswiki/userbots/announcers/afd.py |
+---------------+-----------------------------------------------------------------------------------------+
| Job type:     | schedule: */5 * * * *                                                                   |
+---------------+-----------------------------------------------------------------------------------------+
| Image:        | python3.9                                                                               |
+---------------+-----------------------------------------------------------------------------------------+
| Port:         | none                                                                                    |
+---------------+-----------------------------------------------------------------------------------------+
| File log:     | yes                                                                                     |
+---------------+-----------------------------------------------------------------------------------------+
| Output log:   | /data/project/urbanecmbot/afd-announcer.out                                             |
+---------------+-----------------------------------------------------------------------------------------+
| Error log:    | /data/project/urbanecmbot/afd-announcer.err                                             |
+---------------+-----------------------------------------------------------------------------------------+
| Emails:       | onfailure                                                                               |
+---------------+-----------------------------------------------------------------------------------------+
| Resources:    | default                                                                                 |
+---------------+-----------------------------------------------------------------------------------------+
| Replicas:     | 1                                                                                       |
+---------------+-----------------------------------------------------------------------------------------+
| Mounts:       | all                                                                                     |
+---------------+-----------------------------------------------------------------------------------------+
| Retry:        | no                                                                                      |
+---------------+-----------------------------------------------------------------------------------------+
| Health check: | none                                                                                    |
+---------------+-----------------------------------------------------------------------------------------+
| Status:       | Running for 2d6h55m                                                                     |
+---------------+-----------------------------------------------------------------------------------------+
| Hints:        | Last run at 2024-10-15T02:20:06Z. Pod in 'Running' phase. State                         |
|               | 'running'. Started at '2024-10-15T02:20:07Z'.                                           |
+---------------+-----------------------------------------------------------------------------------------+
tools.urbanecmbot@tools-bastion-13 ~ $
I tried to exec into the container with kubectl, but the command hung indefinitely.
Can we add a timeout or another form of health check to the framework?
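For illustration, here is roughly the behaviour I have in mind, written as a job-side wrapper. This is only a sketch under assumptions: `run_with_timeout` is a hypothetical helper of mine, not part of the jobs framework, and the exit code 124 merely mirrors the convention of GNU `timeout`. It relies on Python's standard `subprocess.run`, which kills the child process and raises `TimeoutExpired` once the timeout elapses.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run `command` (a list of arguments) with a hard time limit.

    Returns the command's exit code, or 124 if the limit was exceeded.
    Hypothetical helper for illustration; not part of the jobs framework.
    """
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child before re-raising.
        print(
            f"command exceeded {timeout_seconds}s and was killed: {command}",
            file=sys.stderr,
        )
        return 124
    return completed.returncode
```

A wrapper script could then end with something like `sys.exit(run_with_timeout(["python3", "afd.py"], 240))`, where 240 seconds is an arbitrary value chosen to stay below the 5-minute schedule. I suspect this would not have helped in the case above, though: if the whole pod is unresponsive (as the hanging exec suggests), code running inside the pod cannot be relied upon, which is why I think the timeout or health check needs to live in the framework itself.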