Add --timeout to toolforge jobs
Closed, Duplicate · Public · Feature

Description

See T376099: --timeout flag for mwscript-k8s for a similar feature request in a different Kubernetes environment. There, a timeout option was implemented using activeDeadlineSeconds in Kubernetes. That option is currently missing from Toolforge job submission.
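
For readers unfamiliar with the field, here is a minimal sketch of how activeDeadlineSeconds appears on a plain Kubernetes Job. The manifest is written from the shell so it can be inspected before use. The job name, image, command, and file name are all hypothetical, and this is not a description of how mwscript-k8s or Toolforge generate their jobs.

```shell
# Write a hypothetical Job manifest; the name, image, and command are placeholders.
cat > timeout-demo-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: timeout-demo
spec:
  # Once the Job has been active this many seconds, Kubernetes terminates
  # its pods and marks the Job as failed.
  activeDeadlineSeconds: 3600
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "sleep 7200"]
EOF

# On a cluster where you may create Jobs directly (an assumption; this is
# not the Toolforge workflow), the manifest could be applied with:
# kubectl apply -f timeout-demo-job.yaml
```

The deadline is enforced by Kubernetes itself rather than by a process inside the container, which is the main difference from wrapping the command in timeout.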

I used this a lot on the grid to have stuck jobs killed automatically. Having this option in job submission would make that possible again.

Event Timeline

Restricted Application removed a subscriber: taavi.

As a workaround for now, this can be done by wrapping your job command in the timeout command, like:

toolforge jobs run --schedule '00 * * * *' --command 'timeout 1h mycommand' --image python3.11 myjob

Or in a Procfile for a build-service image:

## Procfile
dojob: timeout 1h mycommand

toolforge jobs run --schedule '00 * * * *' --image tool-mytool/tool-mytool:latest --command dojob myjob

Full help:

dcaro@urcuchillay$ timeout --help
Usage: timeout [OPTION] DURATION COMMAND [ARG]...
  or:  timeout [OPTION]
Start COMMAND, and kill it if still running after DURATION.

Mandatory arguments to long options are mandatory for short options too.
      --preserve-status
                 exit with the same status as COMMAND, even when the
                   command times out
      --foreground
                 when not running timeout directly from a shell prompt,
                   allow COMMAND to read from the TTY and get TTY signals;
                   in this mode, children of COMMAND will not be timed out
  -k, --kill-after=DURATION
                 also send a KILL signal if COMMAND is still running
                   this long after the initial signal was sent
  -s, --signal=SIGNAL
                 specify the signal to be sent on timeout;
                   SIGNAL may be a name like 'HUP' or a number;
                   see 'kill -l' for a list of signals
  -v, --verbose  diagnose to stderr any signal sent upon timeout
      --help        display this help and exit
      --version     output version information and exit

DURATION is a floating point number with an optional suffix:
's' for seconds (the default), 'm' for minutes, 'h' for hours or 'd' for days.
A duration of 0 disables the associated timeout.

Upon timeout, send the TERM signal to COMMAND, if no other SIGNAL specified.
The TERM signal kills any process that does not block or catch that signal.
It may be necessary to use the KILL signal, since this signal can't be caught.

Exit status:
  124  if COMMAND times out, and --preserve-status is not specified
  125  if the timeout command itself fails
  126  if COMMAND is found but cannot be invoked
  127  if COMMAND cannot be found
  137  if COMMAND (or timeout itself) is sent the KILL (9) signal (128+9)
  -    the exit status of COMMAND otherwise

GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Full documentation <https://www.gnu.org/software/coreutils/timeout>
or available locally via: info '(coreutils) timeout invocation'
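
As the help text notes, TERM can be blocked or caught, so a stuck command may survive the initial signal. A minimal sketch of the --kill-after option, using a stand-in command that ignores TERM (the stand-in and the short durations are assumptions chosen for a quick local test, not a real job):

```shell
# Stand-in for a job that ignores TERM (hypothetical, for demonstration only).
stubborn='trap "" TERM; sleep 10'

start=$(date +%s)
# Send TERM after 1s, then KILL if the command is still running 1s later.
timeout --kill-after=1s 1s sh -c "$stubborn"
rc=$?
elapsed=$(( $(date +%s) - start ))

# With GNU coreutils the status is expected to be non-zero (the help above
# lists 137 for the KILL case), and the run should end well before 10s.
echo "exit status: $rc, elapsed: ${elapsed}s"
```

In a job command the same idea would look like 'timeout --kill-after=30s 1h mycommand', where the 30s grace period is an arbitrary choice.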

Example of a failed run:

tools.automated-toolforge-tests@tools-bastion-13:~$ toolforge jobs show myjob
+---------------+------------------------------------------------------------------+
| Job name:     | myjob                                                            |
+---------------+------------------------------------------------------------------+
| Command:      | timeout 1s sleep 10s                                             |
+---------------+------------------------------------------------------------------+
| Job type:     | schedule: * * * * *                                              |
+---------------+------------------------------------------------------------------+
| Image:        | python3.11                                                       |
+---------------+------------------------------------------------------------------+
| Port:         | none                                                             |
+---------------+------------------------------------------------------------------+
| File log:     | yes                                                              |
+---------------+------------------------------------------------------------------+
| Output log:   | /data/project/automated-toolforge-tests/myjob.out                |
+---------------+------------------------------------------------------------------+
| Error log:    | /data/project/automated-toolforge-tests/myjob.err                |
+---------------+------------------------------------------------------------------+
| Emails:       | none                                                             |
+---------------+------------------------------------------------------------------+
| Resources:    | default                                                          |
+---------------+------------------------------------------------------------------+
| Replicas:     | 1                                                                |
+---------------+------------------------------------------------------------------+
| Mounts:       | all                                                              |
+---------------+------------------------------------------------------------------+
| Retry:        | no                                                               |
+---------------+------------------------------------------------------------------+
| Health check: | none                                                             |
+---------------+------------------------------------------------------------------+
| Status:       | Failed                                                           |
+---------------+------------------------------------------------------------------+
| Hints:        | Last run at 2024-10-28T09:24:53Z. Pod in 'Failed' phase. State   |
|               | 'terminated'. Reason 'Error'. Started at '2024-10-28T09:24:54Z'. |
|               | Finished at '2024-10-28T09:24:55Z'. Exit code '124'.             |
+---------------+------------------------------------------------------------------+
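
In the output above, the exit code '124' in the hints is the only sign that the job was stopped by timeout rather than failing on its own. A small wrapper could make that explicit in the error log. This is only a sketch; the function name run_with_timeout is hypothetical, and it assumes GNU timeout as shown in the help output.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command under timeout and report a timeout on stderr.
run_with_timeout() {
    limit=$1
    shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        # 124 is the status timeout uses when the command timed out.
        echo "timed out after $limit: $*" >&2
    fi
    return "$rc"
}

# Same command as in the failed run above.
run_with_timeout 1s sleep 10s
echo "exit status: $?"
```

Because the wrapper returns the original status, the job is still reported as failed; the extra line simply lands in the job's error log.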

dcaro triaged this task as Low priority. Oct 29 2024, 4:03 PM
dcaro moved this task from Backlog to Ready to be worked on on the Toolforge board.

Note that this solution would not help with jobs that get stuck due to NFS misbehaving (so far all the instances I've seen), as those jobs are considered 'active' by k8s.

Wait, no, ignore that comment: activeDeadlineSeconds would actually work as expected :). I got confused with a liveness probe.