"Keep this job running until the job decides to exit cleanly"
As a tool maintainer
I want to run a long-running task that may need to be restarted 6 or more times before completing successfully
So I can implement a robust solution for a bot
Today, toolforge jobs run supports three major patterns for dealing with task failures:
- --continuous marks the task as always requiring a restart on termination of the PID1 process in the container.
- --retry {0,1,2,3,4,5} marks the task as requiring a restart on termination of the PID1 process in the container if and only if the exit status of the process is non-zero (indicating an error) and there have been fewer than N prior restarts. N here has been deliberately constrained by the toolforge jobs run argument parser to be one of six possible values.
- If neither --continuous nor --retry N have been provided, never restart the task no matter the PID1 exit status.
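The three patterns above can be read as one decision function. The following is only a sketch of the behavior as described in this task, not code from the jobs framework; the function name should_restart and its parameters are hypothetical.

```python
def should_restart(exit_status, prior_restarts, continuous=False, retry=None):
    """Decide whether a task is restarted after its PID1 process terminates.

    Hypothetical model of the three patterns described in this task:
    continuous -> always restart
    retry=N    -> restart only on a non-zero exit with fewer than N prior restarts
    neither    -> never restart
    """
    if continuous:
        return True
    if retry is not None:
        return exit_status != 0 and prior_restarts < retry
    return False
```

Under this model, should_restart(1, 5, retry=5) is False, which is the limit the user story runs into when a task needs 6 or more restarts.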
The new feature request could be seen as a variation of either the --continuous or --retry N patterns. Sometimes a task should ideally be restarted any time it exits with a non-zero status, but stay terminated when it exits with status zero. The bot-wrapper.sh script from @Anomie's AnomieBOT is an example of this pattern being implemented in a shell script. The logic there is functionally "until X returns 0, run X".
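The "until X returns 0, run X" logic can be sketched in a few lines. This is an illustration of the pattern only, not a copy of bot-wrapper.sh; the name run_until_success, the delay_seconds parameter, and the injectable run argument are assumptions made for the example.

```python
import subprocess
import time


def run_until_success(argv, delay_seconds=0.0, run=subprocess.call):
    """Re-run argv until it exits with status 0; return the number of attempts.

    `run` defaults to subprocess.call, which returns the child's exit status.
    It is a parameter here only so the loop can be exercised without spawning
    real processes.
    """
    attempts = 0
    while True:
        attempts += 1
        status = run(argv)
        if status == 0:
            return attempts
        # Any non-zero status (including death by signal) triggers another run.
        time.sleep(delay_seconds)
```

A wrapper like this has no upper bound on restarts, which is what distinguishes the requested behavior from --retry N.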
Possible CLI implementations:
- --retry onfailure
- --retry forever
- --retry MAXINT
- --continuous --finish-ok
- --continuous --finish-on-success
- --continuous --finish-on 0
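To make one of the options above concrete, here is a rough argparse sketch of the "--retry forever" spelling. It is not the actual toolforge jobs run parser: the function names, the program name, and the default of 0 are assumptions for illustration, and any of the other spellings could be modeled the same way.

```python
import argparse


def retry_value(text):
    """Accept an integer from 0 to 5, or the literal 'forever' (hypothetical)."""
    if text == "forever":
        return text
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid retry value: {text!r}")
    if not 0 <= value <= 5:
        raise argparse.ArgumentTypeError("retry must be 0-5 or 'forever'")
    return value


def build_parser():
    # Stand-in parser for illustration only.
    parser = argparse.ArgumentParser(prog="jobs-run-sketch")
    parser.add_argument("--retry", type=retry_value, default=0)
    return parser
```

Extending --retry keeps a single flag and the existing 0-5 values, while the --continuous --finish-* variants would instead need a second flag that is only meaningful in combination with --continuous.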