[jobs-cli,jobs-api] Provide a means to configure a task to be restarted indefinitely upon error, but terminate normally otherwise
Open, Needs Triage, Public, Feature

Description

"Keep this job running until the job decides to exit cleanly"

As a tool maintainer
I want to run a long-running task that may need to be restarted 6 or more times before completing successfully
So I can implement a robust solution for a bot

Today toolforge jobs run supports three major patterns for dealing with task failures:

  • --continuous marks the task as always requiring a restart on termination of the PID1 process in the container.
  • --retry {0,1,2,3,4,5} marks the task as requiring a restart on termination of the PID1 process in the container if and only if the exit status of the process is non-zero (indicating an error) and there have been fewer than N prior restarts. N here has been deliberately constrained by the toolforge jobs run argument parser to be one of 6 possible values.
  • If neither --continuous nor --retry N has been provided, the task is never restarted, no matter the PID1 exit status.
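
The three patterns above can be summarized as a single restart decision. The following is a minimal sketch in Python for illustration only; the function name should_restart and its parameters are hypothetical and are not taken from the jobs-api code, which may implement this quite differently.

```python
from typing import Optional


def should_restart(exit_status: int, prior_restarts: int,
                   continuous: bool = False,
                   retry: Optional[int] = None) -> bool:
    """Return True if the task should be restarted after PID1 exits.

    Hypothetical helper that mirrors the three patterns described above.
    """
    if continuous:
        # --continuous: always restart, whatever the exit status.
        return True
    if retry is not None:
        # --retry N: restart only on a non-zero exit status, and only
        # while fewer than N restarts have already happened.
        return exit_status != 0 and prior_restarts < retry
    # Neither flag given: never restart.
    return False
```

Under these assumptions, the requested feature corresponds to the --retry branch with the "fewer than N" condition removed.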

The new feature request could be seen as a variation of either the --continuous or --retry N patterns. Sometimes a task should ideally be restarted any time it exits with a non-zero status, but also stay terminated in the case of a status zero exit. The bot-wrapper.sh script from @Anomie's AnomieBOT is an example of this pattern being implemented in a shell script. The logic there is functionally "until X returns 0, run X".
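
For illustration, the "until X returns 0, run X" logic could be sketched in Python roughly as follows. This is not a translation of bot-wrapper.sh; the function name run_until_success and the delay_seconds parameter are assumptions made for this example.

```python
import subprocess
import time


def run_until_success(command, delay_seconds=0.0):
    """Re-run `command` until it exits with status 0.

    Returns the number of restarts that were needed. `delay_seconds`
    is a hypothetical pause between attempts, not something the
    request asks for.
    """
    restarts = 0
    while subprocess.run(command).returncode != 0:
        # Non-zero exit status: treat it as an error and try again.
        restarts += 1
        time.sleep(delay_seconds)
    return restarts
```

A caller might use run_until_success(["./bot.sh"]), where ./bot.sh is a placeholder for the tool's own entry point. Note that the loop has no upper bound on restarts, which is exactly the behaviour that --retry N currently rules out.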

Possible CLI implementations:

  • --retry onfailure
  • --retry forever
  • --retry MAXINT
  • --continuous --finish-ok
  • --continuous --finish-on-success
  • --continuous --finish-on 0
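
As a rough idea of what the "--retry forever" option could look like on the argument parser side, here is a sketch using Python's argparse. It is not the actual toolforge jobs run parser: the program name, the helper names, and the choice to leave the default unset are all assumptions made for illustration.

```python
import argparse


def retry_value(text):
    """Parse a --retry value: an integer from 0 to 5, or 'forever'.

    'forever' is one of the spellings proposed above, not an existing
    option.
    """
    if text == "forever":
        return text
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid retry value: {text!r}")
    if not 0 <= value <= 5:
        raise argparse.ArgumentTypeError(
            "retry must be between 0 and 5, or 'forever'"
        )
    return value


def build_parser():
    # Hypothetical stand-in for the real CLI parser.
    parser = argparse.ArgumentParser(prog="jobs-run-sketch")
    parser.add_argument("--retry", type=retry_value, default=None)
    return parser
```

With this sketch, build_parser().parse_args(["--retry", "forever"]).retry yields the string "forever", while the existing integer values continue to parse as integers.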