Feature summary:
Add a --retry option to the list of toolforge-jobs run parameters.
Use case(s):
As reported by a number of users and observed by developers, when a job submitted to the Toolforge jobs framework fails, it is retried exactly once before being considered failed.
There are scenarios where a user doesn't want failed jobs to be retried (see T304893), and others where users want one or more retries (see the discussion on this Gerrit patch: https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/820665).
Since Kubernetes already exposes this functionality, implementing it at the wrapper API level and making it available to users should not take much work.
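As a rough sketch of what the wrapper-level change could look like, assuming the retry count is passed through to the `backoffLimit` field of a Kubernetes Job spec. The helper name `build_job_manifest` and the manifest layout are hypothetical and are not taken from the jobs-framework-api code:

```python
def build_job_manifest(name, image, command, retry):
    """Return a minimal Kubernetes Job manifest as a plain dict.

    Hypothetical helper for illustration only. `retry` is the value the
    user would pass via --retry.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # backoffLimit is the Kubernetes Job field that caps how many
            # times a failed job is retried before it is marked as failed.
            "backoffLimit": retry,
            "template": {
                "spec": {
                    # Assumption: failed pods are replaced by the Job
                    # controller rather than restarted in place.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": list(command)}
                    ],
                }
            },
        },
    }


# Example: a job that must never be retried.
manifest = build_job_manifest("myjob", "example-image", ["./run.sh"], retry=0)
```

With `retry=0` the job would be marked as failed after the first failure, which would cover the no-retry scenario described above.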
Benefits:
- Adding this option avoids the duplicated errors that currently appear in the error log (unless the user explicitly asks for retries), which makes job failures less confusing to debug.
- Failed jobs are not retried when the user who started the job does not consider a retry important, which saves resources.
- Users who want a job retried more than once after a failure can request that as well.
Side thought:
While implementing this feature, it also makes sense to add a reasonable maximum retry limit to enforce responsible use.
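One possible way to enforce such a limit on the client side is to validate the value when parsing arguments. This is only a sketch: the cap of 5, the default of 0, and the names `MAX_RETRIES`, `retry_count` and `build_parser` are assumptions chosen for illustration, not values or code from the existing tool:

```python
import argparse

MAX_RETRIES = 5  # hypothetical cap; the real limit would need to be agreed on


def retry_count(value):
    """argparse type function: accept an integer between 0 and MAX_RETRIES."""
    try:
        count = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid retry count: {value!r}")
    if not 0 <= count <= MAX_RETRIES:
        raise argparse.ArgumentTypeError(
            f"retry count must be between 0 and {MAX_RETRIES}"
        )
    return count


def build_parser():
    """Build a parser with only the proposed --retry option (illustrative)."""
    parser = argparse.ArgumentParser(prog="toolforge-jobs run")
    parser.add_argument(
        "--retry",
        type=retry_count,
        default=0,  # assumed default for this sketch
        help=f"number of retries for a failed job (0 to {MAX_RETRIES})",
    )
    return parser


args = build_parser().parse_args(["--retry", "3"])
```

The same check would probably also need to live in the API itself, since a client-side check alone can be bypassed.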