Make it possible to configure retry policy for jobs executed on the toolforge jobs framework
Closed, ResolvedPublic

Description

Feature summary:
Add --retry to the list of toolforge-jobs run parameters.

Use case(s):
As reported by several users and observed by developers, when a job submitted to the toolforge jobs framework fails, it is retried exactly once before being considered failed.
There are scenarios where a user doesn't want failed jobs to be retried (as can be seen in T304893), and scenarios where users want one or more retries (as discussed under this gerrit patch: https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/820665).
Since Kubernetes already exposes this functionality, it won't take much to implement this at the wrapper API level and make it available to users.
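
As a rough illustration of how little is needed, Kubernetes batch/v1 Jobs have a `spec.backoffLimit` field that controls how many times a failed Job is retried. The sketch below is not the jobs-framework-api code; the function name `build_job_manifest` and its parameters are hypothetical, and it only shows a user-supplied retry count being passed through to that field.

```python
from typing import Any, Dict, List


def build_job_manifest(
    name: str, image: str, command: List[str], retry: int = 0
) -> Dict[str, Any]:
    """Return a batch/v1 Job manifest whose backoffLimit equals `retry`.

    Hypothetical helper for illustration; not taken from jobs-framework-api.
    """
    if retry < 0:
        raise ValueError("retry must be non-negative")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Kubernetes retries a failed Job up to backoffLimit times
            # before marking it as failed; 0 disables retries.
            "backoffLimit": retry,
            "template": {
                "spec": {
                    # Job pods must use restartPolicy Never or OnFailure.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }
```

With a mapping like this, a user who wants no retries passes 0, and a user who wants more than one retry passes a larger value.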

Benefits:

  • Adding this option ensures that errors are not duplicated in the error log (unless the user explicitly wants retries), as is currently the case. This makes job failures less confusing to debug.
  • It also ensures that failed jobs are not retried when the user who submitted the job doesn't consider a retry important, thus saving resources.
  • Some users might want to retry a job more than once after a failure; this makes that possible too.

Side Thought
While implementing this feature, it also makes sense to add a reasonable maximum retry limit to enforce responsible use of the feature.
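
One way such a limit could be enforced on the command line is sketched below with Python's standard `argparse`. The cap of 5, the default of 0, the program name, and the helper names `retry_count` and `build_parser` are all assumptions made for illustration; the task does not specify them.

```python
import argparse

# Hypothetical cap; the task only asks for "a reasonable maximum".
MAX_RETRIES = 5


def retry_count(value: str) -> int:
    """Parse a --retry value and reject anything outside 0..MAX_RETRIES."""
    try:
        count = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer")
    if not 0 <= count <= MAX_RETRIES:
        raise argparse.ArgumentTypeError(
            f"retry must be between 0 and {MAX_RETRIES}"
        )
    return count


def build_parser() -> argparse.ArgumentParser:
    """Build a minimal parser that exposes only the --retry option."""
    parser = argparse.ArgumentParser(prog="toolforge-jobs-sketch")
    parser.add_argument(
        "--retry",
        type=retry_count,
        default=0,  # assumed default, not confirmed by the task
        help=f"number of retries on failure (0 to {MAX_RETRIES})",
    )
    return parser
```

Validating in the argument type keeps the limit next to the option definition, so an out-of-range value is rejected with a usage error before any job is submitted.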

Event Timeline

Restricted Application added a subscriber: Aklapper.
Raymond_Ndibe renamed this task from Make it possible to configure retry policy on failure for jobs executed on the toolforge jobs framework to Make it possible to configure retry policy for jobs executed on the toolforge jobs framework. Aug 12 2022, 8:42 PM

Change 828669 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[cloud/toolforge/jobs-framework-cli@master] jobs-framework-cli: add --retry to cli

https://gerrit.wikimedia.org/r/828669

Change 828670 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: add --retry to api

https://gerrit.wikimedia.org/r/828670

There is an alternative approach to this, which is https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/820665 (i.e., don't do retries at all).

I like that one more. Thoughts?

See my CR and T304893#8283608.

Change 828670 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: add --retry to api

https://gerrit.wikimedia.org/r/828670

Change 828669 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-cli@master] jobs-framework-cli: add --retry to cli

https://gerrit.wikimedia.org/r/828669