Make it possible to configure retry policy for jobs executed on the toolforge jobs framework
Closed, Resolved · Public

Assigned To

Authored By

	Raymond_Ndibe
	Aug 12 2022, 8:38 PM

Description

Feature summary:
Add a --retry option to the list of toolforge-jobs run parameters.

Use case(s):
As reported by a number of users and observed by developers, when a job submitted to the Toolforge jobs framework fails, it is retried exactly once before being considered failed.
There are scenarios where a user does not want a failed job to be retried (as can be seen in T304893), and scenarios where users want one or more retries (as discussed under this gerrit patch: https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/820665).
Since Kubernetes already exposes this functionality, it should not take much to implement it at the wrapper API level and make it available to users.
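For illustration, here is a minimal sketch of how a retry count could be passed through to Kubernetes. It assumes the wrapper API builds a batch/v1 Job manifest and maps the retry value onto the Job's `spec.backoffLimit` field. The function name `build_job_manifest` and its parameters are hypothetical and are not taken from the jobs-framework-api code.

```python
def build_job_manifest(name, image, command, retry=0):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict.

    Hypothetical helper: `retry` is mapped onto spec.backoffLimit, the
    Kubernetes Job field that caps how many times failed pods are retried
    before the Job is marked as failed.
    """
    if retry < 0:
        raise ValueError("retry must be a non-negative integer")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # 0 means: do not retry after the first failure
            "backoffLimit": retry,
            "template": {
                "spec": {
                    # With "Never", each retry runs in a fresh pod
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }
```

With this mapping, a retry value of 0 would disable retries entirely, while a larger value would allow that many retries after a failure.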

Benefits:

Adding this option means errors are no longer duplicated in the error log, as they currently are, unless the user explicitly asks for retries. This makes job failures less confusing to debug.
It also ensures that failed jobs are not retried when the user who submitted the job does not consider a retry important, which saves resources.
Finally, it lets users who want a job retried more than once after a failure do so.

Side Thought
It also makes sense, while implementing this feature, to add a reasonable maximum retry limit so that the feature is used responsibly.
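A capped --retry option could be sketched on the CLI side roughly as follows. This is only an illustration: the cap `MAX_RETRIES = 5`, the default of 0, and the helper name `make_parser` are assumptions, not values taken from the merged patches.

```python
import argparse

# Hypothetical upper bound; the task only asks for "a reasonable maximum".
MAX_RETRIES = 5


def make_parser():
    """Build a parser for a hypothetical `toolforge-jobs run` subcommand."""
    parser = argparse.ArgumentParser(prog="toolforge-jobs run")
    parser.add_argument("name", help="job name")
    parser.add_argument(
        "--retry",
        type=int,
        choices=range(0, MAX_RETRIES + 1),  # rejects values outside 0..MAX_RETRIES
        default=0,  # assumed default: no retries
        help="number of times to retry a failed job",
    )
    return parser
```

Using `choices` makes argparse reject out-of-range values before any request reaches the API, although the API would presumably still need to enforce the same limit on its side.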

Details

	Subject	Repo	Branch	Lines +/-
	jobs-framework-cli: add --retry to cli	cloud/toolforge/jobs-framework-cli	master	+89 -10
	jobs-framework-api: add --retry to api	cloud/toolforge/jobs-framework-api	main	+95 -34


Related Objects

Status	Assigned	Task
Resolved	JHedden	T251027 "signatures" tool has failed job pods on Kubernetes cluster
Resolved	aborrero	T251917 Design the Jobs service in k8s
Resolved	aborrero	T283238 Toolforge: develop jobs-framework-api
Resolved	aborrero	T285944 Toolforge: beta phase for the new jobs framework
Resolved	aborrero	T327254 WMCS FY22/23 Q3: next steps in grid engine deprecation
Resolved	Raymond_Ndibe	T315114 Make it possible to configure retry policy for jobs executed on the toolforge jobs framework

Event Timeline

Raymond_Ndibe created this task. · Aug 12 2022, 8:38 PM

Restricted Application edited projects, added cloud-services-team (Kanban); removed cloud-services-team. · Aug 12 2022, 8:38 PM

Restricted Application added a subscriber: Aklapper.

Raymond_Ndibe renamed this task from "Make it possible to configure retry policy on failure for jobs executed on the toolforge jobs framework" to "Make it possible to configure retry policy for jobs executed on the toolforge jobs framework". · Aug 12 2022, 8:42 PM

Raymond_Ndibe added a parent task: T285944: Toolforge: beta phase for the new jobs framework.

Raymond_Ndibe claimed this task. · Aug 24 2022, 9:32 PM

Raymond_Ndibe added a project: User-Raymond_Ndibe. · Aug 29 2022, 10:26 PM

Change 828669 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[cloud/toolforge/jobs-framework-cli@master] jobs-framework-cli: add --retry to cli

https://gerrit.wikimedia.org/r/828669

gerritbot added a project: Patch-For-Review. · Sep 1 2022, 2:28 AM

Change 828670 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: add --retry to api

https://gerrit.wikimedia.org/r/828670

Raymond_Ndibe moved this task from Backlog to In Review on the User-Raymond_Ndibe board. · Sep 1 2022, 2:34 AM

fnegri subscribed. · Sep 5 2022, 9:36 AM

bd808 moved this task from Inbox to Doing on the cloud-services-team (Kanban) board. · Sep 27 2022, 9:24 PM

There is an alternative approach to this, which is https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/820665, i.e. don't do retries at all.

I like that one more. Thoughts?

In T315114#8282172, @aborrero wrote:

There is an alternative approach to this, which is https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/820665, i.e. don't do retries at all.

I like that one more. Thoughts?

See my CR and T304893#8283608.

PeterBowman subscribed. · Oct 11 2022, 10:09 AM

PeterBowman mentioned this in T319958: Migrate pbbot from Toolforge GridEngine to Toolforge Kubernetes. · Oct 11 2022, 10:14 AM

Giftpflanze subscribed. · Oct 13 2022, 2:10 PM

Change 828670 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: add --retry to api

https://gerrit.wikimedia.org/r/828670

Change 828669 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-cli@master] jobs-framework-cli: add --retry to cli

https://gerrit.wikimedia.org/r/828669

Raymond_Ndibe mentioned this in rCTKFac8b96d5a3ce: jobs-framework-cli: add --retry to cli. · Jan 16 2023, 11:36 PM

Maintenance_bot removed a project: Patch-For-Review. · Jan 17 2023, 12:30 AM

Raymond_Ndibe closed this task as Resolved. · Jan 17 2023, 2:07 PM

Raymond_Ndibe reopened this task as In Progress. · Jan 17 2023, 2:34 PM

aborrero added a parent task: T327254: WMCS FY22/23 Q3: next steps in grid engine deprecation. · Jan 18 2023, 11:44 AM

fnegri edited projects, added cloud-services-team; removed cloud-services-team (Kanban). · Jan 18 2023, 7:26 PM

fnegri moved this task from Kanban to Doing? (legacy column) on the cloud-services-team board.

fnegri edited projects, added cloud-services-team (FY2022/2023-Q3); removed cloud-services-team. · Jan 19 2023, 12:47 PM

fnegri moved this task from Backlog to In progress on the cloud-services-team (FY2022/2023-Q3) board.

aborrero closed this task as Resolved. · Jan 24 2023, 4:34 PM

dcaro moved this task from In progress to Done on the cloud-services-team (FY2022/2023-Q3) board. · Feb 14 2023, 1:32 PM
