
Rethink job retries in case of failures
Closed, Resolved · Public · Bug Report

Description

The error stream appears twice in the .err file.

Event Timeline

Apparently k8s tries to run the command a second time in case of failure. Is that intentional?

bd808 changed the subtype of this task from "Task" to "Bug Report". Apr 6 2022, 9:59 PM

Apparently k8s tries to run the command a second time in case of failure. Is that intentional?

The jobs have backoffLimit: 1, so they will be retried once before being considered failed.
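For context, this is roughly how a Kubernetes Job spec expresses that retry budget. This is an illustrative sketch with placeholder names, not the actual manifest generated by the jobs framework; with `backoffLimit: 1`, a failed pod is recreated once before the Job as a whole is marked failed:

```yaml
# Illustrative Kubernetes Job manifest (names are placeholders, not the
# actual objects created by jobs-framework-api).
apiVersion: batch/v1
kind: Job
metadata:
  name: example-tool-job
spec:
  backoffLimit: 1        # retry a failed pod once before marking the Job failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: task
          image: example-image
          command: ["./run-task.sh"]
```

Because the pod's stderr is captured on each attempt, a retried failure shows up twice in the `.err` file, which matches the behavior reported above.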

Maybe this should be closed then, if this is acceptable behavior?

Retrying failed jobs is not always acceptable. There should be an option to try jobs only once.

Retrying failed jobs is not always acceptable. There should be an option to try jobs only once.

In that case it makes sense to make this configurable. I will go ahead and create a Phabricator task for this.

Change 828670 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: add --retry to api

https://gerrit.wikimedia.org/r/828670

Change 828669 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[cloud/toolforge/jobs-framework-cli@master] jobs-framework-cli: add --retry to cli

https://gerrit.wikimedia.org/r/828669

Retrying failed jobs is not always acceptable. There should be an option to try jobs only once.

Good point. I think there should be no problem having no retries at all and letting the user re-run the job if required.

aborrero renamed this task from "Stderr is doubled with toolforge-jobs" to "Rethink job retries in case of failures". Sep 26 2022, 11:58 AM

My proposal is that we leave the current filelog option as is. I think the investment that will really benefit us is trying to work on the root problem: T127367: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots

@aborrero this task is specifically about the retry policy and not logs, so this comment can be removed, no?

My proposal is that we leave the current filelog option as is. I think the investment that will really benefit us is trying to work on the root problem: T127367: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots

@aborrero this task is specifically about the retry policy and not logs, so this comment can be removed, no?

right, sorry for the noise.

I think there should be no problem having no retries at all and letting the user re-run the job if required.

Without automatically repeating failed jobs, using toolforge-jobs cronjobs gives me little to no benefit, coupled with worse cronjob management compared to crontab for the grid (e.g., loading jobs from a file kills running jobs).

The current behavior makes it so that I very rarely have to manually trigger a failed cronjob when it fails once (and succeeds on rerun). The failures are most commonly due to prod issues such as read-only or connectivity issues that last longer than the job itself already handles or incidents like this one.

If it were up to me, we would use the k8s default for backoffLimit (6) instead of just 1.

I think there should be no problem having no retries at all and letting the user re-run the job if required.

Without automatically repeating failed jobs, using toolforge-jobs cronjobs gives me little to no benefit, coupled with worse cronjob management compared to crontab for the grid (e.g., loading jobs from a file kills running jobs).

The current behavior makes it so that I very rarely have to manually trigger a failed cronjob when it fails once (and succeeds on rerun). The failures are most commonly due to prod issues such as read-only or connectivity issues that last longer than the job itself already handles or incidents like this one.

If it were up to me, we would use the k8s default for backoffLimit (6) instead of just 1.

The patch being introduced attempts to solve the issue by introducing a --retry option. Not specifying --retry defaults to 0, which really is the more intuitive behavior. If you need a job to retry, you simply specify --retry <0-5>. This solves the issue you are concerned with, no?

This solves the issue you are concerned with no?

Yes, I was just responding to aborrero's comment.

I think there should be no problem having no retries at all and letting the user re-run the job if required.

Without automatically repeating failed jobs, using toolforge-jobs cronjobs gives me little to no benefit, coupled with worse cronjob management compared to crontab for the grid (e.g., loading jobs from a file kills running jobs).

You can maintain a jobs.yaml file as part of your tool source code and load it every time you need. I think this is very similar to maintaining a crontab file. In fact, in my opinion, the yaml format is better than the crontab format :-P

Anyways we can easily extend the CLI to allow incremental loads of jobs (in addition to just flushing them). But that would be a separate ticket. Would that work for you?
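For illustration, a jobs.yaml kept in the tool's source repository might look like the sketch below and be loaded with something like `toolforge-jobs load jobs.yaml`. The field names here are an assumption based on the CLI's flags, not a verified schema, and the image name is a placeholder:

```yaml
# Hypothetical jobs.yaml sketch; field names and image are illustrative.
- name: daily-cleanup
  command: ./cleanup.sh
  image: example-image
  schedule: "@daily"
- name: hourly-sync
  command: ./sync.sh
  image: example-image
  schedule: "0 * * * *"
```

As noted above, reloading such a file currently flushes existing jobs, which is the behavior the incremental-load suggestion would address.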

The current behavior makes it so that I very rarely have to manually trigger a failed cronjob when it fails once (and succeeds on rerun). The failures are most commonly due to prod issues such as read-only or connectivity issues that last longer than the job itself already handles or incidents like this one.

If it were up to me, we would use the k8s default for backoffLimit (6) instead of just 1.

The behavior you are describing is weak, somewhat arbitrary, and can lead to cumbersome and hard-to-debug scenarios, in which it may not be clear why or how many times a job has been restarted/retried.
Cronjobs are, by their scheduled nature, meant to be run again. I think that most cron schedulers work like this: if a given cronjob run fails, you have to wait until the next run. If a failure happens, be it in the job itself, the environment, or the system, a clear failure is what should be reported to the user.

To be clear, I consider the current retry policy a bug that should be fixed. I'm convinced that no retries at all is more consistent, more robust and a more elegant semantic.

I think there should be no problem having no retries at all and letting the user re-run the job if required.

Without automatically repeating failed jobs, using toolforge-jobs cronjobs gives me little to no benefit, coupled with worse cronjob management compared to crontab for the grid (e.g., loading jobs from a file kills running jobs).

You can maintain a jobs.yaml file as part of your tool source code and load it every time you need. I think this is very similar to maintaining a crontab file. In fact, in my opinion, the yaml format is better than the crontab format :-P

Anyways we can easily extend the CLI to allow incremental loads of jobs (in addition to just flushing them). But that would be a separate ticket. Would that work for you?

With crontab, you can edit the file (including making changes to the definition of a currently running job) without it affecting/terminating currently running jobs. I have some cronjobs that are long-running, so I don't want to interrupt them while they are running. Currently, this means I need to time loading the yaml file instead of being able to load it whenever I want, like I can with crontab. I'm not sure you will be able to easily achieve that due to how k8s cronjobs work.

The current behavior makes it so that I very rarely have to manually trigger a failed cronjob when it fails once (and succeeds on rerun). The failures are most commonly due to prod issues such as read-only or connectivity issues that last longer than the job itself already handles or incidents like this one.

If it were up to me, we would use the k8s default for backoffLimit (6) instead of just 1.

The behavior you are describing is weak, somewhat arbitrary, and can lead to cumbersome and hard-to-debug scenarios, in which it may not be clear why or how many times a job has been restarted/retried.
Cronjobs are, by their scheduled nature, meant to be run again. I think that most cron schedulers work like this: if a given cronjob run fails, you have to wait until the next run. If a failure happens, be it in the job itself, the environment, or the system, a clear failure is what should be reported to the user.

It is how k8s cronjobs are designed to work by default. With proper logging/alerting, you can determine why and how often a job is retried. How are no retries more robust?
For most of my cronjobs, I don't care how many times a job retries due to failure, as long as it eventually succeeds and failures are reported (currently broken, with nothing being done about it), preferably without my intervention, so as not to waste my time (especially since manually rerunning a job in k8s is more cumbersome than with the grid).

To be clear, I consider the current retry policy a bug that should be fixed. I'm convinced that no retries at all is more consistent, more robust and a more elegant semantic.

I don't. It's a feature - one that you explicitly set when creating toolforge-jobs.

Disabling/removing useful features (and not properly maintaining the utility) just makes me not want to use it (or the platform).

The current plan is the following:

  1. set the default retries to 0
  2. introduce a config flag for users to be able to establish their own retry policy

Hope this addresses your concerns, @JJMC89.
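Under this plan, the --retry value (defaulting to 0) would presumably map onto the generated Job's backoffLimit. The sketch below is illustrative, with placeholder names, and is not the framework's actual CronJob template:

```yaml
# Illustrative CronJob sketch; --retry (default 0) would set
# jobTemplate.spec.backoffLimit. Names are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-tool-cronjob
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 0    # --retry not given: fail immediately, no retries
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: task
              image: example-image
              command: ["./run-task.sh"]
```

With `backoffLimit: 0`, a failed run is reported as a clear failure and the job simply runs again at its next scheduled time, which matches the no-retries semantics argued for above.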

I think we will move forward with https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/828670 as soon as it's ready.

Change 828670 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: add --retry to api

https://gerrit.wikimedia.org/r/828670

Change 828669 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-cli@master] jobs-framework-cli: add --retry to cli

https://gerrit.wikimedia.org/r/828669