
Rethink job retries in case of failures
Closed, Resolved · Public · Bug Report

Description

The error stream appears twice in the .err file.

Event Timeline

Apparently k8s tries to run the command a second time in case of failure. Is that intentional?

bd808 changed the subtype of this task from "Task" to "Bug Report". Apr 6 2022, 9:59 PM

Apparently k8s tries to run the command a second time in case of failure. Is that intentional?

The jobs have backoffLimit: 1, so they will be retried once before being considered failed.
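For context, this is roughly how a Kubernetes Job spec expresses that retry budget. This is an illustrative sketch with placeholder names, not the actual manifest generated by the jobs framework; with `backoffLimit: 1`, a failed pod is recreated once before the Job as a whole is marked failed:

```yaml
# Illustrative Kubernetes Job manifest (names are placeholders, not the
# actual objects created by jobs-framework-api).
apiVersion: batch/v1
kind: Job
metadata:
  name: example-tool-job
spec:
  backoffLimit: 1        # retry a failed pod once before marking the Job failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: task
          image: example-image
          command: ["./run-task.sh"]
```

Because the pod's stderr is captured on each attempt, a retried failure shows up twice in the `.err` file, which matches the behavior reported above.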

Maybe this should be closed then, if this is acceptable behavior?

Retrying failed jobs is not always acceptable. There should be an option to try jobs only once.

Retrying failed jobs is not always acceptable. There should be an option to try jobs only once.

In that case it makes sense to make this configurable. I will go ahead and create a Phabricator task for this.

Change 828670 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: add --retry to api

https://gerrit.wikimedia.org/r/828670

Change 828669 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[cloud/toolforge/jobs-framework-cli@master] jobs-framework-cli: add --retry to cli

https://gerrit.wikimedia.org/r/828669

Retrying failed jobs is not always acceptable. There should be an option to try jobs only once.

Good point. I think there should be no problem having no retries at all and letting the user re-run the job if required.

aborrero renamed this task from "Stderr is doubled with toolforge-jobs" to "Rethink job retries in case of failures". Sep 26 2022, 11:58 AM

My proposal is that we leave the current filelog option as is. I think the investment that will really benefit us is trying to work on the root problem: T127367: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots

@aborrero this task is specifically about the retry policy and not logs, so this comment can be removed, no?

My proposal is that we leave the current filelog option as is. I think the investment that will really benefit us is trying to work on the root problem: T127367: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots

@aborrero this task is specifically about the retry policy and not logs, so this comment can be removed, no?

right, sorry for the noise.

I think there should be no problem having no retries at all and letting the user re-run the job if required.

Without automatically repeating failed jobs, using toolforge-jobs cronjobs gives me little to no benefit, coupled with worse cronjob management compared to crontab for the grid (e.g., loading jobs from a file kills running jobs).

The current behavior makes it so that I very rarely have to manually trigger a failed cronjob when it fails once (and succeeds on rerun). The failures are most commonly due to prod issues such as read-only or connectivity issues that last longer than the job itself already handles or incidents like this one.

If it were up to me, we would use the k8s default for backoffLimit (6) instead of just 1.

I think there should be no problem having no retries at all and letting the user re-run the job if required.

Without automatically repeating failed jobs, using toolforge-jobs cronjobs gives me little to no benefit, coupled with worse cronjob management compared to crontab for the grid (e.g., loading jobs from a file kills running jobs).

The current behavior makes it so that I very rarely have to manually trigger a failed cronjob when it fails once (and succeeds on rerun). The failures are most commonly due to prod issues such as read-only or connectivity issues that last longer than the job itself already handles or incidents like this one.

If it were up to me, we would use the k8s default for backoffLimit (6) instead of just 1.

The patch being introduced attempts to solve the issue by introducing a --retry option. Not specifying --retry defaults to 0, which really is the more intuitive behavior. If you need a job to retry, you simply specify --retry <0-5>. This solves the issue you are concerned with, no?

This solves the issue you are concerned with no?

Yes, I was just responding to aborrero's comment.

I think there should be no problem having no retries at all and letting the user re-run the job if required.

Without automatically repeating failed jobs, using toolforge-jobs cronjobs gives me little to no benefit, coupled with worse cronjob management compared to crontab for the grid (e.g., loading jobs from a file kills running jobs).

You can maintain a jobs.yaml file as part of your tool source code and load it every time you need. I think this is very similar to maintaining a crontab file. In fact, in my opinion, the yaml format is better than the crontab format :-P

Anyways we can easily extend the CLI to allow incremental loads of jobs (in addition to just flushing them). But that would be a separate ticket. Would that work for you?
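For illustration, a jobs.yaml kept in the tool's source repository might look like the sketch below and be loaded with something like `toolforge-jobs load jobs.yaml`. The field names here are an assumption based on the CLI's flags, not a verified schema, and the image name is a placeholder:

```yaml
# Hypothetical jobs.yaml sketch; field names and image are illustrative.
- name: daily-cleanup
  command: ./cleanup.sh
  image: example-image
  schedule: "@daily"
- name: hourly-sync
  command: ./sync.sh
  image: example-image
  schedule: "0 * * * *"
```

As noted above, reloading such a file currently flushes existing jobs, which is the behavior the incremental-load suggestion would address.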

The current behavior makes it so that I very rarely have to manually trigger a failed cronjob when it fails once (and succeeds on rerun). The failures are most commonly due to prod issues such as read-only or connectivity issues that last longer than the job itself already handles or incidents like this one.

If it were up to me, we would use the k8s default for backoffLimit (6) instead of just 1.

The behavior you are describing is weak, somewhat arbitrary, and can lead to cumbersome and hard-to-debug scenarios, in which it may not be clear why or how many times a job has been restarted/retried.
Cronjobs are, by their scheduled nature, meant to be run again. I think that most cron schedulers work like this: if a given cronjob run fails, you have to wait until the next run. If a failure happens, be it in the job itself, the environment, or the system, a clear failure is what should be reported to the user.

To be clear, I consider the current retry policy a bug that should be fixed. I'm convinced that no retries at all is more consistent, more robust and a more elegant semantic.

I think there should be no problem having no retries at all and letting the user re-run the job if required.

Without automatically repeating failed jobs, using toolforge-jobs cronjobs gives me little to no benefit, coupled with worse cronjob management compared to crontab for the grid (e.g., loading jobs from a file kills running jobs).

You can maintain a jobs.yaml file as part of your tool source code and load it every time you need. I think this is very similar to maintaining a crontab file. In fact, in my opinion, the yaml format is better than the crontab format :-P

Anyways we can easily extend the CLI to allow incremental loads of jobs (in addition to just flushing them). But that would be a separate ticket. Would that work for you?

With crontab, you can edit the file (including making changes to the definition of a currently running job) without it affecting/terminating currently running jobs. I have some cronjobs that are long-running, so I don't want to interrupt them while they are running. Currently, this means I need to time loading the yaml file instead of being able to load it whenever I want, like I can with crontab. I'm not sure you will be able to easily achieve that due to how k8s cronjobs work.

The current behavior makes it so that I very rarely have to manually trigger a failed cronjob when it fails once (and succeeds on rerun). The failures are most commonly due to prod issues such as read-only or connectivity issues that last longer than the job itself already handles or incidents like this one.

If it were up to me, we would use the k8s default for backoffLimit (6) instead of just 1.

The behavior you are describing is weak, somewhat arbitrary, and can lead to cumbersome and hard-to-debug scenarios, in which it may not be clear why or how many times a job has been restarted/retried.
Cronjobs are, by their scheduled nature, meant to be run again. I think that most cron schedulers work like this: if a given cronjob run fails, you have to wait until the next run. If a failure happens, be it in the job itself, the environment, or the system, a clear failure is what should be reported to the user.

It is how k8s cronjobs are designed to work by default. With proper logging/alerting, you can determine why and how often a job is retried. How are no retries more robust?
For most of my cronjobs, I don't care how many times a job retries due to failure, as long as it eventually succeeds and failures are reported (currently broken, with nothing being done about it), preferably without my intervention, so as not to waste my time (especially since manually rerunning a job in k8s is more cumbersome than with the grid).

To be clear, I consider the current retry policy a bug that should be fixed. I'm convinced that no retries at all is more consistent, more robust and a more elegant semantic.

I don't. It's a feature - one that you explicitly set when creating toolforge-jobs.

Disabling/removing useful features (and not properly maintaining the utility) just makes me not want to use it (or the platform).

The current plan is the following:

  1. set the default retries to 0
  2. introduce a config flag for users to be able to establish their own retry policy

Hope this addresses your concerns, @JJMC89.
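Under this plan, the --retry value (defaulting to 0) would presumably map onto the generated Job's backoffLimit. The sketch below is illustrative, with placeholder names, and is not the framework's actual CronJob template:

```yaml
# Illustrative CronJob sketch; --retry (default 0) would set
# jobTemplate.spec.backoffLimit. Names are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-tool-cronjob
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 0    # --retry not given: fail immediately, no retries
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: task
              image: example-image
              command: ["./run-task.sh"]
```

With `backoffLimit: 0`, a failed run is reported as a clear failure and the job simply runs again at its next scheduled time, which matches the no-retries semantics argued for above.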

I think we will move forward with https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/828670 as soon as it's ready.

Change 828670 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: add --retry to api

https://gerrit.wikimedia.org/r/828670

Change 828669 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-cli@master] jobs-framework-cli: add --retry to cli

https://gerrit.wikimedia.org/r/828669