
process-control repeated failure handling
Closed, Declined · Public · 2 Estimated Story Points

Description

At some failure threshold, we should disable the job. Backoff strategies might vary by job.

This will be a fun conversation. History should be stored somewhere pleasant to access. Maybe next to the lockfiles, as YAML?
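
For the sake of argument, a per-job history file next to the lockfile might look something like this (the path and every key here are purely illustrative, nothing process-control writes today):

```yaml
# Hypothetical history file, e.g. alongside the job's lockfile.
# Neither the location nor these keys exist in process-control today.
job: silverpop_daily
disabled: false
runs:
  - started: 2017-07-26 20:01:02
    exit_code: 1
    consecutive_failures: 1
  - started: 2017-07-26 20:06:03
    exit_code: 1
    consecutive_failures: 2
```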

ON HOLD: we might be able to do this through Icinga.

Event Timeline

Ejegg triaged this task as Medium priority. Mar 28 2017, 9:18 PM
Ejegg set the point value for this task to 2.
cwdent raised the priority of this task from Medium to High. Jul 26 2017, 8:16 PM
cwdent added subscribers: Jgreen, cwdent.

Today's mailstrom (ha ha) warrants re-prioritizing this issue. p-c should stop jobs at a fail mail threshold, something like 5 mails in 5 minutes.
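
A minimal sketch of that cutoff, assuming p-c tracked failure timestamps per job (none of these names exist in p-c today):

```python
import time

# Sketch only: fail_times would be the epoch timestamps of this job's
# recent fail mails, persisted wherever p-c keeps per-job state.
FAIL_LIMIT = 5
WINDOW_SECONDS = 5 * 60

def should_disable(fail_times, now=None):
    """True once FAIL_LIMIT failures land inside the sliding window."""
    if now is None:
        now = time.time()
    recent = [t for t in fail_times if now - t <= WINDOW_SECONDS]
    return len(recent) >= FAIL_LIMIT
```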

Of course most of the alerts today were not "fail mail" (though there were a bunch of those too) but automated Icinga checks. Once Prometheus has been built out for fundraising we can explore options for making safety checks disable jobs and/or kill processes. But the fail mail switch alone would still be very useful. In the meantime @Jgreen and I will be more aggressive about stopping things when they start hollering.

Silverpop fetch jobs would be a great place to start. Often just pausing for a while is harmless and fixes it all.

If you end up handling this at the Python level, there's a nice library, "retry", which includes geometric backoff and the like.
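
Usage looks roughly like this (the decorator parameters are the library's real API; the fetch function and its URL handling are made up for illustration):

```python
import urllib.request

from retry import retry  # pip install retry

# Waits 1, 2, 4, then 8 seconds between the five attempts (the delay
# doubles each time); the fifth failure re-raises IOError to the caller.
@retry(IOError, tries=5, delay=1, backoff=2)
def fetch_export(url):
    return urllib.request.urlopen(url).read()
```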

I'm thinking about this and am not really sure what we need.

Looking at the Silverpop case: it's a job that runs once daily, and I believe if it fails fr-tech reruns it manually. Process-control could be modified to pay attention to the exit status and keep retrying at a configured interval until it sees a clean exit. I think we would notify on the first failure and then remain silent until the successful run, when we'd send a recovery notification.
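
A minimal sketch of that flow, with every helper hypothetical rather than existing process-control code:

```python
# All hypothetical: run_job executes the command and returns its exit
# status, notify sends the fail/recovery mail, and state persists
# between runs (e.g. in that YAML file next to the lockfile).
def run_with_recovery_notice(job, state, run_job, notify):
    exit_code = run_job(job)
    if exit_code != 0:
        if not state.get('failing'):
            notify('%s failed (exit %d); will keep retrying' % (job, exit_code))
            state['failing'] = True
    elif state.get('failing'):
        notify('%s recovered' % job)
        state['failing'] = False
    return exit_code
```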

Another use case is a queue consumer that runs every minute, where we want the opposite: we want it to stop running after some number of failures, and stay stopped until someone intervenes. How would notification work in this case? Maybe the usual cronspam for each failure until the limit is reached, and then a notification that process-control is taking the job offline?
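
Sketched out, again with hypothetical names only:

```python
# Hypothetical: disable_job would mark the job so the scheduler skips
# it until a human re-enables it; state persists between runs.
MAX_FAILURES = 5

def handle_failure(job, state, notify, disable_job):
    state['consecutive_failures'] = state.get('consecutive_failures', 0) + 1
    failures = state['consecutive_failures']
    if failures < MAX_FAILURES:
        notify('%s failed (%d of %d allowed)' % (job, failures, MAX_FAILURES))
    else:
        disable_job(job)
        notify('%s failed %d times; taking it offline until someone '
               're-enables it' % (job, MAX_FAILURES))
```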

Are there any other use cases to consider?

Jgreen lowered the priority of this task from High to Medium. Jun 6 2022, 5:44 PM
Jgreen moved this task from Watching to Done on the fundraising-tech-ops board.

Declining this task due to lack of interest. Plus we're probably better off treating process-control as the simple cron tasker it already is, and leaving the smarter logic to the jobs themselves.