
process-control repeated failure handling
Closed, Declined · Public · 2 Estimated Story Points

Description

At some failure threshold, we should disable the job. Backoff strategies might vary by job.

This will be a fun conversation. History should be stored somewhere pleasant to access. Maybe next to the lockfiles, as YAML?
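
For the sake of argument, a per-job history file next to the lockfile might look something like this (the path and every key here are purely illustrative, nothing process-control writes today):

```yaml
# Hypothetical history file, e.g. alongside the job's lockfile.
# Neither the location nor these keys exist in process-control today.
job: silverpop_daily
disabled: false
runs:
  - started: 2017-07-26 20:01:02
    exit_code: 1
    consecutive_failures: 1
  - started: 2017-07-26 20:06:03
    exit_code: 1
    consecutive_failures: 2
```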

ON HOLD: we might be able to do this through Icinga.

Event Timeline

Ejegg triaged this task as Medium priority. Mar 28 2017, 9:18 PM
Ejegg set the point value for this task to 2.
cwdent raised the priority of this task from Medium to High. Jul 26 2017, 8:16 PM
cwdent added subscribers: Jgreen, cwdent.

Today's mailstrom (ha ha) warrants re-prioritizing this issue. p-c should stop jobs at a fail mail threshold, something like 5 mails in 5 minutes.
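
A minimal sketch of that cutoff, assuming p-c tracked failure timestamps per job (none of these names exist in p-c today):

```python
import time

# Sketch only: fail_times would be the epoch timestamps of this job's
# recent fail mails, persisted wherever p-c keeps per-job state.
FAIL_LIMIT = 5
WINDOW_SECONDS = 5 * 60

def should_disable(fail_times, now=None):
    """True once FAIL_LIMIT failures land inside the sliding window."""
    if now is None:
        now = time.time()
    recent = [t for t in fail_times if now - t <= WINDOW_SECONDS]
    return len(recent) >= FAIL_LIMIT
```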

Of course most of the alerts today were not "fail mail" (though there were a bunch of those too) but automated Icinga checks. Once Prometheus has been built out for fundraising we can explore options for making safety checks disable jobs and/or kill processes. But the fail mail switch alone would still be very useful. In the meantime @Jgreen and I will be more aggressive about stopping things when they start hollering.

Silverpop fetch jobs would be a great place to start. Often just pausing for a while is harmless and fixes it all.

If you end up handling this at the Python level, there's a nice library, "retry", which includes geometric backoff and the like.
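
Usage looks roughly like this (the decorator parameters are the library's real API; the fetch function and its URL handling are made up for illustration):

```python
import urllib.request

from retry import retry  # pip install retry

# Waits 1, 2, 4, then 8 seconds between the five attempts (the delay
# doubles each time); the fifth failure re-raises IOError to the caller.
@retry(IOError, tries=5, delay=1, backoff=2)
def fetch_export(url):
    return urllib.request.urlopen(url).read()
```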

I'm thinking about this and am not really sure what we need.

Looking at the Silverpop case: it's a job that runs once daily, and I believe if it fails fr-tech reruns it manually. Process-control could be modified to pay attention to the exit status and keep retrying at a configured interval until it sees a clean exit. I think we would notify on the first failure and then remain silent until the successful run, when we'd send a recovery notification.
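
A minimal sketch of that flow, with every helper hypothetical rather than existing process-control code:

```python
# All hypothetical: run_job executes the command and returns its exit
# status, notify sends the fail/recovery mail, and state persists
# between runs (e.g. in that YAML file next to the lockfile).
def run_with_recovery_notice(job, state, run_job, notify):
    exit_code = run_job(job)
    if exit_code != 0:
        if not state.get('failing'):
            notify('%s failed (exit %d); will keep retrying' % (job, exit_code))
            state['failing'] = True
    elif state.get('failing'):
        notify('%s recovered' % job)
        state['failing'] = False
    return exit_code
```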

Another use case is a queue consumer that runs every minute, where we want the opposite: we want it to stop running after some number of failures, and stay stopped until someone intervenes. How would notification work in this case? Maybe the usual cronspam for each failure until the limit is reached, and then a notification that process-control is taking the job offline?
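
Sketched out, again with hypothetical names only:

```python
# Hypothetical: disable_job would mark the job so the scheduler skips
# it until a human re-enables it; state persists between runs.
MAX_FAILURES = 5

def handle_failure(job, state, notify, disable_job):
    state['consecutive_failures'] = state.get('consecutive_failures', 0) + 1
    failures = state['consecutive_failures']
    if failures < MAX_FAILURES:
        notify('%s failed (%d of %d allowed)' % (job, failures, MAX_FAILURES))
    else:
        disable_job(job)
        notify('%s failed %d times; taking it offline until someone '
               're-enables it' % (job, MAX_FAILURES))
```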

Are there any other use cases to consider?

Jgreen lowered the priority of this task from High to Medium. Jun 6 2022, 5:44 PM
Jgreen moved this task from Watching to Done on the fundraising-tech-ops board.

Declining this task due to lack of interest. Plus we're probably better off treating process-control as the simple cron tasker it already is, and leaving the smarter logic to the jobs themselves.