process-control repeated failure handling
Open, High | Public | 2 Estimated Story Points

Description

At some failure threshold, we should disable the job. Backoff strategies might vary from job to job.

This will be a fun conversation. History should be stored somewhere pleasant to access. Maybe next to the lockfiles, as YAML?

ON HOLD: we might be able to do this through Icinga.

Event Timeline

Ejegg triaged this task as Medium priority. Mar 28 2017, 9:18 PM
Ejegg set the point value for this task to 2.
cwdent raised the priority of this task from Medium to High. Jul 26 2017, 8:16 PM
cwdent added subscribers: Jgreen, cwdent.

Today's mailstrom (ha ha) warrants re-prioritizing this issue. p-c should stop jobs at a fail mail threshold, something like 5 mails in 5 minutes.
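A rough sketch of what a "5 mails in 5 minutes" switch could look like, as a sliding-window counter. Everything here is hypothetical: `FailureWindow`, `record_failure`, and the parameter names are made up for illustration and are not existing process-control code. It also assumes one failure equals one fail mail, and it keeps state in memory only, so persisting it between runs is left open.

```python
import time
from collections import deque


class FailureWindow:
    """Hypothetical helper: count recent failures within a time window."""

    def __init__(self, max_failures=5, window_seconds=300):
        # Defaults mirror the "5 mails in 5 minutes" suggestion.
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = deque()

    def record_failure(self, now=None):
        """Record one failure; return True once the threshold is reached."""
        now = time.time() if now is None else now
        self._failures.append(now)
        # Drop failures that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures
```

The caller would disable the job (however p-c ends up doing that) when `record_failure()` returns True. Passing `now` explicitly is only there to make the window easy to test.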

Of course most of the alerts today were not "fail mail" (though there were a bunch of those too) but automated Icinga checks. Once Prometheus has been built up for fundraising we can explore options for making safety checks disable jobs and/or kill processes. But the fail mail switch alone would still be very useful. In the meantime @Jgreen and I will be more aggressive about stopping things when they start hollering.

Silverpop fetch jobs would be a great place to start. Often just pausing for a while is harmless and fixes it all.

If you end up handling at the Python level, there's a nice library "retry" which includes geometric back-off and stuff.
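For reference, geometric back-off is simple enough to sketch with the standard library alone. This is not the "retry" library's API; `backoff_delays`, `run_with_backoff`, and all parameter names and defaults are hypothetical, and a real job would probably want to catch narrower exceptions than `Exception`.

```python
import time


def backoff_delays(initial=1.0, factor=2.0, max_delay=60.0, tries=5):
    """Yield the wait before each retry: initial, initial*factor, ..., capped."""
    delay = initial
    # tries attempts means at most tries - 1 waits between them.
    for _ in range(tries - 1):
        yield min(delay, max_delay)
        delay *= factor


def run_with_backoff(job, tries=5, initial=1.0, factor=2.0,
                     max_delay=60.0, sleep=time.sleep):
    """Call job() until it succeeds or the attempts run out."""
    delays = backoff_delays(initial, factor, max_delay, tries)
    while True:
        try:
            return job()
        except Exception:
            wait = next(delays, None)
            if wait is None:
                # Out of retries: re-raise the last failure.
                raise
            sleep(wait)
```

The `sleep` argument is injectable so the behaviour can be checked without actually waiting. Per-job back-off would just mean different `initial`, `factor`, and `tries` values per job.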