At some failure threshold, we should disable the job. Backoff strategies might vary by job.
This will be a fun conversation. Each job's history should be stored somewhere easy to access. Maybe next to the lockfiles, as YAML?
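A rough sketch of what the threshold-plus-backoff bookkeeping could look like, to anchor that conversation. Everything here is an assumption for illustration: the `JobHistory` name, the default threshold of 5 consecutive failures, and exponential backoff as the default strategy are all hypothetical, and nothing is decided yet.

```python
from dataclasses import dataclass, asdict


@dataclass
class JobHistory:
    """Hypothetical per-job failure record; all defaults are placeholders."""

    max_failures: int = 5        # assumed threshold before the job is disabled
    base_delay: float = 60.0     # assumed first retry delay, in seconds
    max_delay: float = 3600.0    # assumed cap on the retry delay, in seconds
    consecutive_failures: int = 0
    disabled: bool = False

    def record(self, succeeded: bool) -> None:
        """Update the record after one run of the job."""
        if succeeded:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            # Stays disabled until someone re-enables it by hand.
            self.disabled = True

    def next_delay(self) -> float:
        """Seconds to wait before the next attempt (exponential, capped)."""
        if self.consecutive_failures == 0:
            return 0.0
        delay = self.base_delay * 2 ** (self.consecutive_failures - 1)
        return min(delay, self.max_delay)


history = JobHistory(max_failures=3, base_delay=10.0, max_delay=25.0)
for _ in range(3):
    history.record(succeeded=False)

# A plain dict like this is what would be written out next to the lockfiles.
snapshot = asdict(history)
```

Because the thresholds and delays live on the record itself, each job can carry its own backoff settings. `asdict` gives a plain dict, which should be straightforward to dump as YAML with whatever YAML library we end up using.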
ON HOLD: we might be able to do this through Icinga.