Motive
Re-running failed refine jobs is one of the more time-consuming ops-week tasks. The way we do it leaves us open to discarding unknown amounts of data: when we drop malformed records, we don't check what percentage of the total number of records was dropped. Some of this can probably be automated.
Proposal
- Change refine to run in PERMISSIVE mode by default.
- Check the percentage of records dropped and alert if it is above a threshold K (K = 1%?).
This should eliminate most of the day-to-day intervention in refine and alert us when something more actionable is going on.
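The threshold check in the proposal could look roughly like the sketch below. It is only an illustration, not how refine works today: the function names `dropped_percentage` and `should_alert` are hypothetical, the default of 1% is just the tentative K from the proposal, and how the two record counts are obtained from a refine run is left open.

```python
def dropped_percentage(total_records: int, dropped_records: int) -> float:
    """Return dropped records as a percentage of all records read.

    Both counts are assumed to be supplied by the caller; this sketch
    does not say how refine would collect them.
    """
    if total_records < 0 or dropped_records < 0:
        raise ValueError("record counts must be non-negative")
    if dropped_records > total_records:
        raise ValueError("dropped_records cannot exceed total_records")
    if total_records == 0:
        # Nothing was read, so nothing was dropped.
        return 0.0
    return 100.0 * dropped_records / total_records


def should_alert(total_records: int, dropped_records: int,
                 threshold_pct: float = 1.0) -> bool:
    """Return True when the dropped share is strictly above the threshold K.

    threshold_pct defaults to 1.0 (the tentative K = 1% from the proposal).
    """
    return dropped_percentage(total_records, dropped_records) > threshold_pct


if __name__ == "__main__":
    # Hypothetical counts for one refine run: 11 of 1000 records dropped.
    print(should_alert(total_records=1000, dropped_records=11))
```

With these definitions a run that drops exactly K percent does not alert; only runs strictly above K do. Whether the boundary should be inclusive is a choice the proposal leaves open.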