Make RefineFailuresChecker checker jobs use the same parameters as Refine jobs
Closed, Resolved (Public)

Description

In https://gerrit.wikimedia.org/r/c/operations/puppet/+/616198/ we disabled RefineFailuresChecker for refine_eventlogging_analytics. This is fine, as refine_eventlogging_legacy uses the same input and output paths, so its checker alerts on the same failures.

However, the alerts we get are confusing. If a _REFINE_FAILED flag exists in a partition that is refined by refine_eventlogging_analytics, the alert will mention refine_eventlogging_legacy instead. This leads to mistakes like https://phabricator.wikimedia.org/T274322#6817770, where a job was re-run incorrectly.

I think we should just re-enable RefineFailuresChecker for refine_eventlogging_analytics, but make RefineFailuresChecker jobs use the same table whitelist/blacklist and other parameters that their 'parent' Refine job uses.
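A minimal sketch of that idea, in Python for illustration only (it does not reflect how Refine or its puppet configuration is implemented). `RefineJobConfig`, its field names, `checker_config_for`, `covers`, the `/in` and `/out` paths, and the table names are all hypothetical; only the job names and the `monitor_..._failure_flags` naming pattern come from this task.

```python
import re
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class RefineJobConfig:
    """Hypothetical stand-in for a Refine job's parameters (illustrative fields)."""
    name: str
    input_path: str
    output_path: str
    table_include_regex: Optional[str] = None  # table whitelist
    table_exclude_regex: Optional[str] = None  # table blacklist


def checker_config_for(parent: RefineJobConfig) -> RefineJobConfig:
    # Copy every parameter from the parent Refine job; only the name changes.
    return replace(parent, name=f"monitor_{parent.name}_failure_flags")


def covers(config: RefineJobConfig, table: str) -> bool:
    # A table is covered if it passes the include filter (when one is set)
    # and is not rejected by the exclude filter (when one is set).
    if config.table_include_regex and not re.fullmatch(config.table_include_regex, table):
        return False
    if config.table_exclude_regex and re.fullmatch(config.table_exclude_regex, table):
        return False
    return True


def checkers_alerting_on(checkers: List[RefineJobConfig], table: str) -> List[str]:
    # Names of the checker jobs that would alert on a failure flag for `table`.
    return [c.name for c in checkers if covers(c, table)]


# Two jobs sharing placeholder input/output paths but splitting the tables.
legacy = RefineJobConfig(
    "refine_eventlogging_legacy", "/in", "/out", table_include_regex="LegacyTable"
)
analytics = RefineJobConfig(
    "refine_eventlogging_analytics", "/in", "/out", table_exclude_regex="LegacyTable"
)
checkers = [checker_config_for(legacy), checker_config_for(analytics)]
```

With these assumed filters, `checkers_alerting_on(checkers, "SomeOtherTable")` names only the analytics checker, so the alert points at the job that actually refines the partition.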

However, does this defeat the purpose? Maybe RefineFailuresChecker exists to catch failures from bugs that Refine itself might miss. Probably not, though, since Refine is the one writing the _REFINE_FAILED flag anyway.

Alternatively, we could just rename the monitor_refine_eventlogging_legacy_failure_flags job to something more generic, based only on the input and output paths.

Event Timeline

fdans triaged this task as Medium priority. Mar 1 2021, 5:25 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.
Ottomata claimed this task.

Done in recent refactors.