
Set up polling/alerting for check_esc_policy_config
Open, Medium, Public

Description

Follow up task

As a follow-up to this past weekend's misconfiguration, which delayed paging, victorops.py now has a check_esc_policy_config subcommand.

For every given escalation policy ID, it checks that there is at least one escalation step with timeout=0 that also triggers a rotation_group (the API's internal name for an oncall rotation).
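To make the rule concrete, here is a minimal sketch of that check in Python. It is not the actual victorops.py code: the function name and the dict shape (a "steps" list whose items carry "timeout" and "entries", each entry tagged with an "executionType") are assumptions made for illustration, and the real API payload may differ.

```python
# Illustrative only: the policy layout below is an assumed shape,
# not a verified copy of what the API returns.
def has_immediate_rotation_step(policy):
    """Return True if any step with timeout 0 triggers a rotation_group."""
    for step in policy.get("steps", []):
        if step.get("timeout") != 0:
            continue  # only steps that fire immediately count
        for entry in step.get("entries", []):
            if entry.get("executionType") == "rotation_group":
                return True
    return False


# Hypothetical policy that pages the rotation right away.
good_policy = {
    "steps": [
        {"timeout": 0, "entries": [{"executionType": "rotation_group"}]},
        {"timeout": 15, "entries": [{"executionType": "user"}]},
    ]
}

# Hypothetical misconfigured policy: the rotation is only reached after a delay.
bad_policy = {
    "steps": [
        {"timeout": 0, "entries": [{"executionType": "email"}]},
        {"timeout": 30, "entries": [{"executionType": "rotation_group"}]},
    ]
}

print(has_immediate_rotation_step(good_policy))
print(has_immediate_rotation_step(bad_policy))
```

The second policy is the kind of setup the check is meant to catch: a rotation is configured, but nobody in it is paged until the first timeout expires.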

In WMF production, this should be routinely called on the policy IDs for both batphone and business hours.

The command has Nagios/Icinga semantics for exit codes, so it would be fine to simply add it as a check_command, or to run it from a systemd timer that is then monitored.
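For reference, the Nagios/Icinga plugin convention is exit code 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The sketch below shows one way per-policy results could be folded into a single status. The function name, the policy IDs, and the choice of which condition maps to CRITICAL versus UNKNOWN are assumptions here, not a description of what victorops.py actually does.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def nagios_status(results):
    """Fold per-policy results into (exit_code, message).

    results maps a policy ID to True (check passed), False (check failed)
    or None (the policy could not be checked). This mapping is an assumption.
    """
    failed = sorted(pid for pid, ok in results.items() if ok is False)
    unchecked = sorted(pid for pid, ok in results.items() if ok is None)
    if failed:
        return CRITICAL, "CRITICAL: no immediate rotation step in " + ", ".join(failed)
    if unchecked:
        return UNKNOWN, "UNKNOWN: could not check " + ", ".join(unchecked)
    return OK, "OK: %d policies checked" % len(results)


# Hypothetical policy IDs; a real wrapper would pass the code to sys.exit().
code, message = nagios_status({"pl_batphone": True, "pl_business_hours": False})
print(code, message)
```

Whichever way it is wired up, the monitoring side only needs the exit code and the first line of output.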

I'll leave implementing that up to @fgiunchedi or @herron :)