
Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers
Closed, ResolvedPublic

Description

In VictorOps / Splunk On-Call parlance, business hours oncall is implemented as an "Escalation Policy".

The Batphone is also an "Escalation Policy", and one escalation policy can trigger another if alerts go un-acked for a configured interval.

This is all fine and good; however, the docs state:

Note: If there is no on-call user scheduled in a rotation at the time when this escalation action is triggered, the resulting behavior is that no page will occur in this step. The time delay before the next step will remain as configured. For example, if an incident triggers an Escalation Policy during off-hours and there is no one on call in the rotation to immediately page, the escalation policy will page no one and then wait however long is specified before executing step two.

The VictorOps entity that provides "glue" between an incoming alert and the appropriate Escalation Policy is known as a "Routing Key".

Currently we have a few relevant ones: icinga, netops, sre-batphone, and the default (fall-through). All four are set to trigger the SRE Business Hours escalation policy.

Therefore, when the business hours rotation has no one on call, alerts are delayed by 5 minutes before they reach the batphone.
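To make the delay concrete, here is a minimal sketch of the documented behavior. It is not VictorOps code: the `(delay, oncallers)` step representation and the function name are assumptions for illustration, with delays counted in minutes from the moment the incident triggers.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical model of an escalation policy: each step is
# (minutes after the incident triggers, people on call for that step).
Step = Tuple[int, Sequence[str]]


def minutes_until_first_page(steps: Sequence[Step]) -> Optional[int]:
    """Return the delay before anyone is actually paged, or None if no one is."""
    for delay_minutes, oncallers in steps:
        # Per the docs, a step with an empty rotation pages no one,
        # but the wait before the next step still applies.
        if oncallers:
            return delay_minutes
    return None


# Off-hours: the business hours rotation is empty, so the page waits for step 2.
off_hours = [(0, []), (5, ["batphone"])]
# Business hours: someone is on shift, so step 1 pages immediately.
business_hours = [(0, ["sre-on-shift"]), (5, ["batphone"])]

print(minutes_until_first_page(off_hours))       # 5
print(minutes_until_first_page(business_hours))  # 0
```

Setting step 2 to fire immediately, i.e. `[(0, []), (0, ["batphone"])]`, brings the off-hours delay back to 0 in this model.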

In the interests of expediency I recommend working around this ourselves.

We already have code that can check the membership of the current business hours rotation -- it was committed to Klaxon last week.

This code is already installed and configured and running on role::alerting_host hosts.

We could create a new routing key, direct-batphone, that would page the batphone escalation policy directly, skipping business hours.

Then it would be trivial to poll the VO API once a minute and (say) write a file to disk containing the Routing Key to use: the usual value if there are business hours oncallers, or direct-batphone if there are not.
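A rough sketch of what that poller's core could look like. The function names, the file path, and the idea that the oncaller list is fetched elsewhere are all assumptions for illustration; the existing Klaxon code for checking rotation membership is not reproduced here.

```python
import os
import tempfile
from typing import Sequence


def choose_routing_key(oncallers: Sequence[str], usual_key: str,
                       fallback_key: str = "direct-batphone") -> str:
    """Use the usual key while someone is on shift, else skip business hours."""
    return usual_key if oncallers else fallback_key


def write_routing_key(path: str, key: str) -> None:
    """Write the key atomically so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".routing_key.")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(key + "\n")
        # os.replace is atomic when source and destination share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    # Hypothetical: in practice this list would come from the VO API poll.
    current_oncallers: Sequence[str] = []
    out_path = os.path.join(tempfile.gettempdir(), "vo_routing_key")
    write_routing_key(out_path, choose_routing_key(current_oncallers, "icinga"))
```

Each integration would then read the file before sending an alert, which is why this option needs changes on the sending side.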

Event Timeline

BTW, if it would be helpful for me to add a CLI utility to fetch the current business hours oncallers, I'm very happy to do so!

As a very near term stopgap to avoid delays over the weekend I'll be updating the VO config to page batphone immediately towards the end of the day here (Eastern).

JFTR the change is being made in Teams > SRE > Escalation Policies > SRE Business Hours (Escalation). Changing Step 2 from "after 5 minutes" to "immediately". Please see screen shot here as well.

Screen Shot 2022-07-22 at 1.43.25 PM.png (1,934×1,874 px, 217 KB)

At the beginning of the next business hours shift we can consider reverting this (switching step 2 from "immediately" to "after 5 minutes") to reinstate the 5 min timeout before escalation.

We have also raised an issue with VO about this, at a minimum to document the problem as it relates to our workflow and to log a request for an "immediately escalate if no one is currently oncall in this rotation" option. As an aside: the current behavior strikes me as a (perhaps known/documented?) bug. I have great difficulty imagining a scenario where allowing critical pages to be escalated to no one would be desired.

Yeah... I made the proposal I did because it did seem like it was a mis-feature that VO had chosen to document rather than to fix, or change, or make configurable. Thanks for raising it with them.

Amusingly they also document this, which at least personally I think of as a severe anti-pattern, or at best, a confusion of a monitoring tool with an escalation tool: https://help.victorops.com/knowledge-base/waiting-room/

Are the people oncall supposed to switch this twice every day? I've now changed it back to 5 minutes, but the oncall people in NA will need to switch it back to immediately at the end of their shift.

cc @Jelto @Dzahn @ssingh

For now I think this is the best approach, unless we want to implement the workaround I proposed above while herron is OOO.

@SLyngshede-WMF had suggested to perhaps use the VictorOps API to automate the switching of time windows between 'after 5 minutes' and 'immediate' for the business hours rotation, however, as it turns out this is not possible:

For the purpose of using this API, escalation policies are treated as immutable. The policy will not be able to be updated or modified in any way via this API once created. It will only be able to be deleted. However, The escalation policies are accessable in the UI once created and can be updated from there.

[sic]

Feedback from Splunk support (anonymized):

the primary reason for the behavior functioning as it does is to allow for what we call Waiting Room functionality. In cases where incidents often automatically resolve themselves within a particular period of time, people can configure escalation policies to only page them if an incident is still triggered after X amount of minutes. We'll sometimes see customers configure the first step to page a business hours rotation as they don't mind a potentially unnecessary ping in the middle of a business day, then only page a 24x7 rotation if it's still unacked/resolved after 15 or so minutes. This way, it only pages them off-hours if it fails to resolve itself in those first 15 minutes. Abiding by the time delays stated in the UI also allows for a number of less-common configuration options.

As for escalating this improvement, we have this feature request (https://ideas.splunk.com/ideas/VOID-I-319) documented for the improvement. If you wouldn't mind logging into the Ideas Portal (separate login from your Splunk On-Call, anyone is welcome to create an account) and add your details alongside the request, it'll help our product team understand the urgency behind it. Apologies that we don't have anything that more directly solves this situation with your current setup!

I have expressed our interest in this feature through support and voting in the portal.

On a separate note, looking through the API and docs, I don't see many options to cover the gap other than an offset batphone schedule.

Mechanically, it seems we would need 2 additional rotations:

  • One partial-day rotation: a schedule with one or more partial-day shifts (ex: Mon-Fri 10 PM - 8 AM UTC) to cover gaps within the week (~8 hours of offset between Americas -> EMEA)
  • One multi-day rotation: a schedule with one or more multi-day shifts (ex: Fri 10 PM - Mon 8 AM UTC)

So I think we have three options here:

  1. My originally-proposed routing key hack
  2. Doing some API calls to delete and re-create the initial escalation policy depending on business hours vs not
  3. An offset batphone schedule as proposed by Leo

The downside of #1 is that it requires changes to every integration with VO that we have (which I think is just Klaxon, Icinga, and Alertmanager?). The upside is that the moving parts it adds aren't especially brittle, and it doesn't create any race conditions that might cause alerts to be dropped. It's also transparent to oncallers.

The downside of #2 is that we do potentially create a race condition, since there will be some small interval where no escalation policy is in place. We could work around this by also having the script poll for any new alerts raised between when it starts and finishes. The upside of #2 is that it is transparent to oncallers.

The downside of #3 is that it's not possible to automatically handle mid-day gaps, which often occur when EUTZ oncallers need to stop their days early and/or USTZ oncallers need to start their days late. It's also manual work. The upside is that it doesn't add moving parts outside of VictorOps itself.

wrt option #2: having just tried it via curl, it looks like we will have to delete and re-create both the escalation policy and the routing keys involved.

A slightly weird way of handling the issue automatically could be using Selenium. It seems a bit overkill for a minor issue, but if it's something we want to automate, that could be a way to do it.

I've been thinking about this problem in recent days. Long term, we will most likely address it through T323958: Evaluate AQS use of GrowthExperiments new impact dashboard. Short term, however, there seems to be no ideal path. IMO #1 has fewer moving parts and is possibly safer.

lmata updated Other Assignee, added: herron.

Change 819750 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/software/klaxon@master] Add VictorOps CLI tool & escalate_unpaged command

https://gerrit.wikimedia.org/r/819750

During discussion with @fgiunchedi today we came up with another option, which we think is the simplest and easiest to implement. This is to simply poll the API for incidents that are un-acked and also haven't paged anyone, and then issue the incidents/reroute API call to make sure they are hitting the batphone rotation as well.

In the current configuration, incidents only get into this state if the business hours rotation is presently empty.

We can run this as a systemd timer from the alerting_hosts. Several times a minute is no problem. The mutating API call is idempotent so there's no worry about race conditions.

We do need to provide a username that will be displayed in the VO UI as the "user" who "rerouted" the page. I suggest we create a dummy user escalator_sysuser.
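A minimal sketch of the selection and reroute steps, kept free of network calls. The field names (`currentPhase`, `pagedUsers`, `incidentNumber`), the payload shape, and the policy slug are assumptions about the VO API and should be checked against its reference before use; this is not the code from the Klaxon patch.

```python
from typing import Any, Dict, Iterable, List

Incident = Dict[str, Any]


def find_unpaged(incidents: Iterable[Incident]) -> List[Incident]:
    """Select incidents that are still un-acked and have paged no one."""
    return [
        incident for incident in incidents
        # Assumed field names; an absent or empty pagedUsers means nobody was paged.
        if incident.get("currentPhase") == "UNACKED"
        and not incident.get("pagedUsers")
    ]


def build_reroute_payload(incidents: Iterable[Incident], username: str,
                          policy_slug: str) -> Dict[str, Any]:
    """Build a request body (assumed shape) sending each incident to one policy."""
    return {
        "userName": username,  # shown in the VO UI as the user who rerouted
        "reroutes": [
            {
                "incidentNumber": incident["incidentNumber"],
                "targets": [{"type": "EscalationPolicy", "slug": policy_slug}],
            }
            for incident in incidents
        ],
    }


if __name__ == "__main__":
    sample = [
        {"incidentNumber": "101", "currentPhase": "UNACKED", "pagedUsers": []},
        {"incidentNumber": "102", "currentPhase": "UNACKED", "pagedUsers": ["alice"]},
        {"incidentNumber": "103", "currentPhase": "ACKED", "pagedUsers": []},
    ]
    # "batphone-policy-slug" is a placeholder, not a real policy identifier.
    print(build_reroute_payload(find_unpaged(sample), "escalator_sysuser",
                                "batphone-policy-slug"))
```

Because the selection is recomputed from current API state on every run, an incident that has already been rerouted and has paged someone drops out of the list on the next pass.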

Change 820100 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] WIP klaxon: run VO escalation of unpaged incidents

https://gerrit.wikimedia.org/r/820100

Change 819750 merged by jenkins-bot:

[operations/software/klaxon@master] Add VictorOps CLI tool & escalate_unpaged command

https://gerrit.wikimedia.org/r/819750

Change 820100 merged by Filippo Giunchedi:

[operations/puppet@production] klaxon: run VO escalation of unpaged incidents

https://gerrit.wikimedia.org/r/820100

Change 820135 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] klaxon: fix escalate_unpaged usage

https://gerrit.wikimedia.org/r/820135

Change 820135 merged by Filippo Giunchedi:

[operations/puppet@production] klaxon: fix escalate_unpaged usage

https://gerrit.wikimedia.org/r/820135

lmata triaged this task as Medium priority. Aug 3 2022, 2:55 PM
lmata moved this task from Inbox to In progress on the SRE Observability (FY2022/2023-Q1) board.
CDanis raised the priority of this task from Medium to High.

Change 820439 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/software/klaxon@master] Print VO API response when we do escalate

https://gerrit.wikimedia.org/r/820439

Change 820439 merged by jenkins-bot:

[operations/software/klaxon@master] Print VO API response when we do escalate

https://gerrit.wikimedia.org/r/820439

Change 820800 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] vo-escalate: actually run every 15 seconds

https://gerrit.wikimedia.org/r/820800

Change 820800 merged by CDanis:

[operations/puppet@production] vo-escalate: actually run every 15 seconds

https://gerrit.wikimedia.org/r/820800

Very elegant solution considering the circumstances! Added some high level docs to wikitech as a reminder for the future https://wikitech.wikimedia.org/wiki/Splunk_On-Call#Automatic_unpaged_alert_re-routing_(klaxon/victorops.py_escalate_unpaged)

Change 821287 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/software/klaxon@master] Add a check_esc_policy_config subcommand

https://gerrit.wikimedia.org/r/821287

Change 821287 merged by jenkins-bot:

[operations/software/klaxon@master] Add a check_esc_policy_config subcommand

https://gerrit.wikimedia.org/r/821287

As a followup to this past weekend's misconfiguration that delayed paging, victorops.py now has a check_esc_policy_config subcommand.

For every given escalation policy ID, it checks that there is at least one escalation step with timeout=0 that also triggers a rotation_group (the API's internal name for an oncall rotation).

In WMF production, this should be routinely called on the policy IDs for both batphone and business hours.

The command has Nagios/Icinga semantics for exit codes so it would be fine to simply add it as a check_command, or to run it as a systemd timer which is then monitored.
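For readers who want the gist of the check without opening the patch, here is a sketch of the logic as described above. The dict layout (`steps`, `entries`, `executionType`) is an assumed shape for illustration and may not match the API response or the actual victorops.py implementation; only `timeout` and `rotation_group` come from the description.

```python
from typing import Any, Dict, Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2

Policy = Dict[str, Any]


def has_immediate_rotation_step(policy: Policy) -> bool:
    """True if some step has timeout 0 and triggers a rotation_group."""
    return any(
        step.get("timeout") == 0
        and any(entry.get("executionType") == "rotation_group"
                for entry in step.get("entries", []))
        for step in policy.get("steps", [])
    )


def check_policies(policies: Dict[str, Policy]) -> Tuple[int, str]:
    """Return (exit code, status line) covering every given policy ID."""
    bad = sorted(policy_id for policy_id, policy in policies.items()
                 if not has_immediate_rotation_step(policy))
    if bad:
        return CRITICAL, "CRITICAL: no immediate rotation step in: " + ", ".join(bad)
    return OK, f"OK: {len(policies)} policies page a rotation immediately"


if __name__ == "__main__":
    # Hypothetical policy data; a real run would fetch this from the VO API.
    example = {
        "pol-batphone": {"steps": [
            {"timeout": 0, "entries": [{"executionType": "rotation_group"}]},
        ]},
        "pol-business-hours": {"steps": [
            {"timeout": 5, "entries": [{"executionType": "rotation_group"}]},
        ]},
    }
    code, message = check_policies(example)
    print(message)
    raise SystemExit(code)
```

With the example data, the business hours policy has no timeout=0 step, so the script reports CRITICAL and exits 2, which is the kind of misconfiguration the check is meant to catch.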

I'll leave implementing that up to @fgiunchedi or @herron :)