Evaluate viable candidates for incident paging
Open, Medium, Public

Description

The SRE observability team wants to evaluate migrating away from our current paging implementation, Splunk On-Call (formerly VictorOps, VO). We have noted past issues with it in T274663: Icinga meta monitoring recovery didn't resolve VO page and T264016: Host page did not auto-resolve in VO, as well as minor problems with app sessions in T284215: Splunk oncall / victorops mobile app logout tracking. We have also run into a functionality gap that impacts our business-hours on-call paging rotation: T313603: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers.

We have also collected feedback on the user experience of Splunk On-Call and identified the following deficiencies:

  • Exporting the calendar to periodically review the roster and exceptions is clunky, as you cannot filter out rotations (batphone)
  • There is no way to get shift start/end notifications for users who do not use the mobile app
  • Shift start/end notifications are not configurable (e.g., we would like to be notified on Friday for a shift that starts on Monday)
  • Exception scheduling is not intuitive (having comparable timezone views for folks in different locations would help when creating overrides)
  • The "Immediately proceed to next step if no one is on call in current step" escalation policy feature is missing. This has led to delayed after-hours pages (the page is routed to no one until the 5m timeout is reached), as well as confusion about whether an alert had resolved within that delay window.
  • The system lacks an administrative audit trail (e.g. log of configuration changes)
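
To make the escalation gap above concrete, here is a minimal sketch of the behavior we are missing. It is not how Splunk On-Call or any candidate implements escalation; the `Step` type, the `first_page` function, the `skip_empty_steps` flag, and the example user names are all hypothetical, and the 5-minute timeout is taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One escalation step (hypothetical model, for illustration only)."""
    name: str
    oncallers: list[str] = field(default_factory=list)
    timeout_minutes: int = 5  # assumed per-step timeout, matching the 5m delay described above


def first_page(steps, skip_empty_steps):
    """Return (step name, minutes elapsed) for the first step that pages someone.

    Returns None if no step has anyone on call.
    """
    elapsed = 0
    for step in steps:
        if step.oncallers:
            return step.name, elapsed
        # Nobody is on call in this step: either wait out the timeout
        # (the behavior we see today) or move on immediately.
        if not skip_empty_steps:
            elapsed += step.timeout_minutes
    return None


# After hours: nobody in the business-hours rotation, batphone is staffed.
policy = [Step("business-hours"), Step("batphone", ["alice", "bob"])]

print(first_page(policy, skip_empty_steps=False))  # page reaches batphone after the timeout
print(first_page(policy, skip_empty_steps=True))   # page reaches batphone immediately
```

With `skip_empty_steps=False` the empty step still consumes its full timeout before anyone is paged, which is the delay reported in T313603. A candidate tool that supports the equivalent of `skip_empty_steps=True` would close that gap.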

The above situation presents an opportunity to re-assess past decisions, iterate, and propose a solution that more closely matches our current needs. In addition, one of the factors that led to our current selection was the possibility of using VO to annotate and communicate during incidents. In retrospect, this functionality has either not been leveraged or is deficient, prompting T313228: Deploy Dispatch for SRE incident workflow automation to address "in-incident" tooling and communication gaps.

Background:
The original V.O. requirements doc: https://docs.google.com/document/d/1FCbR_R7_itnLcJRJd05S2ZLBTS1lwoI9M2wjHiiGTE4/edit?usp=sharing
These items informed the matrix: https://docs.google.com/spreadsheets/d/1CLILuWaVY4tK6zz3ZXFJy5NO8fq5Nx8eYjwPky3EJnM/edit#gid=0

Todo:

Event Timeline

lmata renamed this task from evaluate Grafana OnCall as a viable replacement for Splunk On-Call (formerly VictorOps) for incident paging to Evaluate Grafana OnCall as a viable replacement for Splunk On-Call (formerly VictorOps) for incident paging. Jul 27 2022, 8:14 PM
lmata triaged this task as Medium priority.
lmata updated the task description.
herron updated the task description.

Grafana also announced Grafana Incident this year, which AIUI is presently available in beta via Grafana Cloud: https://go2.grafana.com/incident-beta-interest.html

Since we'll be evaluating Grafana OnCall here, and are also working to deploy incident management tooling, I think it'd be worth including an evaluation of Grafana Incident. At a minimum, it would be helpful to understand if/when an open source release should be expected, as well as to what degree continuing our current approach of externally hosted escalation/incident tooling (VO, Statuspage) makes sense.

lmata renamed this task from Evaluate Grafana OnCall as a viable replacement for Splunk On-Call (formerly VictorOps) for incident paging to Evaluate viable candidates for incident paging. Apr 13 2023, 2:16 AM
lmata updated the task description.