The SRE observability team wants to evaluate migrating from our current paging implementation. We have noted past issues with it in T274663: Icinga meta monitoring recovery didn't resolve VO page and T264016: Host page did not auto-resolve in VO, as well as minor problems with app sessions in T284215: Splunk oncall / victorops mobile app logout tracking. We have also run into a functionality gap that impacts our business-hours on-call paging rotation: T313603: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers.
We have also collected feedback on the user experience of Splunk On-Call and identified the following deficiencies:
- Exporting the calendar to view roster/exceptions periodically is clunky as you cannot filter out rotations (batphone)
- There is no way to get shift start/end notifications for users who do not use the mobile app
- Shift start/end notifications are not configurable (e.g., we would like to be notified on Friday for a shift that starts on Monday)
- Exception scheduling is not intuitive (having comparable timezone views for folks in different locations would help to create overrides)
- The "Immediately proceed to next step if no one is on call in current step" escalation policy feature is missing. This has led to delayed after-hours pages (the page is routed to no one until the 5m timeout is reached), as well as confusion when an alert resolves within that delay window.
- The system lacks an administrative audit trail (e.g. log of configuration changes)
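The escalation gap from the list above can be illustrated with a minimal sketch. This is not Splunk On-Call's implementation or API: `Step`, `time_to_first_page`, and the `skip_empty_steps` flag are hypothetical names chosen for illustration, the user names are placeholders, and the 5-minute timeout is taken from the behaviour reported in T313603.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One escalation step (hypothetical model, not a vendor data structure)."""
    name: str
    on_call: List[str] = field(default_factory=list)
    timeout_min: int = 5  # assumed wait before escalating to the next step


def time_to_first_page(steps: List[Step], skip_empty_steps: bool) -> Optional[int]:
    """Return the minutes that pass before anyone is paged, or None if no step has an on-caller."""
    elapsed = 0
    for step in steps:
        if step.on_call:
            # Someone is on call in this step, so the page goes out now.
            return elapsed
        if not skip_empty_steps:
            # Current behaviour: the page waits out the full timeout even
            # though nobody can acknowledge it.
            elapsed += step.timeout_min
    return None


# Outside business hours the first rotation is empty (placeholder names).
policy = [
    Step("business-hours rotation", on_call=[]),
    Step("batphone", on_call=["alice", "bob"]),
]

print(time_to_first_page(policy, skip_empty_steps=False))  # 5
print(time_to_first_page(policy, skip_empty_steps=True))   # 0
```

With the flag off, the batphone is paged only after the empty step times out; with it on, the empty step is skipped and the page goes out immediately.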
The above situation presents an opportunity to re-assess past decisions, iterate, and propose a solution that more closely matches our current needs. In addition, one of the factors that led to our current selection was the possibility of using VO to annotate and communicate during incidents. In retrospect, this functionality has either not been leveraged or is deficient, prompting T313228: Deploy Dispatch for SRE incident workflow automation to address "in-incident" tooling and communication gaps.
Background:
The original VO requirements doc: https://docs.google.com/document/d/1FCbR_R7_itnLcJRJd05S2ZLBTS1lwoI9M2wjHiiGTE4/edit?usp=sharing
These items informed the matrix: https://docs.google.com/spreadsheets/d/1CLILuWaVY4tK6zz3ZXFJy5NO8fq5Nx8eYjwPky3EJnM/edit#gid=0
Todo:
- document updated business case for incident paging
- document requirements / gaps in current solution https://docs.google.com/document/d/1qa60IAreTgbOsafgAJMLJc2OAjb7i7VkosXVKIuGwzM/edit#heading=h.vyt1m0p2t0j6
- evaluate tool from a user perspective (with the goal of reducing current friction in the process)
- define and document an implementation plan (technical, comms, timeline) if the evaluation is promising