
implementing an incident response workflow automation tool for SRE
Open, Medium, Public

Description

Q: What tools should we implement to facilitate responding to incidents?

Incident responders, on-call staff, incident coordinators, and the onfire team currently do a nontrivial amount of manual work and paperwork to keep an incident response (IR) effort moving forward. We want to reduce the amount of "toil" in the current incident workflow.

This is a generic, big-picture wishlist:

  • Incident Lifecycle Management
    • Single Intake
      • Create incident from klaxon/phabricator
      • Automated create incident (from manual pages, alertmanager, icinga)
  • Google Docs integration (maybe auto-create notes doc)
  • Integration with existing tooling for annotations (grafana, logstash/OpenSearch)
  • Support for incident stats (duration, category, severity, impact) and additional relevant metadata
  • SLOs, Metric Reports, and other trending data
  • Integration with Klaxon (future state)
  • Integration with Public Status Page (future state)
  • Support for internal notices and messages
  • Potential integration with Phabricator Ticketing Workflow
  • Chat (IRC) Integrations / bots
  • Slack support (for fallback)
  • Splunk on call / VictorOps integration
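
As a rough illustration of the "incident stats" item above, here is a minimal sketch of what a per-incident record could look like. The class name, field names, and example values are all hypothetical; nothing here reflects an existing tool or an agreed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    title: str
    category: str
    severity: str
    impact: str
    started: datetime
    resolved: Optional[datetime] = None
    # Free-form bucket for the "additional relevant metadata" wishlist item.
    metadata: dict = field(default_factory=dict)

    def duration(self) -> Optional[timedelta]:
        """Return elapsed time, or None while the incident is still open."""
        if self.resolved is None:
            return None
        return self.resolved - self.started


# Made-up example values, for illustration only.
incident = Incident(
    title="Example outage",
    category="network",
    severity="high",
    impact="elevated error rate for readers",
    started=datetime(2022, 5, 17, 1, 0),
    resolved=datetime(2022, 5, 17, 1, 37),
)
print(incident.duration())  # 0:37:00
```

Keeping duration derived from the two timestamps, rather than stored, avoids one more field that has to be kept in sync by hand.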

Additional considerations that might need a decision:

  • Hosting our incident response tools out of band
  • Build vs. buy vs. rent (SaaS)
  • Open source vs. commercial
  • Privacy review for SaaS or hosted solutions

Update

We have selected Netflix's Dispatch as our tool of choice for this project. https://github.com/Netflix/dispatch

Update Update

Netflix's Dispatch has proven unlikely to suit our needs due to limitations of the platform, the relative immaturity of the product, and the difficulty of establishing a workflow that fits WMF's needs.

The current living document with proposals lives at https://docs.google.com/document/d/1UqNRU0_jv66VLGyrY8QOYKqrM8rn3Iuv70t0JcoNX3M

Related Objects

Status | Assigned
Open | None
Declined | None
Resolved | herron
Declined | fgiunchedi
Resolved | fgiunchedi
Declined | andrea.denisse
Declined | None
Declined | None
Resolved | herron
Declined | None
Resolved | herron
Declined | None
Declined | None
Declined | None
Declined | None
Declined | None
Resolved | Eevans
Resolved | jhathaway
Resolved | fgiunchedi
Resolved | fgiunchedi
Resolved | fgiunchedi
Resolved | Eevans
Resolved | Eevans
Resolved | Eevans
Resolved | BCornwall
Resolved | BCornwall
Resolved | andrea.denisse
Resolved | BCornwall
Resolved | Eevans
Resolved | Eevans
Resolved | Eevans
Resolved | Eevans
Resolved | jhathaway
Resolved | Eevans
Resolved | Eevans
Resolved | Eevans
Resolved | Eevans
Open | None
Open | None
Open | None
Duplicate | None
Open | None
Resolved | Eevans

Event Timeline

lmata renamed this task from untitled masterwork on incident response automation to implementing an incident response workflow automation tool for SRE. May 17 2022, 1:37 AM
lmata triaged this task as Medium priority.
lmata updated the task description.
lmata added a project: SRE-OnFire.

After some discussion in our last ONFIRE meeting, it appears that our most basic needs consist of:

  1. A real-time editor for in-the-moment information management
  2. A technical platform that allows us to manage immediate incident-related action items as well as any follow-up items post-incident
  3. A predictable place for "finished" incident documents to live for later consumption

The rest are "nice-to-have" integrations of varying importance (like notifying people on IRC/Slack).

Our current setup uses Google Docs for 1, Phabricator for 2, and Wikitech for 3. I propose we start with these first three requirements, see which solutions work better than our current one, and then work our way down through the integrations/niceties. We've been going in circles trying to come up with some sort of integrated approach but tend to keep coming back to a slightly altered version of the present one.

A recurring pain point is information duplication/transference: synchronizing the same information between our bug report, our real-time knowledge/graphs, IRC, and the final Wikitech article is painful.
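
One way to picture a fix for the duplication problem is a single incident record from which each destination's text is generated, rather than retyped. The sketch below is only an illustration of that idea: the record keys, function names, and output layouts are hypothetical, and it does not talk to IRC, Phabricator, or Wikitech.

```python
from typing import Dict

# Hypothetical single source of truth for one incident; keys are illustrative.
incident: Dict[str, str] = {
    "title": "Example outage",
    "status": "mitigated",
    "summary": "Elevated error rate; traffic rerouted.",
}


def render_irc(record: Dict[str, str]) -> str:
    """One-line status update suitable for pasting into a chat channel."""
    return f"[{record['status'].upper()}] {record['title']}: {record['summary']}"


def render_wikitext(record: Dict[str, str]) -> str:
    """Skeleton of a wiki page using MediaWiki heading syntax."""
    return (
        f"== {record['title']} ==\n"
        f"'''Status:''' {record['status']}\n\n"
        f"=== Summary ===\n{record['summary']}\n"
    )


print(render_irc(incident))
print(render_wikitext(incident))
```

With this shape, correcting the summary once updates every rendering, which is the property the current copy-and-paste flow lacks.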

Already-rejected ideas:

  • Remove Wikitech and just have the Google doc be the final resting place of the document (too easy to lose/change the doc)
  • Remove Google Docs and just use IRC for the real-time communication (IRC is too primitive to easily collect and maintain all of this info)
  • Remove Wikitech and use the ticket description instead (too difficult to locate the ticket)

The feedback I've heard in the meetings suggests to me that nobody wants to leave the current Phabricator/Google Docs/Wikitech setup (and if we did, we'd pretty much just be replacing one component with a very similar one).