Q: What tools should we implement to facilitate responding to incidents?
Incident responders, on-call staff, incident coordinators, and the onfire team currently do a nontrivial amount of manual work and paperwork to keep the incident response (IR) effort moving forward. We want to reduce this "toil" in the current incident workflow.
This is a generic, big-picture wishlist:
- Incident Lifecycle Management
- Single Intake
- Create incident from Klaxon/Phabricator
- Automated incident creation (from manual pages, Alertmanager, Icinga)
- Google Docs integration (maybe auto-create notes doc)
- Integration with existing tooling for annotations (Grafana, Logstash/OpenSearch)
- Support for incident stats (duration, category, severity, impact) and additional relevant metadata
- SLOs, Metric Reports, and other trending data
- Integration with Klaxon (future state)
- Integration with Public Status Page (future state)
- Should support internal notices and messages
- Potential integration with Phabricator Ticketing Workflow
- Chat (IRC) Integrations / bots
- Slack support (for fallback)
- Splunk On-Call / VictorOps integration
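As one illustration of the "incident stats" item above, here is a minimal sketch of what a single incident record might carry. The class name, field names, and sample values are all hypothetical and not taken from any tool under evaluation; only the stats themselves (duration, category, severity, impact) and the intake sources come from the wishlist.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; every field name here is an assumption."""
    title: str
    category: str
    severity: str
    impact: str
    source: str  # intake source, e.g. "klaxon", "phabricator", "alertmanager", "icinga"
    opened_at: datetime
    resolved_at: Optional[datetime] = None

    def duration(self) -> Optional[timedelta]:
        """Return the incident duration, or None while it is still open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.opened_at


# Made-up sample data, for illustration only.
example = Incident(
    title="Example outage",
    category="network",
    severity="high",
    impact="partial read outage",
    source="alertmanager",
    opened_at=datetime(2022, 1, 1, 12, 0),
    resolved_at=datetime(2022, 1, 1, 12, 45),
)
```

Keeping these fields in one structured record from intake onward is what would let reports and trending data be generated rather than assembled by hand.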
Additional considerations that might need a decision:
- Hosting our incident response tools out of band
- Build vs. buy vs. rent (SaaS)
- Open source vs. commercial
- Privacy review for SaaS or hosted solutions
Update
We have selected Netflix's Dispatch as our tool of choice for this project. https://github.com/Netflix/dispatch
Update Update
Netflix's Dispatch has proven unlikely to suit our needs because of limitations of the platform, the relative immaturity of the product, and the difficulty of establishing a workflow that fits WMF's needs.
The living document with current proposals is at https://docs.google.com/document/d/1UqNRU0_jv66VLGyrY8QOYKqrM8rn3Iuv70t0JcoNX3M