Define what an "incident" and "major incident" are for WME
Closed, Resolved, Public

Description

Context
WME Engineering has decided to improve the incident management process. The first step is to define what an incident and a major incident are for the team.

Standard definitions

Incident: Any unplanned disruption or degradation of service that is actively affecting customers' ability to use PagerDuty.
Major incident: Any incident that requires a coordinated response between multiple teams.

We don't have to use the above definitions; they're just a starting point. The team will come up with the right definitions for Enterprise. The point is that each definition should be a short, simple statement that ensures everyone is on the same page.

Goal
To remove any discussion around whether something is an incident or not during our response process. If we have a metric to use (e.g. "if errors go above 100/minute it's a major incident"), we should use it. If not, we should find another way to define what a major incident is.
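
As a rough illustration of the metric-based option, a rule like the one quoted above could be written down as a single check. This is only a sketch: the 100 errors/minute figure is the example from this task, not an agreed WME threshold, and the names `MAJOR_INCIDENT_ERRORS_PER_MINUTE` and `is_major_incident` are hypothetical.

```python
# Example value taken from the task description; NOT an agreed WME threshold.
MAJOR_INCIDENT_ERRORS_PER_MINUTE = 100


def is_major_incident(
    errors_per_minute: float,
    threshold: float = MAJOR_INCIDENT_ERRORS_PER_MINUTE,
) -> bool:
    """Return True when the error rate is strictly above the threshold.

    "Above" is read as strictly greater than, so exactly 100 errors/minute
    does not count as a major incident under the example rule.
    """
    return errors_per_minute > threshold


if __name__ == "__main__":
    for rate in (20, 100, 101):
        label = "major incident" if is_major_incident(rate) else "not a major incident"
        print(f"{rate} errors/minute: {label}")
```

Writing the rule as one comparison against a named constant means that whatever threshold the team agrees on lives in exactly one place, so there is nothing to debate during a response.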

Why are we starting here? (Rationale)
According to Incident Response best practices by PagerDuty, this is the most important first step because engineers can't respond to an incident until they know what an incident is:

If one person considers something an incident but the rest of the organization doesn't, that will create ambiguity and confusion during any sort of incident response. Having a clear definition that's disseminated to your entire organization ensures that everyone has the same understanding and will prevent any confusion.

Source: https://response.pagerduty.com/getting_started/

To Do

  • Define as a team what an incident and a major incident are for WMEnterprise
  • Agree as a team on a source of truth where these definitions will live