
Decision Request - Incident Response Process
Closed, ResolvedPublic

Description

Problem

When an incident occurs and the WMCS team responds to it, there is no defined process to follow. This can lead to uncertainty and delays in the response.

Constraints and risks

  • Not having a process could on some occasions result in a slow or ineffective response to an incident
  • At the same time, a process would involve additional work, and that could make the response slower instead of faster
  • We don't have a clear definition of an incident, or of when the WMCS team is responsible for one
  • We don't have many people in the team, and the work required to follow a process (e.g. writing detailed incident reports after an incident) might reduce the number of other things we can deliver

Decision record

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T348887_Incident_Response_Process

Options

Option 1

Adopt the Incident Response Process used by the Production SRE team, either without any change or with small changes that apply only to the WMCS team.

Pros:

  • battle-tested process
  • easier to work together when an incident involves both the WMCS team and other teams

Cons:

  • designed for a bigger team
  • designed for production services where incidents can have a much bigger impact compared to WMCS services

Option 2

Write a custom Incident Response Process for the WMCS team, taking inspiration from the Production SRE team but keeping our process separate.

Pros:

  • we can tailor it to our team
  • we can evolve the process independently

Cons:

  • more work to write the process and maintain it
  • potential source of confusion when an incident involves both the WMCS team and other teams

Option 2.1

Write a custom Incident Response Process that is a subset of the one used by the SRE team, with some minor adaptations to our case.
This includes having shared incident reviews, which means that we also go to non-WMCS incident reviews and other SREs come to ours (essentially, sharing the same space).
We can tweak the shared incident score card template to be reusable for WMCS (adding notes there for fields that don't make sense or should be interpreted differently).

Essentially:

  • Our own "how to handle a page", as that is quite different from SRE's (no wikimediastatus.net, no incident coordinator, only a few people on call, ...); this might have a section saying "if this is wider than WMCS -> follow SRE process"
  • Shared "how to document an outage", with minor tweaks (hopefully embedded in the shared doc)
  • Shared "how to follow up an incident", with minor tweaks (hopefully embedded in the shared doc)

Pros:

  • Reuses as much of the battle-tested process as we can (incident documentation and follow-up)
  • Adapts the most critical and custom parts to our unique use case
  • We get insights from SREs outside the team, and we give our point of view to others

Cons:

  • Some extra work to maintain our own, non-shared part of the process
  • Some extra work to attend SRE incident reviews

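As a rough illustration of the split proposed in Option 2.1, the sketch below maps each part of the process to either a WMCS-specific document or the shared SRE one, and applies the "if this is wider than WMCS -> follow SRE process" rule to page handling. The names `Phase`, `Owner` and `process_for` are hypothetical and exist only for this example; they do not refer to any existing tool, and the exact rule is an assumption based on the bullet points above.

```python
from enum import Enum


class Phase(Enum):
    # The three parts of the process named in Option 2.1.
    HANDLE_PAGE = "how to handle a page"
    DOCUMENT_OUTAGE = "how to document an outage"
    FOLLOW_UP = "how to follow up an incident"


class Owner(Enum):
    # Hypothetical labels for which process applies.
    WMCS = "WMCS-specific process"
    SRE = "SRE process"
    SRE_SHARED = "shared SRE process (with minor tweaks)"


def process_for(phase: Phase, wider_than_wmcs: bool) -> Owner:
    """Return which process applies to a phase (illustrative only)."""
    if phase is Phase.HANDLE_PAGE:
        # Page handling is the only part WMCS keeps separate, unless the
        # incident reaches beyond WMCS, in which case SRE's process applies.
        return Owner.SRE if wider_than_wmcs else Owner.WMCS
    # Documentation and follow-up are shared with SRE in both cases.
    return Owner.SRE_SHARED
```

For example, `process_for(Phase.HANDLE_PAGE, wider_than_wmcs=False)` returns `Owner.WMCS`, while the same call with `wider_than_wmcs=True` returns `Owner.SRE`.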
Option 3 (status quo)

Don't define any Incident Response Process and self-organize on a case-by-case basis.

Pros:

  • No additional work/bureaucracy

Cons:

  • Makes it easier to forget some important steps (e.g. acking the page, updating the status in IRC, writing an incident report, etc.)
  • Time can be lost discussing how to collaborate and how to divide responsibilities
  • Less transparency, as information is less likely to be shared during and after an incident
  • Harder to learn from past incidents, if incidents are resolved without writing reports/documentation

Event Timeline

Please consider this as a draft that we can improve together. Feel free to suggest additional pros/cons, or to say that you don't agree with something I wrote in the description. :)

fnegri renamed this task from Decision Request - Incident response process to Decision Request - Incident Response Process. Oct 13 2023, 5:40 PM

I think that for both options 1 and 2, the current SRE incident guidelines have too many things we might not need (e.g. incident coordinator, updating wikimediastatus.net, calling an SRE director, ...) to follow as is, so we will probably want to write our own, even if it follows the same structure, review process, etc.

Does this decision request include the incident review + followup items and such? (I'm guessing yes, but there's no explicit mention there).

One thing to keep in mind is that even SRE doesn't take every incident through the entire process, which would include a retro. So we could do something similar: despite having a process, we could choose which incidents go through it, and how far.

So to try to keep the subject going, I'll define a bit what 'incident response process' means to me (feel free to tell me otherwise!):

An Incident Response Process is a document that defines:

  • How to handle a page -> for the oncall engineer
  • How to document an incident -> for the oncall engineer + the rest of the team
  • How to follow up on an incident -> for the team

I'm going to propose Option 2.1:

Option 2.1

Write a custom Incident Response Process that is a subset of the one used by the SRE team, with some minor adaptations to our case.
This includes having shared incident reviews, which means that we also go to non-WMCS incident reviews and other SREs come to ours (essentially, sharing the same space).
We can tweak the shared incident score card template to be reusable for WMCS (adding notes there for fields that don't make sense or should be interpreted differently).

Essentially:

  • Our own "how to handle a page", as that is quite different from SRE's (no wikimediastatus.net, no incident coordinator, only a few people on call, ...); this might have a section saying "if this is wider than WMCS -> follow SRE process"
  • Shared "how to document an outage", with minor tweaks (hopefully embedded in the shared doc)
  • Shared "how to follow up an incident", with minor tweaks (hopefully embedded in the shared doc)

Pros:

  • Reuses as much of the battle-tested process as we can (incident documentation and follow-up)
  • Adapts the most critical and custom parts to our unique use case
  • We get insights from SREs outside the team, and we give our point of view to others

Cons:

  • Some extra work to maintain our own, non-shared part of the process
  • Some extra work to attend SRE incident reviews

Wdyt? (let me know if this is out of scope/or not what you were aiming to decide on xd)

Getting back to this after a while... I like option 2.1, and I think everything mentioned there is in scope for this task.

fnegri triaged this task as Medium priority. Feb 8 2024, 10:39 AM

In the absence of further comments, I'm proposing the following plan:

  • Resolving this Decision Request choosing option 2.1
  • I will take care of creating the first draft of a new page in wikitech describing how to respond to a WMCS paging alert (a WMCS version of this page)
  • Everyone in WMCS can review and suggest improvements to this new page (via the related Talk page)
  • The page will refer to the existing Google Docs template and Wiki report template. I think we can adopt both these templates with no change (maybe only a note about the Phabricator tags mentioned in the "Actionables" section)
  • WMCS incidents will be added to the list of incidents that are discussed at the existing Incident Review Ritual. As described in that page, "attendance is optional but highly recommended, especially for those that have responded to an incident".

fnegri changed the task status from Open to In Progress. Mar 6 2024, 5:01 PM

I have created a draft document that is a WMCS version of this page:
https://docs.google.com/document/d/1lE2Zq_P5wT6nMDxB_ai-_UVw5et6VVSeLornXKD9K3c/edit#heading=h.oiox6yqrxk2h

I will mention it in the next WMCS weekly meeting and make sure that it's reviewed by everyone who is part of the WMCS on-call rotation.

This is now on-wiki at https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Incident_Response_Process

I will schedule a "dry run" where we can simulate an incident to test the new process. This task will remain open until the "dry run" is done.

The "dry run" is scheduled for April 29th at 15:00 UTC. The plan is to simulate an incident by shutting down one or more codfw servers. No page will be sent and there will be no impact on end users.

We will try to follow the process by creating an Incident Document, choosing an Incident Coordinator, and writing an Incident Report.

I will collect feedback and suggestions after the dry run and use them to improve the Incident Response Process wiki page.

fnegri changed the task status from In Progress to Stalled. Apr 23 2024, 12:50 PM

2024-04-28 [WMCS] Toolforge Redis refusing connections

By chance, we had a real outage yesterday, which was a good chance to test the process. The incident doc is here and the incident report is here.

2024-04-29 [WMCS] ceph outage in codfw1dev drill

As planned, we also did an incident drill today, attended by @aborrero, @Andrew, @dcaro, @fnegri and @taavi. The incident doc is here; we will not create an Incident Report.

Second drill

We decided to do a second drill next week, where an actual alert will fire (but without impact on users). This is scheduled for Tue, May 7th at 15:00 UTC.

fnegri closed this task as Resolved. Edited May 10 2024, 1:31 PM

We did a second drill this week; the incident doc is here. We should consider doing more drills in the coming months, because the people who attended them found them useful.

I am resolving this Decision Request. I created a "Decision record" on wiki at https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T348887_Incident_Response_Process