
Decision request - Team oncall, alerting, schedules and processes
Closed, Resolved (Public)

Description

Problem

During periods of infrastructure instability, the current oncall schedules and processes, which optimize for breadth of attention, end up disturbing many people without giving those disturbed a clear path to action, so they fall back on informal communication and channels.

Constraints and risks

  • We should try to wake up people as little as possible
  • We should try to interrupt people as little as possible

Decision record

In progress

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T310598_Team_oncall_alerting_schedules_and_processes

Options

Option 1

Create a rotating "oncall duty" role, where one team member per time zone has the following responsibilities during their rotation:

  • they are the first to get paged
  • their main responsibility is to deal with interruptions (pages, alerts, broken things...)
  • their second responsibility is to improve the oncall/alerting setup (improve alerts, improve runbooks, cookbooks, add stability features to the system, ...)

This could be paired with the current "clinic duty".

Proposed schedule and zones:

Zone1:
From 15:00 UTC to 3:00 UTC
Members: Andrew, Nicholas?, RooK
Zone2:
From 3:00 UTC to 15:00 UTC
Members: Arturo, David

Rotating every week on Wednesday (team meeting day).

The person on alert duty gets the first page
10 min later, the rest of the Zone gets paged
10 min after that, the other Zone gets paged
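The zones and escalation steps above can be sketched as follows. This is only an illustration, not the actual paging configuration: the function names are hypothetical, the zone windows are assumed to be half-open (15:00 UTC belongs to Zone1, 3:00 UTC to Zone2), and the second escalation step is assumed to fire 20 minutes after the first page.

```python
from datetime import datetime, timezone

# Members per zone as proposed above ("Nicholas?" is still tentative there).
ZONE_MEMBERS = {
    "Zone1": ["Andrew", "Nicholas", "RooK"],  # 15:00 UTC to 3:00 UTC
    "Zone2": ["Arturo", "David"],             # 3:00 UTC to 15:00 UTC
}
ESCALATION_STEP_MINUTES = 10


def active_zone(now: datetime) -> str:
    """Return the zone whose window contains `now` (must be timezone-aware)."""
    hour = now.astimezone(timezone.utc).hour
    return "Zone2" if 3 <= hour < 15 else "Zone1"


def people_to_page(now: datetime, on_duty: str, minutes_since_page: int) -> list:
    """Return everyone who should have been paged so far, in escalation order."""
    zone = active_zone(now)
    other = "Zone1" if zone == "Zone2" else "Zone2"
    targets = [on_duty]
    if minutes_since_page >= ESCALATION_STEP_MINUTES:
        # First escalation: the rest of the active zone.
        targets += [m for m in ZONE_MEMBERS[zone] if m != on_duty]
    if minutes_since_page >= 2 * ESCALATION_STEP_MINUTES:
        # Second escalation: the other zone.
        targets += ZONE_MEMBERS[other]
    return targets
```

For example, a page at 16:00 UTC that nobody acknowledges reaches only the person on duty at first, the rest of Zone1 after 10 minutes, and Zone2 after 20 minutes.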

Create a set of alerting best practices and migrate the current alerts to follow them:
  • If it needs immediate attention -> page+task+email+irc
  • If it needs attention, but not immediate -> task+email+irc
  • If it does not need attention (essentially, for debugging/knowing the system status) -> irc or remove
  • Always create a runbook when you create an alert
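As a rough sketch, the routing rules above amount to a small lookup from urgency to notification channels. The enum, the function name and the runbook check below are hypothetical and only illustrate the rules; they are not an existing alerting configuration.

```python
from enum import Enum


class Urgency(Enum):
    IMMEDIATE = "needs immediate attention"
    NON_IMMEDIATE = "needs attention, but not immediately"
    INFORMATIONAL = "debugging / system status only"


# Channels per urgency level, taken from the list above.
# Informational alerts go to irc only (or are removed altogether).
CHANNELS = {
    Urgency.IMMEDIATE: ("page", "task", "email", "irc"),
    Urgency.NON_IMMEDIATE: ("task", "email", "irc"),
    Urgency.INFORMATIONAL: ("irc",),
}


def channels_for(urgency: Urgency, runbook_url: str) -> tuple:
    """Return the notification channels for an alert, requiring a runbook."""
    if not runbook_url:
        raise ValueError("every alert needs a runbook")
    return CHANNELS[urgency]
```

Keeping the mapping in one place makes it easy to review which alerts are allowed to page.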
Also create a common process for handling pages and alerts:
  • When a page happens:
    • If you don't have a laptop around yet and can act on it -> acknowledge on splunk/victorops
    • If you have a laptop around and can act on it -> silence on the source (for now, that will become alertmanager)
      • If the alert has source=icinga on alertmanager -> ack on icinga; click on the "X hours ago" text on alertmanager to see the link (the wikimedia.org one):

        (screenshot: alertmanager_link_to_icinga)

      • Otherwise ack on alertmanager (the tick button, see below, or a silence that starts with 'ACK! ')

        (screenshot: alertmanager_ack)

    • If you can't act on it, let it page other people
    • Do anything that needs doing to get rid of the urgency
      • If you need help, ping the people you think might be able to help; if unsure, ping another teammate
    • Make sure it has an associated task and populate it with what you found/did and the next steps
    • If the alert is not gone yet, ack the alert on alertmanager (not icinga) and attach the task id
    • If you acked on victorops, resolve on victorops too (otherwise it will page again after 1 day)
  • When an alert without a page happens:
    • Make sure it has an associated task (this can be automated for alertmanager tasks)
    • Ack on alertmanager (not icinga, only alertmanager) and add the task id to the ack comment

(This can be automated with a cookbook, something like cookbook wmcs.ack_page "subject of the page", or one that shows a list of active alerts to choose from.)
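A minimal sketch of what such an automation might build is shown below: a silence payload whose comment starts with 'ACK! ' and carries the task id. The function name, the alertname matcher and the 4 hour default duration are assumptions for illustration; the field names follow the Alertmanager v2 silence schema as commonly documented and should be checked against the deployed version before use. The sketch only builds the payload and does not send it.

```python
from datetime import datetime, timedelta, timezone


def build_ack_silence(alertname: str, task_id: str, author: str,
                      now: datetime, hours: int = 4) -> dict:
    """Build a silence payload that acks one alert and references its task."""
    return {
        "matchers": [
            # Hypothetical matcher: silence by exact alert name.
            {"name": "alertname", "value": alertname,
             "isRegex": False, "isEqual": True},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        # The 'ACK! ' prefix marks the silence as an acknowledgement.
        "comment": f"ACK! {task_id}",
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(build_ack_silence("ExampleAlert", "T310598", "example-user", now))
```

Attaching the task id to the comment keeps the link between the acked alert and its follow-up work visible to the rest of the team.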

Pros-cons

Pros:

  • This minimizes the interruptions for the rest of the team
  • This makes sure we invest in stabilizing/operationalizing the current systems
  • Every team member will have (limited) exposure to parts of the infrastructure they don't usually work on, increasing knowledge sharing
  • Lowers the total number of out-of-hours pages
  • Makes it clearer where to look and how to communicate in case of a page

Cons:

  • With the current size of the team, we would sometimes be oncall half of the days (though compared to all days, as it is now, I think it's an improvement)
  • Sometimes (especially at the beginning) the person paged might not know how to handle the page

Option 2

Do nothing.

Pros:

  • No new effort is required

Cons:

  • Nothing improves; in practice things deteriorate (alert fatigue, false alerts, false pages, ...)

Event Timeline

dcaro renamed this task from Decision request - Team oncall sechdules and process to Decision request - Team oncall scehdules and process. (Jun 14 2022, 11:06 AM)
dcaro renamed this task from Decision request - Team oncall scehdules and process to Decision request - Team oncall, alerting, scehdules and processes.
dcaro updated the task description.
dcaro updated the task description.
Aklapper renamed this task from Decision request - Team oncall, alerting, scehdules and processes to Decision request - Team oncall, alerting, schedules and processes. (Jun 14 2022, 11:37 AM)

This was approved in the team meeting of 29/06/2022; I will create the record shortly.