Decision request - Team oncall, alerting, schedules and processes
Closed, Resolved · Public

Authored By

	dcaro
	Jun 14 2022, 11:05 AM

Description

Problem

During periods of infrastructure instability, the current oncall schedules and processes, which optimize for breadth of attention, end up disturbing many people without giving those disturbed a clear course of action, relying instead on informal communication and channels.

Constraints and risks

We should try to wake up people as little as possible
We should try to interrupt people as little as possible

Decision record

In progress

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T310598_Team_oncall_alerting_schedules_and_processes

Options

Option 1

Create an "oncall duty" rotating role, where one team member for each time zone has the following responsibilities during their rotation:

they are the first to get paged
their main responsibility is to deal with interruptions (pages, alerts, broken things...)
their second responsibility is to improve oncall and alerting (improve alerts, improve runbooks and cookbooks, add stability features to the system, ...)

This could be paired with the current "clinic duty".

Proposed schedule and zones:

Zone1:
From 15:00 UTC to 3:00 UTC
Members: Andrew, Nicholas?, RooK
Zone2:
From 3:00 UTC to 15:00 UTC
Members: Arturo, David

Rotating every week on Wednesday (team meeting day).
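As an illustration of the weekly rotation, here is a minimal sketch in Python. The anchor date (ROTATION_START), the order of members within each zone, and the function names are assumptions made for the example, not part of the proposal. The handover is modelled at day granularity only, and Nicholas's membership was still marked as tentative above.

```python
from datetime import date

# Zone membership as listed in the proposal. The order within each zone is an
# assumption, and "Nicholas" was marked with a question mark in the proposal.
ZONE_MEMBERS = {
    "Zone1": ["Andrew", "Nicholas", "RooK"],
    "Zone2": ["Arturo", "David"],
}

# Hypothetical anchor: a Wednesday on which the first member of each zone starts.
ROTATION_START = date(2022, 7, 6)


def rotation_week(day: date) -> int:
    """Number of whole Wednesday-to-Wednesday weeks elapsed since ROTATION_START."""
    return (day - ROTATION_START).days // 7


def oncall_duty(zone: str, day: date) -> str:
    """Return the member holding the oncall duty role for a zone on a given day."""
    members = ZONE_MEMBERS[zone]
    return members[rotation_week(day) % len(members)]
```

For example, oncall_duty("Zone2", date(2022, 7, 13)) picks the second Zone2 member, since that date is the first handover after the assumed anchor.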

Oncall duty gets the first page
10 min later, the rest of the Zone gets paged
10 min after that, the other Zone gets paged
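The escalation chain above can be sketched as a small Python function. This is only an illustration: the function names, the group labels, and the treatment of the 3:00 and 15:00 UTC boundaries as half-open intervals are assumptions, and the real escalation would be configured in the paging tool rather than in code like this.

```python
from datetime import datetime, timezone
from typing import List

ESCALATION_STEP_MIN = 10  # "10 min" between escalation steps, as proposed


def active_zone(now_utc: datetime) -> str:
    """Zone2 covers 03:00-15:00 UTC, Zone1 covers 15:00-03:00 UTC.

    Assumption: each window includes its start hour and excludes its end hour.
    """
    return "Zone2" if 3 <= now_utc.hour < 15 else "Zone1"


def paged_groups(alert_time_utc: datetime, minutes_unacked: int) -> List[str]:
    """List the groups that have been paged after the page stayed unacked."""
    zone = active_zone(alert_time_utc)
    other = "Zone1" if zone == "Zone2" else "Zone2"
    groups = [f"{zone} oncall duty"]
    if minutes_unacked >= ESCALATION_STEP_MIN:
        groups.append(f"rest of {zone}")
    if minutes_unacked >= 2 * ESCALATION_STEP_MIN:
        groups.append(f"all of {other}")
    return groups
```

With this sketch, a page raised at 04:00 UTC and left unacknowledged for 25 minutes would have reached the Zone2 oncall duty, the rest of Zone2, and finally Zone1.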

Create a set of alerting best practices and move the current alerts to follow it:

If it needs immediate attention -> page+task+email+irc
If it needs attention, but not immediate -> task+email+irc
If it does not need attention (essentially, for debugging/knowing the system status) -> irc or remove
Always create a runbook when you create an alert
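One way to check existing alerts against these rules is a small lint pass, sketched below in Python. The Alert structure, the Urgency names, and lint_alert are hypothetical and only mirror the rules listed above; they do not correspond to any existing tool. Note that for alerts needing no attention the rules also allow removing the alert entirely, which this sketch does not model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Urgency(Enum):
    IMMEDIATE = "needs immediate attention"
    NOT_IMMEDIATE = "needs attention, but not immediately"
    NONE = "does not need attention"


# Notification channels per urgency level, taken from the rules above.
CHANNELS = {
    Urgency.IMMEDIATE: frozenset({"page", "task", "email", "irc"}),
    Urgency.NOT_IMMEDIATE: frozenset({"task", "email", "irc"}),
    Urgency.NONE: frozenset({"irc"}),
}


@dataclass
class Alert:
    """Hypothetical description of an alert, for illustration only."""
    name: str
    urgency: Urgency
    channels: frozenset
    runbook: str = ""


def lint_alert(alert: Alert) -> List[str]:
    """Return the list of best-practice violations for one alert."""
    problems = []
    expected = CHANNELS[alert.urgency]
    if alert.channels != expected:
        problems.append(f"{alert.name}: channels should be {sorted(expected)}")
    if not alert.runbook:
        problems.append(f"{alert.name}: missing runbook")
    return problems
```

An alert that pages but is only informational, and that has no runbook, would be reported with two problems by lint_alert.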

Also create a common process for handling pages and alerts:

When a page happens:
- If you don't have a laptop around yet and can act on it -> acknowledge on splunk/victorops
- If you have a laptop around and can act on it -> silence at the source (for now; that will become alertmanager)
  - If the alert has source=icinga on alertmanager -> ack on icinga; click on the "X hours ago" text on alertmanager to see the link (the wikimedia.org one)
    [Screenshot: alertmanager_link_to_icinga]
  - Otherwise -> ack on alertmanager (the tick button, or a silence that starts with 'ACK! ')
- If you can't act on it, let it page other people
- Do anything that needs doing to get rid of the urgency
  - If you need help, ping the people you think might help; if not sure, ping another teammate
- Make sure it has an associated task and populate it with what you found/did and the next steps
- If the alert is not gone yet, ack the alert on alertmanager (not icinga) and attach the task id
- If you acked on victorops, resolve on victorops too (otherwise it will page again after 1d)

When an alert without a page happens:
- Make sure it has an associated task (this can be automated for alertmanager tasks)
- Ack on alertmanager (not icinga, only alertmanager) and add the task id to the ack comment

(This could be automated with a cookbook, something like cookbook wmcs.ack_page "subject of the page", or one that shows a list of active alerts to choose from.)
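The first decisions of the page-handling process can be sketched as two small Python helpers. These are illustrative only: the function names and the exact comment layout are assumptions (only the 'ACK! ' prefix and attaching the task id come from the process above), and nothing here calls alertmanager, icinga or victorops.

```python
def where_to_ack(can_act: bool, has_laptop: bool, source: str) -> str:
    """Decide where a page should be acknowledged, following the process above."""
    if not can_act:
        return "none"  # let it page other people
    if not has_laptop:
        return "victorops"  # remember to resolve it there later too
    # With a laptop, ack at the source: icinga-sourced alerts on icinga,
    # everything else on alertmanager.
    return "icinga" if source == "icinga" else "alertmanager"


def ack_comment(task_id: str, note: str = "") -> str:
    """Build a silence comment that starts with 'ACK! ' and carries the task id.

    The layout after the prefix is an assumption made for this sketch.
    """
    return f"ACK! {task_id} {note}".rstrip()
```

For example, ack_comment("T310598", "investigating") yields a comment with the prefix, the task id, and a short note, which a cookbook could then attach to the silence.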

Pros-cons

Pros:

This minimizes the interruptions for the rest of the team
This makes sure we invest in stabilizing/operationalizing the current systems
Every team member will have (limited) exposure to parts of the infrastructure they don't usually work on, increasing knowledge sharing
Lowers the total number of out-of-hours pages
Makes it clearer where to look and how to communicate in case of a page

Cons:

Given the current size of the team, we would sometimes be oncall half of the days (though compared to all days, as it is now, I think that's an improvement)
Sometimes (especially at the beginning) the person paged might not know how to handle the page

Option 2

Do nothing.

Pros:

No new effort is needed

Cons:

Nothing improves; in practice, things deteriorate (alert fatigue, false alerts, false pages, ...)

Related Objects

Mentioned In: T313444: Streamline WMCS Alerting and Paging

Event Timeline

dcaro created this task. Jun 14 2022, 11:05 AM

Restricted Application added a subscriber: Aklapper. Jun 14 2022, 11:05 AM

dcaro renamed this task from Decision request - Team oncall sechdules and process to Decision request - Team oncall scehdules and process. Jun 14 2022, 11:06 AM

dcaro renamed this task from Decision request - Team oncall scehdules and process to Decision request - Team oncall, alerting, scehdules and processes.

dcaro updated the task description.

Aklapper renamed this task from Decision request - Team oncall, alerting, scehdules and processes to Decision request - Team oncall, alerting, schedules and processes. Jun 14 2022, 11:37 AM

dcaro updated the task description. Jun 14 2022, 5:07 PM

dcaro updated the task description. Jun 14 2022, 5:11 PM

dcaro added subscribers: rook, Andrew, aborrero, • nskaggs. Jun 25 2022, 12:42 PM

This was approved in the team meeting of 29/06/2022, will create the record shortly

dcaro closed this task as Resolved. Jul 1 2022, 2:17 PM

• nskaggs mentioned this in T313444: Streamline WMCS Alerting and Paging. Jul 20 2022, 9:23 PM