Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so
Closed, Resolved (Public)

Description

This umbrella task encompasses a few things (possibly deserving of sub-tasks):

  • The SRE team cultivates a good understanding of what kinds and severities of outages merit being [manually] posted on Statuspage
  • The process for doing so is both well-understood (all of: training, documentation, and practice) and easy (boilerplate templates for common scenarios and maintenances, possibly also software tooling for pushbutton opening of incidents)

Re: boilerplate templates, here's a partial list of common scenarios for which we should be well-prepared:
This list is incomplete. You can help by expanding it.

  • Read-only intervals, both for brief primary database maintenance (generally from under 1 minute up to 5 minutes) and for longer windows like the regular datacenter switchover (~1 hr reserved, even if usually faster)
  • Outages where application servers are saturated or otherwise malfunctioning, which predominantly affect logged-in users and users making edits. Our CDN is relatively good at shielding anonymous users and other cacheable traffic from impact in this scenario.
    • Some flavors of issue will only affect certain wikis, or specific pages on specific wikis, or specific features. Other flavors will affect all logged-in users.
  • Outages where our CDN or our connectivity to parts of the Internet is impacted. This will affect users of any of our services, although depending on the nature of the outage it may only affect users with certain ISPs or from certain geographical areas. One complication that we should be prepared for is that outages of this type will not always be our 'fault', or even actionable on our part, so we probably need a couple of different flavors of user messaging.

For each of these scenarios, we should have a few sentences of standardized text ready and uploaded into the Incident templates feature on Statuspage.
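As a rough illustration of what that boilerplate could look like once it is written down, here is a minimal Python sketch that keeps one template per scenario and fills in the variable parts. Everything in it is an assumption made for illustration: the names `IncidentTemplate`, `TEMPLATES`, and `build_incident` are hypothetical, the template wording is placeholder text rather than agreed messaging, and the payload shape and status values are only assumed to resemble what Statuspage expects and should be checked against its documentation. Nothing is sent over the network.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class IncidentTemplate:
    """One piece of boilerplate for a common outage scenario (hypothetical type)."""
    name: str
    status: str      # assumed status values, e.g. "investigating" or "scheduled"
    body: Template   # text with $placeholders for the variable parts


# Placeholder wording only; real text would be agreed on by the team.
TEMPLATES = {
    "readonly": IncidentTemplate(
        name="Wikis are read-only for database maintenance",
        status="scheduled",
        body=Template(
            "Editing is disabled for about $duration while we perform "
            "database maintenance. Reading is unaffected."
        ),
    ),
    "appserver": IncidentTemplate(
        name="Errors or slowness for logged-in users and editors",
        status="investigating",
        body=Template(
            "Logged-in users and editors on $scope may see errors or slow "
            "responses. Anonymous readers should be largely unaffected."
        ),
    ),
    "connectivity": IncidentTemplate(
        name="Some users are unable to reach our sites",
        status="investigating",
        body=Template(
            "Users $scope may be unable to reach our sites. "
            "We are investigating the cause."
        ),
    ),
}


def build_incident(scenario: str, **fields: str) -> dict:
    """Fill in a scenario template and return an incident payload as a dict.

    Raises KeyError if the scenario is unknown or a placeholder is missing,
    so an incomplete message is never produced silently.
    """
    template = TEMPLATES[scenario]
    return {
        "incident": {
            "name": template.name,
            "status": template.status,
            "body": template.body.substitute(**fields),
        }
    }
```

For example, `build_incident("readonly", duration="5 minutes")` yields a payload whose body mentions the five-minute window. The connectivity scenario would likely need a second variant for outages that are outside our control, as noted above.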

Event Timeline

Peachey88 renamed this task from "SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so" to "Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so". (Jun 29 2021, 9:08 PM)
herron triaged this task as Medium priority. (Jul 1 2021, 5:29 PM)

As of yesterday, instructions have been shared with the SRE team and access has been granted. This task is either "done" or "in need of review of how we did in ~a quarter from now" depending on how you think about it.

@CDanis I think this is probably good to close; we can always reopen if there are concerns.