This umbrella task encompasses a few things (possibly deserving of sub-tasks):
- The SRE team cultivates a good understanding of what kinds and severities of outages merit being [manually] posted on Statuspage
- The process for doing so is both well-understood (all of: training, documentation, and practice) and easy (boilerplate templates for common scenarios and maintenances, possibly also software tooling for pushbutton opening of incidents)
Re: boilerplate templates, here's a partial list of common scenarios for which we should be well-prepared:
This list is incomplete. You can help by expanding it.
- Read-only intervals, both for brief primary database maintenances (generally from under 1 minute up to 5 minutes) and for longer intervals like the regular datacenter switchover (~1hr reserved, even if usually faster)
- Outages where application servers are saturated or otherwise malfunctioning, which predominantly affect logged-in users and users making edits. Our CDN is relatively good at shielding anonymous users and other cacheable traffic from impact in this scenario.
  - Some flavors of issue will only affect certain wikis, or specific pages on specific wikis, or specific features. Other flavors will affect all logged-in users.
- Outages where our CDN or our connectivity to parts of the Internet is impacted. This will affect users of any of our services, although depending on the nature of the outage it may only affect users with certain ISPs or in certain geographical areas. One complication that we should be prepared for is that outages of this type will not always be our 'fault', or even actionable on our part -- so we probably need a couple of different flavors of user messaging.
For each of these scenarios, we should have a few sentences of standardized text ready and uploaded into the Incident templates feature on Statuspage.
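As a starting point for the "pushbutton" tooling idea, here is a rough sketch of how the templates could also live in code and be turned into a Statuspage incident request. Everything in it is an assumption for discussion: the template keys and wording are placeholders rather than agreed text, `build_incident_request` is a hypothetical helper, and the endpoint and `OAuth` header reflect the Statuspage v1 REST API as we understand it and should be checked against the current API docs. The function only builds the request; it does not send anything.

```python
import json
import urllib.request

# Placeholder template texts (NOT agreed wording), one per scenario above.
TEMPLATES = {
    "read_only": {
        "name": "Wikis temporarily read-only",
        "body": "Wikis are read-only for up to {duration} during database "
                "maintenance. Reading is unaffected; editing will resume shortly.",
    },
    "appserver_degraded": {
        "name": "Errors for logged-in users and editors",
        "body": "Some logged-in users and editors may see errors or slow "
                "responses. We are investigating.",
    },
    "connectivity_external": {
        "name": "Connectivity problems for some users",
        "body": "Users of some ISPs or in some regions may be unable to reach "
                "our sites because of a network issue outside our control.",
    },
}

# Assumed base URL of the Statuspage v1 REST API.
API_BASE = "https://api.statuspage.io/v1"


def build_incident_request(page_id, api_key, template_key,
                           status="investigating", **fields):
    """Build (but do not send) a POST request that opens an incident."""
    template = TEMPLATES[template_key]
    payload = {
        "incident": {
            "name": template["name"],
            "status": status,
            # Fill placeholders such as {duration} from keyword arguments.
            "body": template["body"].format(**fields),
        }
    }
    return urllib.request.Request(
        url=f"{API_BASE}/pages/{page_id}/incidents",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"OAuth {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

An on-call SRE (or a small CLI wrapper) could then call something like `urllib.request.urlopen(build_incident_request(PAGE_ID, API_KEY, "read_only", duration="5 minutes"))`, where `PAGE_ID` and `API_KEY` are our own values. Keeping the template text in one place would also make it easy to review the wording separately from the tooling.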