Intro
Being paged for a maps-related incident is usually a painful experience. The infrastructure is at its capacity limits, SLOs don't exist, and the service's ownership is unclear, making it difficult or impossible to ask for help. There are no runbooks or cookbooks for it, and documentation exists but is sparse (see https://wikitech.wikimedia.org/wiki/Maps). Furthermore, alerts often flap between OK and CRITICAL without anyone taking action. Being paged late at night for it is causing morale to drop and feelings of exasperation. So, SREs want to switch off pages to their phones until some of the aforementioned issues are resolved.
Action Items
This is an umbrella task for tracking the items that need to be addressed before SRE can support the service at the same level again:
- Clear ownership of the maps service and infrastructure is established, communicated, and documented
- The service owner decides on the adoption of SLOs and sets them per their product management decisions (it is worth pointing out that an absence of SLOs means that the service cannot be supported by SRE)
- If SLOs are adopted, proper monitoring for those is established
- Sufficient capacity is added to the service to meet the SLOs set above
- Documentation is updated[1]
- Runbooks[2] are created for the most common issues that require a judgement call or are very difficult to automate
- Cookbooks[3] are created
[1] https://wikitech.wikimedia.org/wiki/Maps
[2] https://wikitech.wikimedia.org/wiki/Runbook
[3] https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks
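
As an illustration of what "proper monitoring" for an adopted SLO could evaluate, here is a minimal sketch of an availability check with an error budget. Everything in it is an assumption made for illustration: the 99.5% target, the request counts, and the names `SloReport` and `evaluate_slo` are hypothetical and do not describe the maps service or any existing SRE tooling. The actual targets are for the service owner to decide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    availability: float      # fraction of successful requests, 0.0 to 1.0
    budget_remaining: float  # fraction of the error budget left; negative if overspent
    slo_met: bool


def evaluate_slo(total_requests: int, failed_requests: int, target: float) -> SloReport:
    """Compare observed availability over a window against an SLO target.

    `target` is a hypothetical availability objective such as 0.995.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    availability = (total_requests - failed_requests) / total_requests
    # The error budget is the number of failures the target tolerates in this window.
    allowed_failures = (1.0 - target) * total_requests
    budget_remaining = 1.0 - failed_requests / allowed_failures
    return SloReport(availability, budget_remaining, availability >= target)


if __name__ == "__main__":
    # Made-up numbers: 1,000,000 requests in the window, 2,000 of them failed.
    report = evaluate_slo(1_000_000, 2_000, target=0.995)
    print(report)
```

A check of this shape would let paging be tied to how fast the error budget is being spent rather than to individual alerts that flap between OK and CRITICAL.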