[EPIC] Address maps level of support issues
Closed, Resolved (Public)

Description

Intro

Having to deal with a maps-related incident after being paged is usually a painful experience. The infrastructure is at its limits capacity-wise, SLOs don't exist, and the service's ownership is unclear, making it difficult or impossible to ask for help. There are no runbooks or cookbooks for it, and documentation exists but is sparse (see https://wikitech.wikimedia.org/wiki/Maps). Furthermore, alerts often flap between OK and CRITICAL without anyone taking any action. Being paged late at night for it is causing morale to drop and feelings of exasperation. So, SREs want to go forward with switching off pages to their phones until some of the aforementioned issues are resolved.

Action Items

This is an umbrella task for tracking the various items that need to be addressed before SRE can support the service at the same level again:

  • A clear ownership of the maps service and infrastructure is established, communicated and documented
  • The service owner decides on the adoption of SLOs and sets them per their product management decisions (it's worth pointing out that the absence of SLOs means that the service cannot be supported by SRE)
  • If SLOs are adopted, proper monitoring for those is established
  • Sufficient capacity is added to the service in order to be able to fulfill the SLOs set above
  • Documentation is updated[1]
  • Runbooks[2] for the most common issues that require a judgement call or are very difficult to automate are created
  • Cookbooks[3] are created

[1] https://wikitech.wikimedia.org/wiki/Maps
[2] https://wikitech.wikimedia.org/wiki/Runbook
[3] https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks
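
To make the SLO-related items above more concrete, here is a minimal Python sketch of how an availability SLO and its error budget could be evaluated. It is not tooling that exists for maps: the class and function names, the 99.5% target, and the request counts are all hypothetical and only illustrate the kind of number that SLO monitoring would need to track.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts observed over one SLO window (hypothetical data)."""
    total_requests: int
    failed_requests: int


def availability(window: SloWindow) -> float:
    """Fraction of requests that succeeded; an idle window counts as fully available."""
    if window.total_requests == 0:
        return 1.0
    return 1 - window.failed_requests / window.total_requests


def error_budget_remaining(window: SloWindow, target: float) -> float:
    """Fraction of the error budget left for the window.

    1.0 means no budget was spent, 0.0 means it is exhausted,
    and a negative value means the SLO target was missed.
    """
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    allowed_failures = (1 - target) * window.total_requests
    if allowed_failures == 0:
        # No traffic in the window, so no budget could have been spent.
        return 1.0
    return 1 - window.failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: one million tile requests, 2,500 of them failed.
    window = SloWindow(total_requests=1_000_000, failed_requests=2_500)
    print(f"availability: {availability(window):.4%}")
    print(f"budget left:  {error_budget_remaining(window, target=0.995):.1%}")
```

Alerting on how fast the remaining budget shrinks, rather than on every individual check failure, is one common way to avoid alerts that flap between OK and CRITICAL. Whether that approach fits maps is a decision for the service owner.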

Event Timeline

Some of the above items are optional but good to have (e.g. cookbooks, if no task is both done often and automatable).

I am not creating subtasks; I'll let whoever picks those up do so, as many are contingent on the clear ownership action item.

Change 639154 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] kartotherian: Don't page SREs on failure

https://gerrit.wikimedia.org/r/639154

Change 639154 merged by Alexandros Kosiaris:
[operations/puppet@production] kartotherian: Don't page SREs on failure

https://gerrit.wikimedia.org/r/639154

Thanks for raising this @akosiaris.
Given the Product Infrastructure team met with SRE and Platform Engineering yesterday, I will be putting together a communication on what the short-term and long-term goals of maps will be.

Just to be explicit, I am the product owner for maps moving forward and will make sure your points are addressed.

That's excellent news, @sdkim. Many thanks for this!

MSantos updated the task description.
MSantos added a subscriber: jijiki.

It has been a while since this task was opened, and we have done most of the items, but I'm not sure about the runbooks and cookbooks.

Recently, @Jgiannelos, @hnowlan, and @jijiki performed a bunch of maintenance tasks in maps. Is there anything to add to this task? Are we confident enough to close it?