[EPIC] Address maps level of support issues
Closed, Resolved · Public

Assigned To

Authored By

	akosiaris
	Nov 5 2020, 4:30 PM

Description

Intro

Dealing with a maps-related incident after being paged is usually a painful experience. The infrastructure is at its capacity limits, SLOs don't exist, and the service's ownership is unclear, making it difficult or impossible to ask for help. There are no runbooks or cookbooks for it, and documentation exists but is sparse (see https://wikitech.wikimedia.org/wiki/Maps). Furthermore, alerts often flap between OK and CRITICAL without anyone taking any course of action. Being paged late at night for it is causing morale to drop and feelings of exasperation. So, SREs want to go forward with switching off pages to their phones until some of the aforementioned issues are resolved.

Action Items

This is an umbrella task for tracking the various items that need to be addressed before SRE can support the service at the same level again:

Clear ownership of the maps service and infrastructure is established, communicated, and documented
The service owner decides on the adoption of SLOs and sets them per their product management decisions (it's worth pointing out that an absence of SLOs means that the service cannot be supported by SRE)
If SLOs are adopted, proper monitoring for those is established
Sufficient capacity is added to the service in order to be able to fulfill the SLOs set above
Documentation is updated[1]
Runbooks[2] for the most common issues that require a judgement call or are very difficult to automate are created
Cookbooks[3] are created

[1] https://wikitech.wikimedia.org/wiki/Maps
[2] https://wikitech.wikimedia.org/wiki/Runbook
[3] https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks
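
To make the SLO and monitoring items above more concrete, here is a minimal sketch of how an availability SLO and its error budget could be evaluated over a reporting window. It is an illustration only: this task does not set any SLO, so the 99% target, the request counts, and the names `SloReport` and `evaluate_slo` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float      # fraction of requests that succeeded
    budget_total: float      # failed requests the SLO tolerates in the window
    budget_remaining: float  # tolerated failures not yet spent (negative if overspent)
    slo_met: bool            # True if availability is at or above the target


def evaluate_slo(total_requests: int, failed_requests: int, target: float) -> SloReport:
    """Evaluate a request-based availability SLO for one window.

    `target` is the desired success fraction, e.g. 0.99 (a hypothetical value,
    not one taken from this task).
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in the interval (0, 1]")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    availability = (total_requests - failed_requests) / total_requests
    budget_total = (1.0 - target) * total_requests
    budget_remaining = budget_total - failed_requests
    return SloReport(
        availability=availability,
        budget_total=budget_total,
        budget_remaining=budget_remaining,
        slo_met=availability >= target,
    )


if __name__ == "__main__":
    # Hypothetical window: one million tile requests, four thousand failures.
    report = evaluate_slo(total_requests=1_000_000, failed_requests=4_000, target=0.99)
    print(report)
```

Alerting on how quickly the remaining budget is being spent, rather than on individual check failures, is one common way to avoid alerts that flap between OK and CRITICAL. Whether that fits maps is for the service owner to decide.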

Details

	Subject	Repo	Branch	Lines +/-
	kartotherian: Don't page SREs on failure	operations/puppet	production	+1 -1


Related Objects

Status	Assigned	Task
Resolved	ssastry	T263854 [Maps] Modernize Vector Tile Infrastructure
Resolved	MSantos	T267339 [EPIC] Address maps level of support issues
Resolved	MSantos	T269884 Empower maps support by providing better documentation

Event Timeline

akosiaris created this task. Nov 5 2020, 4:30 PM

Restricted Application added a subscriber: Aklapper. Nov 5 2020, 4:30 PM

Some of the above items are optional (e.g. cookbooks if nothing is done often and is automatable) but good to have.

I am not creating subtasks; I'll let whoever picks those up do so, as many are contingent on the clear ownership action item.

wkandek subscribed. Nov 5 2020, 7:26 PM

MSantos added a project: Product-Infrastructure-Team-Backlog-Deprecated. Nov 5 2020, 7:57 PM

MSantos added subscribers: sdkim, Jgiannelos.

TheDJ subscribed. Nov 6 2020, 8:50 AM

Change 639154 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] kartotherian: Don't page SREs on failure

https://gerrit.wikimedia.org/r/639154

Change 639154 merged by Alexandros Kosiaris:
[operations/puppet@production] kartotherian: Don't page SREs on failure

https://gerrit.wikimedia.org/r/639154

Maintenance_bot removed a project: Patch-For-Review. Nov 6 2020, 1:10 PM

Thanks for raising this @akosiaris.
Given the Product Infrastructure team met with SRE and Platform Engineering yesterday, I will be putting together a communication on what the short-term and long-term goals of maps will be.

Just to be explicit, I am the product owner for maps moving forward and will make sure your points are addressed.

sdkim moved this task from Needs triage to Needs investigation on the Product-Infrastructure-Team-Backlog-Deprecated board. Nov 6 2020, 2:39 PM

That's excellent news, @sdkim. Many thanks for this!

Ainali subscribed. Nov 6 2020, 10:02 PM

Abbe98 subscribed. Nov 6 2020, 10:48 PM

mxn subscribed. Nov 8 2020, 4:17 PM

Nemo_bis subscribed. Nov 17 2020, 6:42 PM

edwardbetts subscribed. Nov 21 2020, 10:36 AM

sdkim added a parent task: T263854: [Maps] Modernize Vector Tile Infrastructure. Dec 8 2020, 3:06 PM

Aklapper mentioned this in T275063: Maps - rearchitecting of Maps Stack to improve stability and reliability. Feb 17 2021, 11:07 PM

MSantos claimed this task. Feb 25 2021, 4:07 PM

MSantos added a subscriber: hnowlan.

MSantos triaged this task as High priority. Sep 9 2021, 1:44 PM

MSantos moved this task from Needs investigation to Upcoming on the Product-Infrastructure-Team-Backlog-Deprecated board.

Jgiannelos closed subtask T269884: Empower maps support by providing better documentation as Resolved. Mar 22 2022, 10:48 AM

Jgiannelos moved this task from Upcoming to Kanban on the Product-Infrastructure-Team-Backlog-Deprecated board. Apr 29 2022, 11:18 AM

Jgiannelos edited projects, added Product-Infrastructure-Team-Backlog-Deprecated (Kanban); removed Product-Infrastructure-Team-Backlog-Deprecated.

Jgiannelos moved this task from To Do to Doing on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.

It has been a while since this task was created, and we have done most of the items, but I'm not sure about the runbooks and cookbooks.

Recently, @Jgiannelos, @hnowlan, and @jijiki performed a bunch of maintenance tasks in maps. Is there anything to add to this task? Are we confident enough to close it?

MSantos closed this task as Resolved. Oct 19 2023, 7:08 PM
