[Maps] Modernize Vector Tile Infrastructure
Closed, Resolved | Public | 1 Estimated Story Points

Description

Problem

Wikimedia Maps has had the most outages of any service at the Foundation to date. When those outages occur, there is no defined performance metric to indicate success or service degradation; our current monitoring capabilities for maps are poor and unhelpful; and the stack is complex and hard to understand, which makes it difficult to support and maintain.

Hypothesis

We believe that modernizing the maps infrastructure will reduce complexity, enable monitoring capabilities, and better empower SRE to resolve issues quickly and intuitively.

The overall rationale behind the following phased approach is to make atomic changes without breaking the current functionality and with minimal disruption. We also want to allow the changes to be evaluated and feedback to be provided along the way. The plan is to approach this modernization iteratively, starting with replacing our current vector tile server, tilerator, with the open-source vector tile server Tegola. We hope that moving away from server-side raster rendering to client-side rendering reduces the dependency on SRE and allows this team to be autonomous when it comes to supporting and maintaining the maps stack.

By modernizing our maps infrastructure, we empower SREs to support maps-related incidents and maintenance by:

  • Moving from static allocation of services on bare metal to services in Kubernetes
  • Reducing the complexity of the infrastructure by removing legacy/deprecated dependencies
  • Using technologies in which our SREs have a lot of expertise

Outcomes

Wikimedia users will have a reliable and consistent experience contributing to and learning about geo-information
  • Maintenance effort on vector-tile-related outages is reduced to 10% of a full-time engineer's time per quarter
SREs will be empowered to maintain and support maps-related incidents without previous experience
  • Performance metrics can be monitored through Prometheus, with alerting thresholds as defined by the SLOs
  • SLOs are defined and agreed upon by the Product Infrastructure & SRE teams
  • Maps documentation can be read and clearly understood by an SRE, giving an overview and actionable remedies for handling problems

Related Objects

Status    Subtype  Assigned    Task
Resolved           ssastry
Declined           hnowlan
Resolved  Spike    Jgiannelos
Resolved  Spike    Jgiannelos
Resolved  Spike    Jgiannelos
Resolved           MSantos
Resolved           MSantos
Resolved           MSantos
Resolved           MSantos
Resolved           hnowlan
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           hnowlan
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           MSantos
Resolved           None
Resolved           Jgiannelos
Resolved           MSantos
Resolved           MSantos
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           MSantos
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           MSantos
Resolved           MSantos
Resolved           hnowlan
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           Jgiannelos
Resolved           jijiki
Resolved           ssastry

Event Timeline

Naike set the point value for this task to 1.
Naike removed the point value for this task. (Dec 7 2020, 6:42 PM)
Naike set the point value for this task to 1.
sdkim renamed this task from "[Maps] Improve Service Consistency & Reduce Maintenance Cost" to "[Maps] Modernize Vector Tile Infrastructure". (Jan 11 2021, 9:37 PM)
sdkim claimed this task.
sdkim updated the task description.
sdkim added a subscriber: hnowlan.

Change 776278 had a related patch set uploaded (by Awight; author: Awight):

[mediawiki/services/kartotherian@master] Remove a reference to Tilerator

https://gerrit.wikimedia.org/r/776278

Change 776278 abandoned by WMDE-Fisch:

[mediawiki/services/kartotherian@master] Remove a reference to Tilerator

Reason:

This patch seems to take care of the issue: If62089bfe77f6cae1083017d53b5cffcd0952c5b

https://gerrit.wikimedia.org/r/776278

For future reference, I've removed the client-side rendering tasks from this epic since they were dropped from the scope a long time ago. With that, we can finally close this task.