
[Maps] Modernize Vector Tile Infrastructure
Open, Needs Triage | Public | 1 Estimated Story Points

Description

Problem

Wikimedia Maps has had the most outages of any service at the Foundation to date. When those outages occur, there is no defined performance metric to indicate success or service degradation; our current monitoring capabilities for maps are poor and unhelpful; and the stack is complex and hard to understand, which makes it difficult to support and maintain.

Hypothesis

We believe that modernizing the maps infrastructure will reduce complexity, enable monitoring capabilities, and better empower SRE to resolve issues quickly and intuitively.

The overall rationale behind the following phased approach is to make atomic changes without breaking current functionality and with minimal disruption. We also want to allow the changes to be evaluated and feedback to be provided along the way. The plan is to approach this modernization iteratively, starting by replacing our current vector tile server, tilerator, with the open-source vector tile server Tegola. We hope that moving from server-side raster rendering to client-side rendering reduces the dependency on SRE and allows this team to be autonomous in supporting and maintaining the maps stack.

By modernizing our maps infrastructure, we empower SREs to support maps-related incidents and maintenance by:

  • Moving from statically allocating services on bare metal to running services in Kubernetes
  • Reducing the complexity of the infrastructure by removing legacy/deprecated dependencies
  • Using technologies in which our SREs have a lot of expertise

Outcomes

Wikimedia users will have a reliable and consistent experience contributing to and learning about geo-information
  • Maintenance effort on vector-tile-related outages is reduced to 10% of a full-time engineer's time per quarter
SREs will be empowered to maintain and support maps-related incidents without previous experience
  • Performance metrics can be monitored through Prometheus, with alerting thresholds as defined by the SLOs
  • SLOs are defined and agreed upon by the Product Infrastructure & SRE teams
  • Maps documentation can be read and clearly understood by an SRE, giving an overview and actionable remedies for handling problems
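
As a rough illustration of how an agreed SLO could translate into an alerting threshold, the sketch below computes an error budget and a burn rate from request counts. The 99.5% target, the request counts, and the function names are hypothetical and are not values agreed by the teams; in practice the counts would come from Prometheus metrics.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests that may fail while still meeting the SLO."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 means the budget is being consumed exactly as fast
    as the SLO allows; higher values mean it is being consumed faster.
    """
    if total == 0:
        return 0.0  # no traffic in the window, nothing to alert on
    return (failed / total) / error_budget(slo_target)


def should_alert(failed: int, total: int, slo_target: float,
                 threshold: float = 1.0) -> bool:
    """Alert when the burn rate exceeds the chosen threshold."""
    return burn_rate(failed, total, slo_target) > threshold


# Hypothetical SLO: 99.5% of tile requests succeed.
SLO_TARGET = 0.995

if __name__ == "__main__":
    # Hypothetical window: 12 failed requests out of 1000.
    print(should_alert(failed=12, total=1000, slo_target=SLO_TARGET))
```

Expressing the threshold as a burn rate rather than a raw error count keeps the alert tied to the SLO itself, so changing the target does not require retuning the alert by hand.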

Related Objects

Status | Subtype | Assigned | Task
Open | | ssastry |
Open | | hnowlan |
Resolved | Spike | Jgiannelos |
Resolved | Spike | Jgiannelos |
Resolved | Spike | Jgiannelos |
Open | | MSantos |
Open | | MSantos |
Open | | None |
Resolved | | MSantos |
Resolved | | hnowlan |
Resolved | | Jgiannelos |
Resolved | | Jgiannelos |
Resolved | | Jgiannelos |
Open | | Jgiannelos |
Resolved | | Jgiannelos |
Open | | MSantos |
Resolved | | MSantos |
Resolved | | MSantos |
Resolved | | MSantos |
Open | | Gpeterson |
Open | | Gpeterson |
Open | | Gpeterson |
Open | | None |
Open | | hnowlan |
Resolved | | Jgiannelos |
Open | | None |
Resolved | | MSantos |
Open | | None |
Open | | Jgiannelos |
Open | | MSantos |
Resolved | | MSantos |
Resolved | | Jgiannelos |
Open | | None |
Open | | Jgiannelos |
Open | | MSantos |
Open | | None |
Resolved | | Jgiannelos |
Open | | None |

Event Timeline

Naike set the point value for this task to 1.
Naike removed the point value for this task. Dec 7 2020, 6:42 PM
Naike set the point value for this task to 1.
sdkim moved this task from Next to Now on the Product Infrastructure Roadmap board.
sdkim renamed this task from [Maps] Improve Service Consistency & Reduce Maintenance Cost to [Maps] Modernize Vector Tile Infrastructure. Jan 11 2021, 9:37 PM
sdkim claimed this task.
sdkim updated the task description.
sdkim added a subscriber: hnowlan.