Problem
Wikimedia Maps has had more outages than any other service at the Foundation to date. When those outages occur, there is no defined performance metric to indicate success or service degradation; our current monitoring capabilities for maps are poor and unhelpful; and the stack is complex and hard to understand, which makes it difficult to support and maintain.
Hypothesis
We believe that modernizing the maps infrastructure will reduce complexity, enable monitoring capabilities, and better empower SRE to resolve issues quickly and intuitively.
The rationale behind the following phased approach is to make atomic changes without breaking current functionality and with minimal disruption. We also want to allow the changes to be evaluated and feedback to be provided along the way. The plan is to approach this modernization iteratively, starting by replacing our current vector tile server, Tilerator, with the open-source vector tile server Tegola. We hope that moving from server-side raster rendering to client-side rendering will reduce the dependency on SRE and allow this team to support and maintain the maps stack autonomously.
By modernizing our maps infrastructure, we empower SREs to support maps-related incidents and maintenance by:
- Moving from statically allocated services on bare metal to services in Kubernetes
- Reducing the complexity of the infrastructure by removing legacy/deprecated dependencies
- Using technologies in which our SREs have a lot of expertise
Outcomes
Wikimedia users will have a reliable and consistent experience contributing to and learning about geo-information
- Maintenance effort on vector-tile-related outages drops to 10% of a full-time engineer's time per quarter
SREs will be empowered to maintain the maps stack and handle maps-related incidents without previous experience
- Performance metrics can be monitored through Prometheus, with alerting thresholds defined by the SLOs
- SLOs are defined and agreed upon by the Product Infrastructure & SRE teams
- Maps documentation is clear enough for an SRE to read and understand, giving an overview of the stack and actionable remedies for handling problems
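
To illustrate how an SLO could translate into an alerting threshold, here is a minimal sketch of an error-budget check. Everything in it is an assumption made for illustration: the 99.9% availability target, the request counts, the 25% budget threshold, and the function names are hypothetical, since the actual SLOs are still to be agreed by the Product Infrastructure and SRE teams. In practice a rule like this would presumably be expressed in Prometheus alerting configuration rather than in Python.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is the agreed success ratio, e.g. 0.999 for 99.9% (hypothetical value).
    """
    if total_requests == 0:
        # No traffic in the window, so none of the budget has been spent.
        return 1.0
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget: any failure overspends it.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def should_alert(slo_target: float, total_requests: int, failed_requests: int,
                 min_budget: float = 0.25) -> bool:
    """Alert once less than min_budget of the error budget is left (hypothetical threshold)."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) < min_budget


# Hypothetical window: 1,000,000 tile requests against a 99.9% target,
# which allows roughly 1,000 failed requests.
print(should_alert(0.999, 1_000_000, 600))  # about 40% of the budget left
print(should_alert(0.999, 1_000_000, 900))  # about 10% of the budget left
```

Alerting on the remaining budget rather than on raw error counts keeps the threshold tied to whatever target the teams agree on, so changing the SLO changes the alert without further tuning.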