
Design and implement SLO Dashboard tooling
Closed, Resolved (Public)

Description

The observability team has been working with the SLO work group and has determined that it is time to implement some formal tooling. The plan is to establish either a template or a dashboard config generator that eases the adoption of SLOs within technology and product teams at WMF.

This ticket serves as the main task to wrap efforts around SLO Dashboard tooling. The main requirement is that we leverage the current stack and build on existing tools; Grafana + Prometheus is probably the best place to start. In addition to this we would want to be able to:

  • Define time periods over which to observe SLOs (days/weeks/quarters/years)
  • Offer multiple visualization options (gauge/burn-down/histogram/line chart), depending on the nature of the metric
  • Define standards (format and naming conventions for SLI metrics)
  • Make adoption easy (dashboard config/markup generation)
  • List dashboards in a central place, so that they are easy to find and operationally valuable (also useful for troubleshooting)
  • Provide reporting and alerting that lets the SLO owner know if thresholds are going to be breached within a defined time period
  • Explicitly list and state the SLIs and calculations used to produce an SLO target
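
To make the "thresholds are going to be breached within a defined time period" requirement concrete, here is a minimal sketch of one common way to do the arithmetic (error budget and burn rate). It is an illustration only, not the calculation the team settled on: the `SloWindow` type, its field names, and the assumption that traffic and error ratio stay constant for the rest of the window are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SloWindow:
    """Hypothetical container for one SLO and its SLI counts so far."""
    target: float        # e.g. 0.999 means 99.9% of events must be good
    window_days: float   # length of the SLO reporting period
    elapsed_days: float  # how much of the period has already passed
    total_events: int    # all events observed so far
    bad_events: int      # failed events observed so far


def burn_rate(w: SloWindow) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    A burn rate of 1.0 spends the error budget exactly at the end of
    the window; anything above 1.0 spends it early.
    """
    if not 0.0 < w.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if w.total_events == 0:
        return 0.0
    allowed_error_ratio = 1.0 - w.target
    return (w.bad_events / w.total_events) / allowed_error_ratio


def budget_consumed(w: SloWindow) -> float:
    """Fraction of the whole window's budget spent so far.

    Assumes roughly uniform traffic across the window.
    """
    return burn_rate(w) * w.elapsed_days / w.window_days


def days_until_budget_exhausted(w: SloWindow) -> Optional[float]:
    """Days from now until the budget runs out at the current burn rate.

    Returns None when no errors have been observed (never exhausted).
    """
    rate = burn_rate(w)
    if rate <= 0.0:
        return None
    return max(0.0, w.window_days / rate - w.elapsed_days)


def will_breach(w: SloWindow) -> bool:
    """True if the budget would run out before the window ends."""
    return burn_rate(w) > 1.0


if __name__ == "__main__":
    # Made-up numbers: a quarterly 99.9% SLO, one month in.
    example = SloWindow(target=0.999, window_days=90, elapsed_days=30,
                        total_events=1_000_000, bad_events=2_000)
    print("burn rate:", burn_rate(example))
    print("budget consumed:", budget_consumed(example))
    print("days until exhausted:", days_until_budget_exhausted(example))
    print("will breach:", will_breach(example))
```

Keeping the SLI counts, the target, and the derived figures in one place like this also serves the last requirement: the calculation behind each SLO target is stated explicitly rather than buried in a dashboard query.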

Event Timeline

Mentioning (but not yet linking) some pre-existing SLO tasks: T258754, T254916, T256629, T263792

This task has not received the updates it deserved, but the work has been done (with the exception of alerting, explained below) through the deployment of Grafana Grizzly (https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Grizzly) and the Grafonnet SLO dashboard templates that it deploys.

While we've laid the groundwork to support alerting by centralizing the SLO metric queries, targets, etc. in jsonnet, I think we should track the process of defining and enabling SLO monitoring/alerting as its own related task.
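
As a rough illustration of why centralizing the queries and targets helps with alerting, the sketch below derives both a dashboard panel and an alert rule from a single SLO definition. It is written in Python rather than jsonnet and does not reflect the actual templates: the `SloDefinition` class, the metric names in the queries, and the simplified dictionary shapes are all assumptions made up for this example, not the Grafana or Prometheus schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloDefinition:
    """Hypothetical single source of truth for one SLO."""
    name: str
    good_query: str    # query counting good events
    total_query: str   # query counting all events
    target: float      # e.g. 0.999

    def sli_expr(self) -> str:
        """Ratio expression shared by the dashboard and the alert."""
        return f"({self.good_query}) / ({self.total_query})"

    def dashboard_panel(self) -> dict:
        """Simplified panel description (not the real Grafana schema)."""
        return {
            "title": f"{self.name} SLI",
            "type": "gauge",
            "expr": self.sli_expr(),
            "threshold": self.target,
        }

    def alert_rule(self) -> dict:
        """Simplified alert description built from the same expression."""
        return {
            "alert": f"{self.name}BelowTarget",
            "expr": f"{self.sli_expr()} < {self.target}",
        }


# Hypothetical metric names, for illustration only.
example_slo = SloDefinition(
    name="ExampleService",
    good_query='sum(rate(example_requests_total{code!~"5.."}[30d]))',
    total_query="sum(rate(example_requests_total[30d]))",
    target=0.999,
)

panel = example_slo.dashboard_panel()
rule = example_slo.alert_rule()
```

Because the panel and the rule are generated from the same expression and target, a change to either is picked up by both, so the dashboard and the alert cannot drift apart.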