The observability team has been working with the SLO work group and determined that it's time to implement some formal tooling. The plan is to establish either a template or a dashboard config generator that would ease the adoption of SLOs within technology and product teams at WMF.
This ticket serves as the main task to wrap efforts around SLO Dashboard tooling. Overall, the main requirements are that we leverage the current stack and build on existing tools; Grafana + Prometheus is probably the best place to start. In addition to this we would want to be able to:
- Define time periods to observe SLOs (days/weeks/quarters/years)
- Have multiple visualization options (Gauge/Burn Down/Histogram/Line Chart) depending on the nature of the metric
- Define standards (format and naming conventions for SLI metrics)
- Easy to adopt (dashboard config/markup generation)
- List dashboards in a central place, make them easy to find, and make them operationally valuable (also good for troubleshooting)
- Provide reporting and alerting to let the SLO owner know if thresholds are going to be breached within a defined time period.
- Be able to explicitly list and state SLIs and calculations made to produce an SLO target.
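
To make the requirements above more concrete, here is a minimal sketch of what a dashboard config generator could look like. It is an illustration under assumptions, not a proposed implementation: the `SLO` class, the `WINDOWS` mapping, the helper functions, and the metric names `sli_requests_good_total` / `sli_requests_total` are all hypothetical, and the dashboard dictionary only uses a small subset of Grafana's dashboard JSON fields. It assumes a simple ratio SLI (good events / total events) and a naive linear projection for the "will the threshold be breached" check.

```python
import json
from dataclasses import dataclass

# Hypothetical window names mapped to Prometheus range durations.
WINDOWS = {"day": "1d", "week": "7d", "quarter": "90d", "year": "365d"}


@dataclass(frozen=True)
class SLO:
    """Explicit statement of an SLO: its SLI metrics, target and window."""
    service: str
    good_metric: str   # counter of good events (naming is an assumption)
    total_metric: str  # counter of all events (naming is an assumption)
    target: float      # e.g. 0.999 for "three nines"
    window: str        # one of the keys in WINDOWS

    def sli_expr(self) -> str:
        """PromQL for the ratio SLI over the SLO window."""
        rng = WINDOWS[self.window]
        return (f"sum(increase({self.good_metric}[{rng}])) / "
                f"sum(increase({self.total_metric}[{rng}]))")


def error_budget_remaining(slo: SLO, good: float, total: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, below 0 is breached."""
    if total == 0:
        return 1.0
    allowed_bad = (1 - slo.target) * total
    return 1 - (total - good) / allowed_bad


def projected_budget_at_window_end(remaining: float, elapsed_fraction: float) -> float:
    """Linearly extrapolate budget use to the end of the window.

    A negative result suggests the SLO would be breached if the current
    consumption rate continues.
    """
    if not 0 < elapsed_fraction <= 1:
        raise ValueError("elapsed_fraction must be in (0, 1]")
    consumed = 1 - remaining
    return 1 - consumed / elapsed_fraction


def dashboard(slo: SLO) -> dict:
    """Build a minimal Grafana-style dashboard dict for one SLO."""
    expr = slo.sli_expr()
    return {
        "title": f"SLO: {slo.service} ({slo.window})",
        "tags": ["slo"],  # a shared tag lets dashboards be listed in one place
        "panels": [
            {"type": "gauge", "title": "SLI vs target",
             "targets": [{"refId": "A", "expr": expr}]},
            {"type": "timeseries", "title": "SLI over time",
             "targets": [{"refId": "A", "expr": expr}]},
        ],
    }


if __name__ == "__main__":
    example = SLO("example-service", "sli_requests_good_total",
                  "sli_requests_total", 0.999, "quarter")
    print(json.dumps(dashboard(example), indent=2))
```

Keeping the SLO definition as plain data is one way to cover several of the points at once: the window and target are explicit, the SLI calculation is visible as a PromQL expression, and the same definition could drive both the dashboard markup and an alert on the projected budget.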