Today we manage SLO dashboards using an in-house Jsonnet template, which is rendered and deployed to Grafana using grafana-grizzly. Since we established this process, Pyrra (https://pyrra.dev), a self-contained SLO management tool, has seen active development. It offers several benefits, including dedicated SLO visualization, search, labeling, automated recording rules, and integrated multi-burn-rate alerting, among others.
This task initially served as a placeholder to explore Pyrra (hence the patch history) and is being expanded into a tracking task for Pyrra deployment and integration.
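For context on the multi-burn-rate alerting mentioned above, here is a minimal Python sketch of the underlying arithmetic. The function names, the 99.9% target, and the 14.4 factor (the fast-burn threshold suggested in the Google SRE Workbook for a 30-day window) are illustrative assumptions, not Pyrra's actual configuration; Pyrra generates its own windows and factors.

```python
def burn_rate(error_ratio: float, target: float) -> float:
    """Return how fast the error budget is consumed relative to the allowed rate.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return error_ratio / (1.0 - target)


def should_alert(long_error_ratio: float, short_error_ratio: float,
                 target: float, factor: float) -> bool:
    """Fire only if both the long and the short window exceed the factor.

    Requiring the short window as well lets the alert reset quickly
    once the error rate has recovered.
    """
    return (burn_rate(long_error_ratio, target) >= factor
            and burn_rate(short_error_ratio, target) >= factor)


if __name__ == "__main__":
    # Hypothetical numbers: 2% errors over the long window and 3% over the
    # short window, measured against a 99.9% target.
    print(should_alert(0.02, 0.03, target=0.999, factor=14.4))
```

Several such (long window, short window, factor) combinations are typically evaluated side by side, so that fast burns page quickly while slow burns produce lower-urgency alerts.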
High-level checklist, in rough order:
- Pyrra Debian package
- Service puppetization (pyrra-api, pyrra-filesystem)
- Deploy Pyrra pilot instance https://pyrra.wikimedia.org
- Identify path for configured duration vs. quick-view dashboard durations: https://github.com/pyrra-dev/pyrra/issues/952
  - Until that's addressed upstream, we'll work around it by adjusting the Pyrra URI query string to extend the duration and by building dashboards from the SLO overview metrics exported by Pyrra
- Puppetize Pyrra SLO configs
- Onboard pilot SLO(s)
  - logstash-requests
  - varnish-requests
- Identify recording rule backfill process for new SLOs (T349521: Prometheus/Pyrra: establish backfill process for recording rules)
- Onboard existing SLOs
  - etcd
  - haproxy
  - trafficserver
  - WDQS
  - linkrecommendation
  - Lift Wing
  - logstash-availability
  - ORES
- Documentation
- Tests
- Enable alerting
- Monitoring, alerting & potential SLO for Pyrra itself
- T351111: Add footer including privacy policy to slo.wikimedia.org (pyrra)
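
As a rough illustration of the dashboard workaround in the checklist above, the following Python sketch shows the arithmetic such a dashboard panel would perform. It assumes the SLO overview metrics provide an availability ratio and an objective ratio for the SLO window; the metric names are not stated in this task, so the hypothetical function below takes plain numbers.

```python
def error_budget_remaining(availability: float, objective: float) -> float:
    """Return the fraction of the error budget that is still unspent.

    1.0 means the budget is untouched, 0.0 means it is fully spent,
    and a negative value means the SLO has been violated.
    """
    if not 0.0 < objective < 1.0:
        raise ValueError("objective must be strictly between 0 and 1")
    return (availability - objective) / (1.0 - objective)


if __name__ == "__main__":
    # Hypothetical values: 99.95% measured availability against a 99.9% objective.
    remaining = error_budget_remaining(0.9995, 0.999)
    print(f"{remaining:.1%} of the error budget remains")
```

The same expression can be written directly as a Prometheus query in a dashboard panel once the actual metric names are confirmed.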