Today we manage SLO dashboards using an in-house jsonnet template which is rendered and deployed to grafana using grafana-grizzly. Since establishing this process, a self contained SLO management tool Pyrra (https://pyrra.dev) has seen much active development and offers several benefits including improved (dedicated) SLO visualization, search, labeling, automation of recording rules, integrated multi burn alerting, and more.
This task initially served as a placeholder to explore this (hence the patch history) and is being expanded to serve as a tracking task for Pyrra deployment and integration.
High level checklist, in rough order:
- Pyrra debian package
- Service puppetization (pyrra-api, pyrra-filesystem)
- Deploy pyrra pilot instance https://pyrra.wikimedia.org
- Identify best path for configured duration vs quick view dashboard durations https://github.com/pyrra-dev/pyrra/issues/952
- Puppetize Pyrra SLO configs
- Onboard pilot SLO(s)
- logstash-requests
- varnish-requests (tentative)
- lift wing (tentative)
- Identify recording rule backfill process for new SLOs T349521: Prometheus/Pyrra: establish backfill process for recording rules
- Onboard existing SLOs fully
- Documentation
- Tests
- Enabling alerting
- Monitoring, alerting & potential SLO for pyrra itself
- T351111: Add footer including privacy policy to slo.wikimedia.org (pyrra)