As a user and a maintainer of WDQS, I want an expectation of service availability so that I know when issues can/should be resolved.
The WDQS uptime SLO will be based on periodically running a set of representative, non-cached test queries against the WDQS cluster as a whole (rather than per host) and comparing their run time against a baseline expectation. If the test time exceeds a TBD threshold, WDQS will be considered down and in need of maintenance. This should approximate actual service availability for users.
Example: test queries are run hourly and should take no more than 200% of the baseline time (<200ms if the baseline is 100ms). The goal is for WDQS to count as up by this measure 95% of the time.
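To make the example concrete, here is a minimal sketch of how the uptime figure could be derived from the periodic test runs. It is illustrative only: the names ProbeResult, is_up, uptime_fraction and meets_slo are hypothetical, and the 200% ratio and 95% target are the example values above, not the final (TBD) threshold.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProbeResult:
    """One periodic run of the representative test queries (hypothetical type)."""
    duration_ms: float


def is_up(result: ProbeResult, baseline_ms: float, threshold_ratio: float = 2.0) -> bool:
    # Assumption: a run counts as "up" when it finishes in under
    # threshold_ratio * baseline (e.g. <200ms for a 100ms baseline).
    return result.duration_ms < baseline_ms * threshold_ratio


def uptime_fraction(results: Sequence[ProbeResult], baseline_ms: float,
                    threshold_ratio: float = 2.0) -> float:
    # Fraction of test runs in the window that counted as "up".
    if not results:
        raise ValueError("no test runs in the window")
    up = sum(1 for r in results if is_up(r, baseline_ms, threshold_ratio))
    return up / len(results)


def meets_slo(results: Sequence[ProbeResult], baseline_ms: float,
              target: float = 0.95, threshold_ratio: float = 2.0) -> bool:
    # Assumption: the SLO is met when the "up" fraction reaches the target.
    return uptime_fraction(results, baseline_ms, threshold_ratio) >= target
```

With hourly runs over a 90-day window this gives roughly 2160 samples, so a 95% target would tolerate about 108 slow runs before the SLO is missed.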
Sub-tasks:
- Decided on SLO => trafficserver_backend metrics
- WDQS SLO comms have been sent out
- Implemented trafficserver metrics to see SLO performance
AC:
- SLO for WDQS uptime is established
- SLO for WDQS uptime is viewable on WDQS dashboard => https://grafana.wikimedia.org/d/l-3CMlN4z/wdqs-uptime-slo?orgId=1&from=now-90d&to=now <-> https://grafana.wikimedia.org/d/slo-WDQS/wdqs-slo-s?orgId=1
- Any necessary changes to alerting (reducing paging etc if necessary) have been made =>
- New service expectations are socialized with broader SRE team