We use Grizzly to create org-wide SLO dashboards.
AC:
- Dashboard for WDQS uptime: https://grafana.wikimedia.org/d/slo-wdqs-tmpl/wdqs-slos-grizzly-template?orgId=1
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | RKemper | T313751 Create WDQS uptime SLO
Resolved | | RKemper | T323064 Create WDQS Uptime SLO dashboard in Grizzly
Resolved | | RKemper | T328306 Thanos rule evaluation alerts for service_slis
Change 862178 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/grafana-grizzly@master] [WIP] add grizzly dashboard for WDQS uptime
Change 862178 merged by Ryan Kemper:
[operations/grafana-grizzly@master] add grizzly dashboard for WDQS uptime
Mentioned in SAL (#wikimedia-operations) [2022-12-08T20:17:23Z] <ryankemper> T323064 Merged https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/862178 and deployed new dashboard, visible here: https://grafana.wikimedia.org/d/slo-wdqs-tmpl/wdqs-slos-grizzly-template?orgId=1
ryankemper@grafana1002:/srv/grafana-grizzly$ grr diff slo_dashboards.jsonnet
Dashboard/slo-logstash-tmpl no differences
Dashboard/slo-trafficserver-tmpl no differences
Dashboard/slo-varnish-tmpl no differences
Dashboard/slo-wdqs-tmpl not present in Dashboard
Dashboard/slo-apigw no differences
Dashboard/slo-etcd-tmpl no differences
Dashboard/slo-haproxy-tmpl no differences
ryankemper@grafana1002:/srv/grafana-grizzly$ grr apply slo_dashboards.jsonnet
Dashboard/slo-logstash-tmpl no differences
Dashboard/slo-trafficserver-tmpl no differences
Dashboard/slo-varnish-tmpl no differences
Dashboard/slo-wdqs-tmpl added
Dashboard/slo-apigw no differences
Dashboard/slo-etcd-tmpl no differences
Dashboard/slo-haproxy-tmpl no differences
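For anyone wanting to sanity-check a deploy before running `grr apply`, here is a rough Python sketch that reads the one-line-per-resource status output shown in the `grr diff` run above and lists only the resources that would change. It assumes the output always has the "resource, space, status" shape seen in this task; other Grizzly versions may print something different. The helper names `parse_grr_status` and `pending_changes` are made up for illustration and are not part of Grizzly.

```python
# Sample text copied from the `grr diff` run in this task.
SAMPLE_DIFF_OUTPUT = """\
Dashboard/slo-logstash-tmpl no differences
Dashboard/slo-trafficserver-tmpl no differences
Dashboard/slo-varnish-tmpl no differences
Dashboard/slo-wdqs-tmpl not present in Dashboard
Dashboard/slo-apigw no differences
Dashboard/slo-etcd-tmpl no differences
Dashboard/slo-haproxy-tmpl no differences
"""


def parse_grr_status(output):
    """Map each resource name to the status text that follows it.

    Assumption: every non-empty line is "<Kind>/<uid> <status text>".
    """
    statuses = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first space only; the status itself contains spaces.
        resource, _, status = line.partition(" ")
        statuses[resource] = status
    return statuses


def pending_changes(output):
    """Return the resources whose status is anything but 'no differences'."""
    return [
        resource
        for resource, status in parse_grr_status(output).items()
        if status != "no differences"
    ]


if __name__ == "__main__":
    for resource in pending_changes(SAMPLE_DIFF_OUTPUT):
        print(resource)
```

Run against the diff output from this task, the only resource reported is `Dashboard/slo-wdqs-tmpl`, which matches the single "added" line in the apply run.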
Looking at https://grafana.wikimedia.org/d/slo-wdqs-tmpl/wdqs-slos-grizzly-template?orgId=1&var-datasource=thanos&var-site=All&var-cluster=All, I only see data for "Request Error Budget Remaining". The other graphs don't seem to be timing out, so I suspect there is an issue with the queries. I have not investigated further.
Moving this ticket back to needs review.
Change 867695 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/grafana-grizzly@master] wdqs: fix request request error ratio sli pane
Change 879599 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] [WIP] wdqs: add recording rule for req success ratio
Change 879606 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/grafana-grizzly@master] [WIP] wdqs: use pre-computed wdqs recording rules
Change 879599 merged by Ryan Kemper:
[operations/puppet@production] wdqs: add recording rule for req success ratio
The above patch (879599) was reverted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/883223/
Change 883610 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] wdqs: add recording rule for req success ratio
Change 883610 merged by Ryan Kemper:
[operations/puppet@production] wdqs: add recording rule for req success ratio
Change 879606 merged by Ryan Kemper:
[operations/grafana-grizzly@master] wdqs: use pre-computed wdqs recording rules
Change 867695 abandoned by Ryan Kemper:
[operations/grafana-grizzly@master] wdqs: fix request request error ratio sli pane
Reason:
obsoleted by https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/879606
Change 912944 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/grafana-grizzly@master] wdqs: make uptime sli a %
Change 912944 merged by Ryan Kemper:
[operations/grafana-grizzly@master] wdqs: make uptime sli a %
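For context on what "make uptime sli a %" and the "Request Error Budget Remaining" panel amount to arithmetically, here is a minimal Python sketch. It is not the dashboard's actual query, which is not quoted in this task. It assumes the SLI arrives as a success ratio between 0 and 1 and uses the common definition of remaining error budget; the 0.95 target and 0.99 ratio in the demo are made-up numbers, not the real WDQS SLO values.

```python
def sli_percent(success_ratio):
    """Convert a 0..1 success ratio into a percentage for display."""
    return success_ratio * 100.0


def error_budget_remaining(success_ratio, slo_target):
    """Fraction of the error budget still unspent.

    Assumed definition: 1 - (observed error ratio / allowed error ratio).
    Returns 1.0 when nothing is spent, 0.0 when the budget is exactly used
    up, and a negative value when the SLO has been missed.
    """
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be in [0, 1)")
    allowed_error = 1.0 - slo_target
    observed_error = 1.0 - success_ratio
    return 1.0 - observed_error / allowed_error


if __name__ == "__main__":
    ratio = 0.99   # hypothetical observed success ratio
    target = 0.95  # hypothetical SLO target, not the real WDQS one
    print(f"SLI: {sli_percent(ratio):.2f}%")
    print(f"Error budget remaining: {error_budget_remaining(ratio, target):.0%}")
```

With these illustrative numbers, one fifth of the allowed errors has been used, so roughly 80% of the budget remains.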
Forgot to link patch but here's the (hopefully final) grizzly patch to get this where we want it: https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/917938
(patch already merged & deployed)