Create WDQS Lag SLO dashboard with Grizzly && documentation
Closed, Resolved | Public | 5 Estimated Story Points

Description

We currently have a manually created dashboard for the W[CD]QS update lag SLO. Since we have standardized our SLO dashboards on Grizzly templates, it makes sense to migrate this dashboard to Grizzly as well. Doing so will make SLOs more discoverable and more uniform, and will include update lag in our standard SLO reporting.

AC:

Event Timeline

MPhamWMF moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board.
RKemper renamed this task from Create WDQS Lag SLO dashboard with Grizzly to Create WDQS Lag SLO dashboard with Grizzly && documentation. Feb 1 2023, 5:50 PM
RKemper updated the task description.

Fully deployed latest (working) dashboard:

ryankemper@grafana1002:~/grafana-grizzly$ grr diff slo_dashboards.jsonnet
Dashboard/slo-WDQS changes detected:
--- Remote
+++ Local
@@ -86,7 +86,7 @@
           "span": 6,
           "targets": [
             {
-              "expr": "100 * (1 - (1 - sum by (job) (increase(trafficserver_backend_requests_seconds_count{status=~\"200|403|429\",site=~\"$site\",backend=\"wdqs.discovery.wmnet\"}[90d])) / sum by (job) (increase (trafficserver_backend_requests_seconds_count{site=~\"$site\",backend=\"wdqs.discovery.wmnet\"}[90d]))) / .05)",
+              "expr": "100 * (1 - (1 - job_site:sli_wdqs_req_success_ratio:increase90d{site=~\"$site\"}) / .05)",
               "format": "time_series",
               "instant": true,
               "intervalFactor": 2,
@@ -144,7 +144,7 @@
           "steppedLine": false,
           "targets": [
             {
-              "expr": "1 - sum by (job) (increase(trafficserver_backend_requests_seconds_count{status=~\"200|403|429\",site=~\"$site\",backend=\"wdqs.discovery.wmnet\"}[90d]) / increase (trafficserver_backend_requests_seconds_count{site=~\"$site\",backend=\"wdqs.discovery.wmnet\"}[90d]))",
+              "expr": "1 - job_site:sli_wdqs_req_success_ratio:increase90d{site=~\"$site\"}",
               "format": "time_series",
               "intervalFactor": 2,
               "legendFormat": "{{site}}",

Dashboard/slo-apigw no differences
Dashboard/slo-Etcd no differences
Dashboard/slo-HAProxy no differences
Dashboard/slo-Logstash no differences
Dashboard/slo-Trafficserver no differences
Dashboard/slo-Varnish no differences
ryankemper@grafana1002:~/grafana-grizzly$ grr apply slo_dashboards.jsonnet
Dashboard/slo-Logstash no differences
Dashboard/slo-Trafficserver no differences
Dashboard/slo-Varnish no differences
Dashboard/slo-WDQS updated
Dashboard/slo-apigw no differences
Dashboard/slo-Etcd no differences
Dashboard/slo-HAProxy no differences

Visible at https://grafana.wikimedia.org/d/slo-WDQS/wdqs-slo-s?orgId=1
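
The first changed expression in the diff above, 100 * (1 - (1 - ratio) / .05), can be read as "percentage of error budget remaining". Below is a minimal Python sketch of that arithmetic, assuming the .05 is the allowed failure fraction (that is, a 95% target, inferred from the expression rather than stated in this task). The function name and parameters are hypothetical and are not part of the Grizzly templates.

```python
def error_budget_remaining_pct(success_ratio: float, budget: float = 0.05) -> float:
    """Mirror the panel expression 100 * (1 - (1 - success_ratio) / budget).

    success_ratio: fraction of successful requests over the window (0..1),
                   e.g. the value of the 90-day success-ratio recording rule.
    budget:        allowed failure fraction; 0.05 is an assumption read off
                   the ".05" in the dashboard expression.
    """
    if not 0.0 <= success_ratio <= 1.0:
        raise ValueError("success_ratio must be between 0 and 1")
    if budget <= 0.0:
        raise ValueError("budget must be positive")
    error_ratio = 1.0 - success_ratio          # fraction of failed requests
    budget_spent = error_ratio / budget        # 1.0 means the budget is used up
    return 100.0 * (1.0 - budget_spent)        # negative means overspent


if __name__ == "__main__":
    for ratio in (1.0, 0.975, 0.95, 0.90):
        print(ratio, round(error_budget_remaining_pct(ratio), 3))
```

Under that assumption, a 90-day success ratio of 0.975 spends half of the budget and leaves roughly 50%; a result below 0 means the budget is overspent.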

Re-opening: this ticket is about the update lag SLO, not the uptime SLO.

RKemper changed the point value for this task from 3 to 5. May 22 2023, 6:24 PM

Documentation aspect of this ticket's already done. Basically two things left to do to close this ticket out:

  • Create the grizzly dashboard

Barring any significant unexpected developments, this should be achievable within this week.

Change 933172 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/grafana-grizzly@master] [WIP] Dashboard for query service update lag

https://gerrit.wikimedia.org/r/933172

Change 933172 merged by Ryan Kemper:

[operations/grafana-grizzly@master] Dashboard for wdqs update lag

https://gerrit.wikimedia.org/r/933172