Migrate existing SLO related metrics to recording rules
Open, Medium, Public

Description

Today our SLO dashboards compute SLIs by querying the underlying metrics directly. Moving these queries to recording rules would simplify and shorten them, and would likely improve performance. At the same time, we can begin to settle on a common naming scheme for our SLO-related metrics.

Since these are often 90d+ queries, a strategy to backfill metrics from recording rules would be ideal as well. Let's track that independently, though, and not let it block this work.
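To make the naming scheme concrete, here is a rough sketch of how an error-ratio recording rule could be generated following the Prometheus `level:metric:operations` convention, which is the shape of the `cluster_site:sli_etcd_http_error_ratio:rate5m` rule named in the patches on this task. The helper names, the `cluster` and `site` aggregation labels, and the two source counters (`etcd_http_failed_total`, `etcd_http_received_total`) are assumptions for illustration only, not the contents of the merged patches.

```python
import re

# One colon-separated part of a recording rule name.
_PART = r"[a-zA-Z_][a-zA-Z0-9_]*"
# Prometheus convention for recording rules: level:metric:operations
RULE_NAME_RE = re.compile(rf"^{_PART}:{_PART}:{_PART}$")


def rule_name(level: str, metric: str, operations: str) -> str:
    """Join the three parts and check the result follows the convention."""
    name = f"{level}:{metric}:{operations}"
    if not RULE_NAME_RE.match(name):
        raise ValueError(f"not a level:metric:operations name: {name!r}")
    return name


def error_ratio_rule(labels, sli, window, errors_metric, total_metric):
    """Build a rule dict (record + expr) for an errors/total ratio SLI.

    `labels` are the aggregation labels; they also form the name's level part.
    """
    by = ", ".join(labels)
    expr = (
        f"sum by ({by}) (rate({errors_metric}[{window}])) / "
        f"sum by ({by}) (rate({total_metric}[{window}]))"
    )
    return {
        "record": rule_name("_".join(labels), sli, f"rate{window}"),
        "expr": expr,
    }


# Hypothetical inputs: label and counter names are assumed, not taken from the patches.
rule = error_ratio_rule(
    labels=["cluster", "site"],
    sli="sli_etcd_http_error_ratio",
    window="5m",
    errors_metric="etcd_http_failed_total",
    total_metric="etcd_http_received_total",
)
print(rule["record"])
print(rule["expr"])
```

With a rule like this in place, a dashboard panel can select the short recorded series by name instead of re-evaluating the full ratio expression over the whole SLO window.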

Event Timeline

Change 714814 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] prometheus: add recording rules for etcd error slo

https://gerrit.wikimedia.org/r/714814

Change 714814 merged by Herron:

[operations/puppet@production] thanos: add recording rules for etcd error slo

https://gerrit.wikimedia.org/r/714814

Change 716535 had a related patch set uploaded (by Herron; author: Herron):

[operations/grafana-grizzly@master] slo_dashboard: switch etcd request slo query to recording rule metrics

https://gerrit.wikimedia.org/r/716535

Change 717473 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] thanos::rule: add cluster_site:sli_etcd_http_error_ratio:rate5m recording rule

https://gerrit.wikimedia.org/r/717473

Change 716535 merged by Herron:

[operations/grafana-grizzly@master] slo_dashboard: switch etcd request slo query to recording rule metrics

https://gerrit.wikimedia.org/r/716535

lmata triaged this task as Medium priority. Thu, Sep 30, 8:44 PM