Migrate existing SLO related metrics to recording rules
Open, Medium, Public

Description

Today our SLO dashboards compute SLIs by querying raw metrics directly. Moving these computations into recording rules would simplify and shorten the dashboard queries and likely improve performance. At the same time, we can begin to settle on a common naming scheme for our SLO-related metrics.

Because the SLO dashboard queries often span 90 days or more, a strategy for backfilling metrics from recording rules would be ideal as well. Let's track that independently, though, and not let it block this work.
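
As a rough illustration of the approach described above, a recording rule for an etcd HTTP error-ratio SLI might look something like the sketch below. The rule name is taken from one of the patches on this task; everything else is an assumption for illustration only: the group name, the source metrics (`etcd_http_failed_total`, `etcd_http_received_total`), and the `cluster` and `site` labels may not match what the merged patches actually use.

```yaml
# Hypothetical sketch, not the merged rule: group name, source metrics,
# and label names are assumptions.
groups:
  - name: slo_etcd_example
    rules:
      # Precompute the 5-minute error ratio per cluster and site, so the
      # dashboard can query one short series name instead of the full
      # expression.
      - record: cluster_site:sli_etcd_http_error_ratio:rate5m
        expr: |
          sum by (cluster, site) (rate(etcd_http_failed_total[5m]))
          /
          sum by (cluster, site) (rate(etcd_http_received_total[5m]))
```

With a rule like this in place, a dashboard panel could query `cluster_site:sli_etcd_http_error_ratio:rate5m` directly. The result here is a unit-scale ratio (0 to 1) rather than a percentage.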

Event Timeline

Change 714814 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] prometheus: add recording rules for etcd error slo

https://gerrit.wikimedia.org/r/714814

Change 714814 merged by Herron:

[operations/puppet@production] thanos: add recording rules for etcd error slo

https://gerrit.wikimedia.org/r/714814

Change 716535 had a related patch set uploaded (by Herron; author: Herron):

[operations/grafana-grizzly@master] slo_dashboard: switch etcd request slo query to recording rule metrics

https://gerrit.wikimedia.org/r/716535

Change 717473 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] thanos::rule: add cluster_site:sli_etcd_http_error_ratio:rate5m recording rule

https://gerrit.wikimedia.org/r/717473

Change 716535 merged by Herron:

[operations/grafana-grizzly@master] slo_dashboard: switch etcd request slo query to recording rule metrics

https://gerrit.wikimedia.org/r/716535

lmata triaged this task as Medium priority. Sep 30 2021, 8:44 PM

Change 740209 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] thanos: add recording rules for varnish SLO

https://gerrit.wikimedia.org/r/740209

Change 914945 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] thanos: Migrate from 100-scale to unit-scale SLO recording rules

https://gerrit.wikimedia.org/r/914945

Change 914946 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/grafana-grizzly@master] Migrate from 100-scale to unit-scale SLO recording rules

https://gerrit.wikimedia.org/r/914946

Change 914945 merged by RLazarus:

[operations/puppet@production] thanos: Migrate from 100-scale to unit-scale SLO recording rules

https://gerrit.wikimedia.org/r/914945

Change #717473 abandoned by Herron:

[operations/puppet@production] thanos::rule: add cluster_site:sli_etcd_http_error_ratio:rate5m recording rule

Reason:

spring cleaning -- stale patch

https://gerrit.wikimedia.org/r/717473