
Thanos rule evaluation alerts for service_slis
Closed, Resolved · Public · 8 Estimated Story Points

Description

With the SLI recording rules added as part of T323064: Create WDQS Uptime SLO dashboard in Grizzly, we're crossing the threshold where evaluating a rule (group) takes longer than the evaluation interval (60s), leading to alerts:

(screenshot: 2023-01-30-145501_1229x792_scrot.png)

AFAICT this is due to the fact that trafficserver_backend_requests_seconds_count is a "big" metric (broken down per cache host, for example) and thus takes a while to query across the infra, even more so when asking for 90/91/92 days of history.

In total, rule evaluation for the service_slis group jumped from <10s to >60s (i.e. it can no longer keep up with the evaluation interval).

(screenshot: 2023-01-30-150232_1241x830_scrot.png)

In terms of solutions/mitigations we can:

  • in the short term, isolate the wdqs SLIs into their own rule group (see the sketch after the excerpt below)
  • also in the short term, increase the evaluation interval of said group to every 3-4 minutes
  • medium term, evaluate (hah!) whether we can reformulate these rules in terms of lower-cardinality, pre-aggregated metrics; for example, for per-backend ATS availability we have this in modules/profile/files/prometheus/rules_ops.yml:
- record: job_backend:trafficserver_backend_requests:avail5m
  expr: sum by(backend, job) (job_method_status_backend_layer:trafficserver_backend_requests_seconds_count:rate5m{status=~"5.."})
    / sum by(backend, job) (job_method_status_backend_layer:trafficserver_backend_requests_seconds_count:rate5m{status=~"[12345].."})
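
As a rough sketch of the first two mitigations (the group name, record name, 4m interval, and backend label selector below are illustrative assumptions, not the contents of the actual patch), the split could look something like this in a Prometheus/Thanos rule file:

groups:
  - name: wdqs_slis                 # wdqs SLI rules isolated from service_slis
    interval: 4m                    # slower cadence than the default 60s
    rules:
      # Hypothetical lower-cardinality reformulation: aggregate only the wdqs
      # backend from the pre-aggregated rate5m series instead of re-aggregating
      # the raw trafficserver_backend_requests_seconds_count metric.
      - record: wdqs:trafficserver_backend_requests:avail5m
        expr: >
          1 - (
            sum(job_method_status_backend_layer:trafficserver_backend_requests_seconds_count:rate5m{backend=~"wdqs.*", status=~"5.."})
            /
            sum(job_method_status_backend_layer:trafficserver_backend_requests_seconds_count:rate5m{backend=~"wdqs.*", status=~"[12345].."})
          )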

Event Timeline

Change 884906 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: split wdqs SLIs in a new group

https://gerrit.wikimedia.org/r/884906

Change 884906 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: split wdqs SLIs in a new group

https://gerrit.wikimedia.org/r/884906

OK, the short-term bandaid did help:

  sum by (job, instance, rule_group) (prometheus_rule_group_last_duration_seconds{job=~".*thanos-rule.*"})
    >
  sum by (job, instance, rule_group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})

(screenshot: 2023-01-30-161307_1262x918_scrot.png)

Moving to lower-cardinality metrics still needs to be discussed; I'm happy to assist in that discussion, but unfortunately I can't directly follow the implementation/task. What do you think @RKemper @herron?

Thanks @fgiunchedi. I'll pop by the observability IRC channel tomorrow (Weds, Americas daytime) and see if there are any ideas on lowering the cardinality.

Change 900430 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/grafana-grizzly@master] [WIP] wdqs: test new metric option

https://gerrit.wikimedia.org/r/900430

Gehel set the point value for this task to 8. · Mar 20 2023, 7:39 PM

Change 900430 merged by Ryan Kemper:

[operations/grafana-grizzly@master] wdqs: make sli uptime use pre-existing metric

https://gerrit.wikimedia.org/r/900430

Change 911936 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/grafana-grizzly@master] wdqs: try alternative slo query approach

https://gerrit.wikimedia.org/r/911936

Change 911936 merged by Ryan Kemper:

[operations/grafana-grizzly@master] wdqs: try alternative slo query approach

https://gerrit.wikimedia.org/r/911936

Change 912382 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: no longer need recording rule

https://gerrit.wikimedia.org/r/912382

Change 912382 merged by Ryan Kemper:

[operations/puppet@production] wdqs: no longer need recording rule

https://gerrit.wikimedia.org/r/912382

@Gehel With the recording rule removed in https://gerrit.wikimedia.org/r/912382, there shouldn't be any performance issues, since we're no longer recording anything. The latest query settings in https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/917938 and the previous patches give acceptable query performance, i.e. we don't get timeouts when viewing the graph.
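
For reference, a dashboard-time query of roughly this shape, computed against the pre-aggregated per-backend rate5m series instead of a dedicated recording rule, is the kind of approach these patches move towards (the metric, the backend label selector, and the use of Grafana's $__range variable below are illustrative assumptions, not the exact query from the merged change):

  # Availability of the wdqs backend over the dashboard time range, derived
  # from pre-aggregated 5m rates; backend=~"wdqs.*" is an assumed selector.
  1 - (
    sum(sum_over_time(job_method_status_backend_layer:trafficserver_backend_requests_seconds_count:rate5m{backend=~"wdqs.*", status=~"5.."}[$__range]))
    /
    sum(sum_over_time(job_method_status_backend_layer:trafficserver_backend_requests_seconds_count:rate5m{backend=~"wdqs.*", status=~"[12345].."}[$__range]))
  )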

So this should be ready to be marked resolved.