- editcheck
- backfill editcheck sloth rule metrics from 2025-06-01 to 2025-11-01
- tonecheck
- xlab
- wikifunctions
Description
| Status | Assigned | Task |
|---|---|---|
| Open | None | T403729 Pyrra calculations for the Initial error budget value of calendar windows |
| Resolved | herron | T404171 Evaluate Sloth as a possible replacement for Pyrra |
| Resolved | herron | T409310 Sloth: onboard subset of existing SLOs to pilot |
Event Timeline
# tonecheck.yml
version: "prometheus/v1"
service: "tonecheck"
labels:
  owner: "sre"
slos:
  - name: "tone-check-availability"
    objective: 95.0
    description: "Tone check pre save checks"
    sli:
      events:
        error_query: |
          sum(
            rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway",
              app="istio-ingressgateway", destination_canonical_service="edit-check-predictor",
              response_code=~"5..", prometheus="k8s-mlserve" }[{{.window}}])
          )
        total_query: |
          sum(
            rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway",
              app="istio-ingressgateway", destination_canonical_service="edit-check-predictor",
              prometheus="k8s-mlserve" }[{{.window}}])
          )
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true

$ sudo docker run --rm -v $(pwd):/data ghcr.io/slok/sloth:latest generate -i /data/tonecheck.yml -o /data/tonecheck-out.yaml
INFO[0000] SLO period windows loaded svc=alert.WindowsRepo version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d windows=2
INFO[0000] Plugins loaded sli-plugins=0 slo-plugins=11 version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d
# editcheck.yml
version: "prometheus/v1"
service: "edit-check"
labels:
  owner: "sre"
slos:
  - name: "edit-check-pre-save-checks-ratio"
    objective: 99.0
    description: "Edit check pre save checks"
    sli:
      events:
        error_query: |
          sum(
            rate(editcheck_sli_presavechecks_shown_vs_available_total[{{.window}}])
          )
        total_query: |
          sum(
            rate(editcheck_sli_presavechecks_available_total[{{.window}}])
          )
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true

$ sudo docker run --rm -v $(pwd):/data ghcr.io/slok/sloth:latest generate -i /data/editcheck.yml -o /data/editcheck-out.yaml
INFO[0000] SLO period windows loaded svc=alert.WindowsRepo version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d windows=2
INFO[0000] Plugins loaded sli-plugins=0 slo-plugins=11 version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d
# xlab.yml
version: "prometheus/v1"
service: "xlab"
labels:
  owner: "sre"
slos:
  - name: "xlab-standalone-event-validation-success-rate"
    objective: 95
    description: "xlab standalone event validation success rate"
    sli:
      events:
        error_query: |
          sum(
            rate(eventgate_validation_errors_total{service="eventgate-analytics-external", stream="product_metrics.web_base",
              error_type=~"HoistingError|MalformedHeaderError|ValidationError", prometheus="k8s"}[{{.window}}])
          )
          or vector(0)
        total_query: |
          sum(
            rate(eventgate_events_produced_total{service="eventgate-analytics-external", stream="product_metrics.web_base", prometheus="k8s"}[{{.window}}])
          ) +
          sum(
            rate(eventgate_validation_errors_total{service="eventgate-analytics-external", error_type=~"HoistingError|MalformedHeaderError", prometheus="k8s"}[{{.window}}])
          )
          or vector(0)
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true

$ sudo docker run --rm -v $(pwd):/data ghcr.io/slok/sloth:latest generate -i /data/xlab.yml -o /data/xlab-out.yaml
INFO[0000] SLO period windows loaded svc=alert.WindowsRepo version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d windows=2
INFO[0000] Plugins loaded sli-plugins=0 slo-plugins=11 version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d
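The three specs above all use the `{{.window}}` placeholder in their queries. As a rough illustration of what `sloth generate` does with it, the sketch below expands the editcheck queries into one SLI error-ratio rule per window. The window list is copied from the generated wikifunctions rules later in this thread and is assumed to apply here as well; `render` and `sli_error_rule` are hypothetical helpers, and the plain string replacement is only a stand-in for Sloth's real template rendering.

```python
# Windows as they appear in the generated rules later in this thread (assumed to be the same here).
WINDOWS = ["5m", "30m", "1h", "2h", "6h", "1d", "3d"]

ERROR_QUERY = "sum(rate(editcheck_sli_presavechecks_shown_vs_available_total[{{.window}}]))"
TOTAL_QUERY = "sum(rate(editcheck_sli_presavechecks_available_total[{{.window}}]))"


def render(query: str, window: str) -> str:
    """Simplified stand-in for Sloth's template rendering of {{.window}}."""
    return query.replace("{{.window}}", window)


def sli_error_rule(error_query: str, total_query: str, window: str) -> dict:
    """Build one recording rule: error ratio = error_query / total_query."""
    return {
        "record": f"slo:sli_error:ratio_rate{window}",
        "expr": f"({render(error_query, window)})\n/\n({render(total_query, window)})",
        "labels": {"sloth_window": window},
    }


rules = [sli_error_rule(ERROR_QUERY, TOTAL_QUERY, w) for w in WINDOWS]

if __name__ == "__main__":
    for rule in rules:
        print(rule["record"])
```

In the generated output further down, the 30d rule is not rendered this way: it is derived from the 5m recording rule with `sum_over_time` / `count_over_time`, so it is left out of this sketch.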
The Sloth editcheck metrics have been backfilled for the range --start=2025-06-01T00:00:00Z --end=2025-11-01T00:00:00Z:
ts=2025-11-05T21:23:01.789103977Z caller=sidecar.go:254 level=info msg="successfully loaded prometheus external labels" external_labels="{recorder=\"backfill\", replica=\"backfill-slec1\"}"
ts=2025-11-05T21:28:30.624989819Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24F9VB7BR7WJS0694R8N
ts=2025-11-05T21:28:31.495716712Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24GXYFEJJ70T5E16XD6G
ts=2025-11-05T21:28:32.454824737Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24JE7KQ6ZCD64TESE6QE
ts=2025-11-05T21:28:33.338885526Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24KZYX2M99Z2KXDVZX78
ts=2025-11-05T21:28:34.357171737Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24NWRVF2XK8TCQRQFJ6A
ts=2025-11-05T21:28:35.326738977Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24QTGA0BD56SQHCS8Y1B
ts=2025-11-05T21:28:36.397025309Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24SPSY9WKV1JVR31H592
ts=2025-11-05T21:28:37.516996311Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24VF7A1SK9X6NB1M4J7Q
ts=2025-11-05T21:28:38.572357514Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY1ZR975J4MWYP1M17SYF6
ts=2025-11-05T21:28:39.370165791Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY1HF76KST12XRADY400ET
ts=2025-11-05T21:28:40.122274698Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AXYD62HGWNVC9A4T6N1720
ts=2025-11-05T21:28:40.938620272Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AXYDG1BEMQH756SXQ8R96R
ts=2025-11-05T21:28:41.749643049Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AXYDTKEBPZ8FPXT4V8XN1B
I notice that the backfill metrics are idle at the beginning of the range and then pick up around Sept 8, but this matches the underlying query. Here's a quick Thanos link comparing the two: https://w.wiki/FxGv
editcheck's metrics seem to lead to:
execution: found duplicate series for the match group {sloth_id="edit-check-edit-check-pre-save-checks-ratio"} on the right hand-side of the operation: [{owner="sre", recorder="thanos-rule@pilot", sloth_id="edit-check-edit-check-pre-save-checks-ratio", sloth_service="edit-check", sloth_slo="edit-check-pre-save-checks-ratio"}, {owner="sre", recorder="backfill", sloth_id="edit-check-edit-check-pre-save-checks-ratio", sloth_service="edit-check", sloth_slo="edit-check-pre-save-checks-ratio"}];many-to-many matching not allowed: matching labels must be unique on one side
It seems to be an issue caused by the two different recorder label values; is there anything that we can do?
Edit: the query leading to the above error contains group_left(), so having the two recorder labels is not great. I worked around it by wrapping the metric with sum without(recorder) (metric) and the issue went away.
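To make the failure mode concrete, here is a small Python sketch (not PromQL, and not how Prometheus is implemented) that mimics the matching at a single instant. The label sets follow the error message above, trimmed to the relevant labels; the sample values and the helper names `duplicate_match_groups` and `aggregate_without` are made up for illustration.

```python
from collections import defaultdict


def duplicate_match_groups(series, on):
    """Return the match-group keys that more than one series maps to."""
    groups = defaultdict(list)
    for labels, _value in series:
        groups[tuple(labels[name] for name in on)].append(labels)
    return [key for key, members in groups.items() if len(members) > 1]


def aggregate_without(series, drop, agg=sum):
    """Mimic `agg without(<drop>) (metric)` for samples at one instant."""
    buckets = defaultdict(list)
    for labels, value in series:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        buckets[kept].append(value)
    return [(dict(kept), agg(values)) for kept, values in buckets.items()]


SLOTH_ID = "edit-check-edit-check-pre-save-checks-ratio"

# Hypothetical sample values; only the labels come from the error message.
right_hand_side = [
    ({"sloth_id": SLOTH_ID, "recorder": "thanos-rule@pilot"}, 0.01),
    ({"sloth_id": SLOTH_ID, "recorder": "backfill"}, 0.01),
]

summed = aggregate_without(right_hand_side, drop={"recorder"})
maxed = aggregate_without(right_hand_side, drop={"recorder"}, agg=max)

if __name__ == "__main__":
    print(duplicate_match_groups(right_hand_side, on=["sloth_id"]))  # two series share one group
    print(duplicate_match_groups(summed, on=["sloth_id"]))           # no duplicates left
```

One caveat with the workaround: wherever both recorders have a sample at the same timestamp, `sum` adds the two values together, whereas `max without(recorder)` would keep a single value. Whether that matters depends on whether the backfilled and live series overlap in time.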
@herron Hi! Could you please backfill slo:period_error_budget_remaining:ratio too? I see that the time series starts on Oct 27th. This is the rolling window metric and I'd like to see how it looks over a quarter.
We chatted about this on IRC. TL;DR: it was backfilled, but because the recording rules are nested we'll need to break the nested rules up and backfill them in multiple passes.
<herron> ah also elukey, re: backfilling -- there may be some new behavior here to sort out, I think the way sloth structures nested recording rules may require multiple backfill passes on certain subsets of recording rules
11:06 AM <herron> that one you requested was backfilled, but its based on an expression that uses another recording rule
11:06 AM <herron> and that one is based on yet anothe recording rule, etc etc
11:06 AM <elukey> oh ok right right
11:06 AM <elukey> recording rule inception
11:06 AM <herron> so that's a ball of string to unravel haha yeah exactly
11:07 AM <elukey> if it is cumbersome it is not super important, it was just to visualize the rolling window on a wider timespam
11:07 AM <elukey> *span
11:07 AM <elukey> we can wait a couple of weeks
11:07 AM <herron> ok, yeah I think at a minimum we can document/note the behavior in the pilot doc. we'll want to sort that out in the long term
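One way to picture the "multiple passes" idea is to group the recording rules by dependency depth, so that each pass only backfills rules whose inputs already exist. The sketch below does this with Python's standard `graphlib`. The dependency map is read off the generated wikifunctions rules shown later in this thread, on the assumption that the edit-check rules have the same shape; `backfill_passes` is a hypothetical helper and not part of Sloth or of any backfill tooling.

```python
from graphlib import TopologicalSorter

# rule -> recording rules it reads from (raw metrics are not listed).
RULE_DEPS = {
    "slo:sli_error:ratio_rate5m": set(),
    "slo:error_budget:ratio": set(),
    "slo:sli_error:ratio_rate30d": {"slo:sli_error:ratio_rate5m"},
    "slo:period_burn_rate:ratio": {
        "slo:sli_error:ratio_rate30d",
        "slo:error_budget:ratio",
    },
    "slo:period_error_budget_remaining:ratio": {"slo:period_burn_rate:ratio"},
}


def backfill_passes(deps):
    """Group rules into passes; each pass depends only on earlier passes."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    passes = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every rule whose inputs are done
        passes.append(ready)
        sorter.done(*ready)
    return passes


if __name__ == "__main__":
    for number, rules in enumerate(backfill_passes(RULE_DEPS), start=1):
        print(f"pass {number}: {', '.join(rules)}")
```

With this dependency map, slo:period_error_budget_remaining:ratio lands in the fourth pass, which is consistent with the observation that a single backfill run did not produce useful history for it.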
onboarded wikifunctions today as well with config:
# This example shows a simple service level by implementing a single SLO without alerts.
# It disables page (critical) and ticket (warning) alerts.
# The SLO SLI measures the event errors as the http request responses with the code >=500 and 429.
#
# `sloth generate -i ./examples/no-alerts.yml`
#
version: "prometheus/v1"
service: "wikifunctions"
labels:
  owner: "abstract-wikipedia"
slos:
  - name: "wikifunctions-backend-combined"
    objective: 98.5
    description: "wikifunctions backend combined SLO"
    sli:
      events:
        error_query: |
          sum(
            rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[{{.window}}])
          )
        total_query: |
          sum(
            rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[{{.window}}])
          )
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true

~/sloth$ sudo docker run --rm -v $(pwd):/data ghcr.io/slok/sloth:latest generate -i /data/wikifunctions.yml -o /data/wikifunctions-out.yml
INFO[0000] SLO period windows loaded svc=alert.WindowsRepo version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d windows=2
INFO[0000] Plugins loaded sli-plugins=0 slo-plugins=11 version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d
---
# Code generated by Sloth (3a24ef6384adfac15af1b8d5898e7a05bed2f5f0): https://github.com/slok/sloth.
# DO NOT EDIT.
groups:
- name: sloth-slo-sli-recordings-wikifunctions-wikifunctions-backend-combined
rules:
- record: slo:sli_error:ratio_rate5m
expr: |
(sum(
rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[5m])
)
)
/
(sum(
rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[5m])
)
)
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
sloth_window: 5m
- record: slo:sli_error:ratio_rate30m
expr: |
(sum(
rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[30m])
)
)
/
(sum(
rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[30m])
)
)
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
sloth_window: 30m
- record: slo:sli_error:ratio_rate1h
expr: |
(sum(
rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[1h])
)
)
/
(sum(
rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[1h])
)
)
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
sloth_window: 1h
- record: slo:sli_error:ratio_rate2h
expr: |
(sum(
rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[2h])
)
)
/
(sum(
rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[2h])
)
)
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
sloth_window: 2h
- record: slo:sli_error:ratio_rate6h
expr: |
(sum(
rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[6h])
)
)
/
(sum(
rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[6h])
)
)
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
sloth_window: 6h
- record: slo:sli_error:ratio_rate1d
expr: |
(sum(
rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[1d])
)
)
/
(sum(
rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[1d])
)
)
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
sloth_window: 1d
- record: slo:sli_error:ratio_rate3d
expr: |
(sum(
rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[3d])
)
)
/
(sum(
rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[3d])
)
)
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
sloth_window: 3d
- record: slo:sli_error:ratio_rate30d
expr: |
sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="wikifunctions-wikifunctions-backend-combined", sloth_service="wikifunctions", sloth_slo="wikifunctions-backend-combined"}[30d])
/ ignoring (sloth_window)
count_over_time(slo:sli_error:ratio_rate5m{sloth_id="wikifunctions-wikifunctions-backend-combined", sloth_service="wikifunctions", sloth_slo="wikifunctions-backend-combined"}[30d])
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
sloth_window: 30d
- name: sloth-slo-meta-recordings-wikifunctions-wikifunctions-backend-combined
rules:
- record: slo:objective:ratio
expr: vector(0.985)
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
- record: slo:error_budget:ratio
expr: vector(1-0.985)
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
- record: slo:time_period:days
expr: vector(30)
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
- record: slo:current_burn_rate:ratio
expr: |
slo:sli_error:ratio_rate5m{sloth_id="wikifunctions-wikifunctions-backend-combined", sloth_service="wikifunctions", sloth_slo="wikifunctions-backend-combined"}
/ on(sloth_id, sloth_slo, sloth_service) group_left
slo:error_budget:ratio{sloth_id="wikifunctions-wikifunctions-backend-combined", sloth_service="wikifunctions", sloth_slo="wikifunctions-backend-combined"}
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
- record: slo:period_burn_rate:ratio
expr: |
slo:sli_error:ratio_rate30d{sloth_id="wikifunctions-wikifunctions-backend-combined", sloth_service="wikifunctions", sloth_slo="wikifunctions-backend-combined"}
/ on(sloth_id, sloth_slo, sloth_service) group_left
slo:error_budget:ratio{sloth_id="wikifunctions-wikifunctions-backend-combined", sloth_service="wikifunctions", sloth_slo="wikifunctions-backend-combined"}
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
- record: slo:period_error_budget_remaining:ratio
expr: 1 - slo:period_burn_rate:ratio{sloth_id="wikifunctions-wikifunctions-backend-combined",
sloth_service="wikifunctions", sloth_slo="wikifunctions-backend-combined"}
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
- record: sloth_slo_info
expr: vector(1)
labels:
owner: abstract-wikipedia
sloth_id: wikifunctions-wikifunctions-backend-combined
sloth_mode: cli-gen-prom
sloth_objective: "98.5"
sloth_service: wikifunctions
sloth_slo: wikifunctions-backend-combined
sloth_spec: prometheus/v1
sloth_version: 3a24ef6384adfac15af1b8d5898e7a05bed2f5f0

Updated the wikifunctions Sloth pilot SLO to enable low-priority "ticket" alerting:
alerting:
  name: SlothPilotSLOBudgetBurn
  labels:
    notes: "test please ignore"
  annotations:
    runbook: "https://phabricator.wikimedia.org/T404171"
  page_alert:
    disable: true
  ticket_alert:
    labels:
      severity: warning
      team: o11y

Which yields the below rule:
name: SlothPilotSLOBudgetBurn
expr: (max without (sloth_window) (slo:sli_error:ratio_rate2h{sloth_id="wikifunctions-wikifunctions-backend-combined",sloth_service="wikifunctions",sloth_slo="wikifunctions-backend-combined"} > (3 * 0.015)) and max without (sloth_window) (slo:sli_error:ratio_rate1d{sloth_id="wikifunctions-wikifunctions-backend-combined",sloth_service="wikifunctions",sloth_slo="wikifunctions-backend-combined"} > (3 * 0.015))) or (max without (sloth_window) (slo:sli_error:ratio_rate6h{sloth_id="wikifunctions-wikifunctions-backend-combined",sloth_service="wikifunctions",sloth_slo="wikifunctions-backend-combined"} > (1 * 0.015)) and max without (sloth_window) (slo:sli_error:ratio_rate3d{sloth_id="wikifunctions-wikifunctions-backend-combined",sloth_service="wikifunctions",sloth_slo="wikifunctions-backend-combined"} > (1 * 0.015)))
labels:
  notes: test please ignore
  recorder: thanos-rule@pilot
  replica: a
  severity: warning
  sloth_severity: ticket
  team: o11y
annotations:
  runbook: https://phabricator.wikimedia.org/T404171
  summary: {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget burn rate is over expected.
  title: (ticket) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget burn rate is too fast.

And it does indeed fire an alert (as seen on alerts.wm.o).
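For readers less familiar with multi-window burn-rate alerts, the ticket rule's expression can be read as two window pairs, each compared against a multiple of the error budget (1 - 0.985 = 0.015). The Python sketch below restates that logic and the period_error_budget_remaining calculation from the generated rules. The function names and the sample ratios in the usage lines are hypothetical and are not measurements from the pilot.

```python
ERROR_BUDGET = 0.015  # 1 - 0.985, the literal used in the generated alert expression


def ticket_alert_fires(ratio_2h, ratio_1d, ratio_6h, ratio_3d, budget=ERROR_BUDGET):
    """Mirror the rule: (2h and 1d over 3x budget) or (6h and 3d over 1x budget)."""
    faster_pair = ratio_2h > 3 * budget and ratio_1d > 3 * budget
    slower_pair = ratio_6h > 1 * budget and ratio_3d > 1 * budget
    return faster_pair or slower_pair


def period_error_budget_remaining(ratio_30d, budget=ERROR_BUDGET):
    """1 - slo:period_burn_rate:ratio, where burn rate = 30d error ratio / budget."""
    return 1 - ratio_30d / budget


if __name__ == "__main__":
    # Hypothetical error ratios, chosen only to exercise each branch.
    print(ticket_alert_fires(0.05, 0.05, 0.0, 0.0))    # both short-pair windows exceed 0.045
    print(ticket_alert_fires(0.05, 0.01, 0.02, 0.01))  # neither pair has both windows over
    print(period_error_budget_remaining(0.0075))       # half of the budget consumed
```

Requiring both windows of a pair to exceed the threshold means a brief spike in the shorter window alone does not fire the alert.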
Closing this as pilot onboarding has finished; wider onboarding will be tracked in the parent task!
