Sloth: onboard subset of existing SLOs to pilot
Closed, Resolved, Public

Description

  • editcheck
    • backfill editcheck sloth rule metrics from 2025-06-01 to 2025-11-01
  • tonecheck
  • xlab
  • wikifunctions

Event Timeline

# tonecheck.yml
version: "prometheus/v1"
service: "tonecheck"
labels:
  owner: "sre"
slos:
  - name: "tone-check-availability"
    objective: 95.0
    description: "Tone check pre save checks"
    sli:
      events:
        error_query: |
          sum(
            rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway",
            app="istio-ingressgateway", destination_canonical_service="edit-check-predictor",
            response_code=~"5..", prometheus="k8s-mlserve" }[{{.window}}])
          )
        total_query: |
          sum(
            rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway",
            app="istio-ingressgateway", destination_canonical_service="edit-check-predictor",
            prometheus="k8s-mlserve" }[{{.window}}])
          )
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true
$ sudo docker run --rm -v $(pwd):/data ghcr.io/slok/sloth:latest generate -i /data/tonecheck.yml -o /data/tonecheck-out.yaml
INFO[0000] SLO period windows loaded                     svc=alert.WindowsRepo version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d windows=2
INFO[0000] Plugins loaded                                sli-plugins=0 slo-plugins=11 version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d
# editcheck.yml
version: "prometheus/v1"
service: "edit-check"
labels:
  owner: "sre"
slos:
  - name: "edit-check-pre-save-checks-ratio"
    objective: 99.0
    description: "Edit check pre save checks"
    sli:
      events:
        error_query: |
          sum(
            rate(editcheck_sli_presavechecks_shown_vs_available_total[{{.window}}])
          )
        total_query: |
          sum(
            rate(editcheck_sli_presavechecks_available_total[{{.window}}])
          )
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true
$ sudo docker run --rm -v $(pwd):/data ghcr.io/slok/sloth:latest generate -i /data/editcheck.yml -o /data/editcheck-out.yaml
INFO[0000] SLO period windows loaded                     svc=alert.WindowsRepo version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d windows=2
INFO[0000] Plugins loaded                                sli-plugins=0 slo-plugins=11 version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d
# xlab.yml
version: "prometheus/v1"
service: "xlab"
labels:
  owner: "sre"
slos:
  - name: "xlab-standalone-event-validation-success-rate"
    objective: 95
    description: "xlab standalone event validation success rate"
    sli:
      events:
        error_query: |
          sum(
              rate(eventgate_validation_errors_total{service="eventgate-analytics-external", stream="product_metrics.web_base",
                   error_type=~"HoistingError|MalformedHeaderError|ValidationError", prometheus="k8s"}[{{.window}}])
          )
          or vector(0)
        total_query: |
            sum(
                rate(eventgate_events_produced_total{service="eventgate-analytics-external", stream="product_metrics.web_base", prometheus="k8s"}[{{.window}}])
            ) +
            sum(
                rate(eventgate_validation_errors_total{service="eventgate-analytics-external", error_type=~"HoistingError|MalformedHeaderError", prometheus="k8s"}[{{.window}}])
            )
            or vector(0)
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true
$ sudo docker run --rm -v $(pwd):/data ghcr.io/slok/sloth:latest generate -i /data/xlab.yml -o /data/xlab-out.yaml
INFO[0000] SLO period windows loaded                     svc=alert.WindowsRepo version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d windows=2
INFO[0000] Plugins loaded                                sli-plugins=0 slo-plugins=11 version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d

The sloth editcheck rules have been backfilled for the range --start=2025-06-01T00:00:00Z --end=2025-11-01T00:00:00Z

ts=2025-11-05T21:23:01.789103977Z caller=sidecar.go:254 level=info msg="successfully loaded prometheus external labels" external_labels="{recorder=\"backfill\", replica=\"backfill-slec1\"}"
ts=2025-11-05T21:28:30.624989819Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24F9VB7BR7WJS0694R8N
ts=2025-11-05T21:28:31.495716712Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24GXYFEJJ70T5E16XD6G
ts=2025-11-05T21:28:32.454824737Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24JE7KQ6ZCD64TESE6QE
ts=2025-11-05T21:28:33.338885526Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24KZYX2M99Z2KXDVZX78
ts=2025-11-05T21:28:34.357171737Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24NWRVF2XK8TCQRQFJ6A
ts=2025-11-05T21:28:35.326738977Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24QTGA0BD56SQHCS8Y1B
ts=2025-11-05T21:28:36.397025309Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24SPSY9WKV1JVR31H592
ts=2025-11-05T21:28:37.516996311Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY24VF7A1SK9X6NB1M4J7Q
ts=2025-11-05T21:28:38.572357514Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY1ZR975J4MWYP1M17SYF6
ts=2025-11-05T21:28:39.370165791Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AY1HF76KST12XRADY400ET
ts=2025-11-05T21:28:40.122274698Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AXYD62HGWNVC9A4T6N1720
ts=2025-11-05T21:28:40.938620272Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AXYDG1BEMQH756SXQ8R96R
ts=2025-11-05T21:28:41.749643049Z caller=shipper.go:372 level=info msg="upload new block" id=01K9AXYDTKEBPZ8FPXT4V8XN1B

I notice that the backfilled metrics are idle at the beginning of the range and then pick up around Sept 8, but this matches the underlying query. Here's a quick Thanos link comparing the two: https://w.wiki/FxGv

Querying editcheck's metrics seems to lead to this error:

execution: found duplicate series for the match group {sloth_id="edit-check-edit-check-pre-save-checks-ratio"} on the right hand-side of the operation: [{owner="sre", recorder="thanos-rule@pilot", sloth_id="edit-check-edit-check-pre-save-checks-ratio", sloth_service="edit-check", sloth_slo="edit-check-pre-save-checks-ratio"}, {owner="sre", recorder="backfill", sloth_id="edit-check-edit-check-pre-save-checks-ratio", sloth_service="edit-check", sloth_slo="edit-check-pre-save-checks-ratio"}];many-to-many matching not allowed: matching labels must be unique on one side

This seems to be caused by the series carrying two different recorder label values (thanos-rule@pilot and backfill). Is there anything we can do?

Edit: the query leading to the above error contains group_left(), so having the two recorder labels is not great. I worked around it by wrapping the metric with sum without(recorder) (metric), and the issue went away.
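
To illustrate why the two recorder values trip the group_left() match and why aggregating the label away helps, here is a minimal Python sketch that mimics the matching step. It is not how Prometheus or Thanos implement it: the function names and sample values are made up for illustration, and only the label sets come from the error above.

```python
from collections import defaultdict

# Label sets taken from the error message above; the sample values are made up.
RIGHT_SIDE = [
    ({"owner": "sre", "recorder": "thanos-rule@pilot",
      "sloth_id": "edit-check-edit-check-pre-save-checks-ratio"}, 0.01),
    ({"owner": "sre", "recorder": "backfill",
      "sloth_id": "edit-check-edit-check-pre-save-checks-ratio"}, 0.01),
]


def match_groups(series, on):
    """Group series by the labels used for vector matching, like on(...)."""
    groups = defaultdict(list)
    for labels, value in series:
        key = tuple((name, labels.get(name)) for name in on)
        groups[key].append((labels, value))
    return groups


def duplicate_groups(series, on):
    """Return the match groups that hold more than one series."""
    return {k: v for k, v in match_groups(series, on).items() if len(v) > 1}


def sum_without(series, drop):
    """Mimic sum without(<drop>) (...): drop labels, then sum equal label sets."""
    totals = defaultdict(float)
    for labels, value in series:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[kept] += value
    return [(dict(kept), value) for kept, value in totals.items()]


if __name__ == "__main__":
    # Two series share one sloth_id, so the "one" side of the match is ambiguous.
    print(len(duplicate_groups(RIGHT_SIDE, on=["sloth_id"])))
    # After dropping recorder, a single series is left per sloth_id.
    merged = sum_without(RIGHT_SIDE, drop={"recorder"})
    print(len(duplicate_groups(merged, on=["sloth_id"])))
```

Note that sum adds the values wherever both recorders have a sample at the same timestamp, so the workaround assumes the backfilled and live series do not overlap in time.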

@herron Hi! Could you please backfill slo:period_error_budget_remaining:ratio too? I see that the time series start from Oct 27th, this is the rolling window metric and I'd like to see how it looks over a quarter.

We chatted about this on IRC. TL;DR: it was backfilled, but because sloth nests its recording rules we'll need to break the nested rules up and backfill them in multiple passes.

herron> ah also elukey, re: backfilling -- there may be some new behavior here to sort out, I think the way sloth structures nested recording rules may require multiple backfill passes on certain subsets of recording rules
11:06 AM <herron> that one you requested was backfilled, but its based on an expression that uses another recording rule
11:06 AM <herron> and that one is based on yet anothe recording rule, etc etc
11:06 AM <elukey> oh ok right right
11:06 AM <elukey> recording rule inception
11:06 AM <herron> so that's a ball of string to unravel haha yeah exactly
11:07 AM <elukey> if it is cumbersome it is not super important, it was just to visualize the rolling window on a wider timespam
11:07 AM <elukey> *span
11:07 AM <elukey> we can wait a couple of weeks
11:07 AM <herron> ok, yeah I think at a minimum we can document/note the behavior in the pilot doc.  we'll want to sort that out in the long term
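
One way to plan the multiple passes mentioned above is to order the recording rules by what they read. The sketch below is only an illustration, not the actual backfill tooling: the function name is hypothetical, and the dependency map is read off the generated wikifunctions rules posted later in this task, on the assumption that the editcheck rules have the same shape.

```python
# Each recording rule maps to the recording rules its expression queries.
# Read off the generated wikifunctions rules in this task (assumed to match
# the shape of the editcheck rules).
DEPENDS_ON = {
    "slo:sli_error:ratio_rate5m": [],  # built from raw service metrics
    "slo:error_budget:ratio": [],      # constant vector
    "slo:sli_error:ratio_rate30d": ["slo:sli_error:ratio_rate5m"],
    "slo:period_burn_rate:ratio": [
        "slo:sli_error:ratio_rate30d",
        "slo:error_budget:ratio",
    ],
    "slo:period_error_budget_remaining:ratio": ["slo:period_burn_rate:ratio"],
}


def backfill_passes(depends_on):
    """Split rules into passes so each rule runs after the rules it reads."""
    remaining = {rule: set(deps) for rule, deps in depends_on.items()}
    done = set()
    passes = []
    while remaining:
        ready = sorted(rule for rule, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or unknown dependency among recording rules")
        passes.append(ready)
        done.update(ready)
        for rule in ready:
            del remaining[rule]
    return passes


if __name__ == "__main__":
    for number, rules in enumerate(backfill_passes(DEPENDS_ON), start=1):
        print(number, rules)
```

With this map the requested slo:period_error_budget_remaining:ratio lands in the fourth pass, which matches the "recording rule inception" described in the chat.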

Onboarded wikifunctions today as well, with this config:

# This example shows a simple service level by implementing a single SLO without alerts.
# It disables page (critical) and ticket (warning) alerts.
# The SLO SLI measures the event errors as the http request responses with the code >=500 and 429.
#
# `sloth generate -i ./examples/no-alerts.yml`
#
version: "prometheus/v1"
service: "wikifunctions"
labels:
  owner: "abstract-wikipedia"
slos:
  - name: "wikifunctions-backend-combined"
    objective: 98.5
    description: "wikifunctions backend combined SLO"
    sli:
      events:
        error_query: |
          sum(
            rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[{{.window}}])
          )
        total_query: |
          sum(
            rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[{{.window}}])
          )
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true
~/sloth$ sudo docker run --rm -v $(pwd):/data ghcr.io/slok/sloth:latest generate -i /data/wikifunctions.yml -o /data/wikifunctions-out.yml
INFO[0000] SLO period windows loaded                     svc=alert.WindowsRepo version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d windows=2
INFO[0000] Plugins loaded                                sli-plugins=0 slo-plugins=11 version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d
---
# Code generated by Sloth (3a24ef6384adfac15af1b8d5898e7a05bed2f5f0): https://github.com/slok/sloth.
# DO NOT EDIT.

groups:
- name: sloth-slo-sli-recordings-wikifunctions-wikifunctions-backend-combined
  rules:
  - record: slo:sli_error:ratio_rate5m
    expr: |
      (sum(
        rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[5m])
      )
      )
      /
      (sum(
        rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[5m])
      )
      )
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
      sloth_window: 5m
  - record: slo:sli_error:ratio_rate30m
    expr: |
      (sum(
        rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[30m])
      )
      )
      /
      (sum(
        rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[30m])
      )
      )
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
      sloth_window: 30m
  - record: slo:sli_error:ratio_rate1h
    expr: |
      (sum(
        rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[1h])
      )
      )
      /
      (sum(
        rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[1h])
      )
      )
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
      sloth_window: 1h
  - record: slo:sli_error:ratio_rate2h
    expr: |
      (sum(
        rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[2h])
      )
      )
      /
      (sum(
        rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[2h])
      )
      )
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
      sloth_window: 2h
  - record: slo:sli_error:ratio_rate6h
    expr: |
      (sum(
        rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[6h])
      )
      )
      /
      (sum(
        rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[6h])
      )
      )
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
      sloth_window: 6h
  - record: slo:sli_error:ratio_rate1d
    expr: |
      (sum(
        rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[1d])
      )
      )
      /
      (sum(
        rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[1d])
      )
      )
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
      sloth_window: 1d
  - record: slo:sli_error:ratio_rate3d
    expr: |
      (sum(
        rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket{status=~"200|4..", le="10"}[3d])
      )
      )
      /
      (sum(
        rate(mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count[3d])
      )
      )
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
      sloth_window: 3d
  - record: slo:sli_error:ratio_rate30d
    expr: |
      sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="wikifunctions-wikifunctions-backend-combined", sloth_service="wikifunctions", sloth_slo="wikifunctions-backend-combined"}[30d])
      / ignoring (sloth_window)
      count_over_time(slo:sli_error:ratio_rate5m{sloth_id="wikifunctions-wikifunctions-backend-combined", sloth_service="wikifunctions", sloth_slo="wikifunctions-backend-combined"}[30d])
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
      sloth_window: 30d
- name: sloth-slo-meta-recordings-wikifunctions-wikifunctions-backend-combined
  rules:
  - record: slo:objective:ratio
    expr: vector(0.985)
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
  - record: slo:error_budget:ratio
    expr: vector(1-0.985)
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
  - record: slo:time_period:days
    expr: vector(30)
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
  - record: slo:current_burn_rate:ratio
    expr: |
      slo:sli_error:ratio_rate5m{sloth_id="wikifunctions-wikifunctions-backend-combined", sloth_service="wikifunctions", sloth_slo="wikifunctions-backend-combined"}
      / on(sloth_id, sloth_slo, sloth_service) group_left
      slo:error_budget:ratio{sloth_id="wikifunctions-wikifunctions-backend-combined", sloth_service="wikifunctions", sloth_slo="wikifunctions-backend-combined"}
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
  - record: slo:period_burn_rate:ratio
    expr: |
      slo:sli_error:ratio_rate30d{sloth_id="wikifunctions-wikifunctions-backend-combined", sloth_service="wikifunctions", sloth_slo="wikifunctions-backend-combined"}
      / on(sloth_id, sloth_slo, sloth_service) group_left
      slo:error_budget:ratio{sloth_id="wikifunctions-wikifunctions-backend-combined", sloth_service="wikifunctions", sloth_slo="wikifunctions-backend-combined"}
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
  - record: slo:period_error_budget_remaining:ratio
    expr: 1 - slo:period_burn_rate:ratio{sloth_id="wikifunctions-wikifunctions-backend-combined",
      sloth_service="wikifunctions", sloth_slo="wikifunctions-backend-combined"}
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
  - record: sloth_slo_info
    expr: vector(1)
    labels:
      owner: abstract-wikipedia
      sloth_id: wikifunctions-wikifunctions-backend-combined
      sloth_mode: cli-gen-prom
      sloth_objective: "98.5"
      sloth_service: wikifunctions
      sloth_slo: wikifunctions-backend-combined
      sloth_spec: prometheus/v1
      sloth_version: 3a24ef6384adfac15af1b8d5898e7a05bed2f5f0
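
For readers new to the generated meta rules, the arithmetic behind slo:error_budget:ratio, slo:period_burn_rate:ratio and slo:period_error_budget_remaining:ratio can be sketched in Python as below. The function names are made up for illustration, and the 0.006 error ratio is a hypothetical input, not a value measured in the pilot.

```python
OBJECTIVE_PERCENT = 98.5  # from the wikifunctions config above


def error_budget(objective_percent):
    """slo:error_budget:ratio, i.e. vector(1-0.985) for a 98.5 objective."""
    return 1 - objective_percent / 100


def burn_rate(error_ratio, budget):
    """slo:*_burn_rate:ratio: observed error ratio divided by the budget."""
    return error_ratio / budget


def budget_remaining(period_error_ratio, budget):
    """slo:period_error_budget_remaining:ratio: 1 minus the period burn rate."""
    return 1 - burn_rate(period_error_ratio, budget)


if __name__ == "__main__":
    budget = error_budget(OBJECTIVE_PERCENT)
    # 0.006 is a hypothetical 30d error ratio, not a measured value.
    print(round(budget, 6), round(budget_remaining(0.006, budget), 6))
```

In this made-up case a 30d error ratio of 0.006 against a budget of 0.015 is a burn rate of 0.4, which leaves 60% of the budget for the period.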

Updated the wikifunctions sloth pilot SLO to enable low-priority "ticket" alerting:

alerting:
  name: SlothPilotSLOBudgetBurn
  labels:
    notes: "test please ignore"
  annotations:
    runbook: "https://phabricator.wikimedia.org/T404171"
  page_alert:
    disable: true
  ticket_alert:
    labels:
      severity: warning
      team: o11y

Which yields the below rule:

name: SlothPilotSLOBudgetBurn
expr: (max without (sloth_window) (slo:sli_error:ratio_rate2h{sloth_id="wikifunctions-wikifunctions-backend-combined",sloth_service="wikifunctions",sloth_slo="wikifunctions-backend-combined"} > (3 * 0.015)) and max without (sloth_window) (slo:sli_error:ratio_rate1d{sloth_id="wikifunctions-wikifunctions-backend-combined",sloth_service="wikifunctions",sloth_slo="wikifunctions-backend-combined"} > (3 * 0.015))) or (max without (sloth_window) (slo:sli_error:ratio_rate6h{sloth_id="wikifunctions-wikifunctions-backend-combined",sloth_service="wikifunctions",sloth_slo="wikifunctions-backend-combined"} > (1 * 0.015)) and max without (sloth_window) (slo:sli_error:ratio_rate3d{sloth_id="wikifunctions-wikifunctions-backend-combined",sloth_service="wikifunctions",sloth_slo="wikifunctions-backend-combined"} > (1 * 0.015)))
labels:
  notes: test please ignore
  recorder: thanos-rule@pilot
  replica: a
  severity: warning
  sloth_severity: ticket
  team: o11y
annotations:
  runbook: https://phabricator.wikimedia.org/T404171
  summary: {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget burn rate is over expected.
  title: (ticket) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget burn rate is too fast.

And it does indeed fire an alert (alerts.wm.o below):

Screenshot 2025-12-08 at 3.04.23 PM.png (412×1 px, 146 KB)
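
The generated ticket rule above pairs a short and a long window at each threshold. A minimal Python sketch of that condition follows. The function name and input values are hypothetical, and it assumes all four series are present, whereas in PromQL a missing series would simply drop out of the match.

```python
ERROR_BUDGET = 0.015  # 1 - 0.985, as in the rule expression above


def ticket_alert_fires(rates, budget=ERROR_BUDGET):
    """Evaluate the two window pairs from the generated ticket alert.

    `rates` maps a window name ("2h", "1d", "6h", "3d") to the value of
    slo:sli_error:ratio_rate<window>.
    """
    fast_pair = rates["2h"] > 3 * budget and rates["1d"] > 3 * budget
    slow_pair = rates["6h"] > 1 * budget and rates["3d"] > 1 * budget
    return fast_pair or slow_pair


if __name__ == "__main__":
    # Hypothetical error ratios, not values read from the pilot.
    short_spike = {"2h": 0.10, "1d": 0.01, "6h": 0.04, "3d": 0.01}
    sustained = {"2h": 0.02, "1d": 0.02, "6h": 0.02, "3d": 0.02}
    print(ticket_alert_fires(short_spike), ticket_alert_fires(sustained))
```

Requiring both windows in a pair to exceed the threshold means a brief spike that has not yet moved the longer window does not fire, while a sustained error ratio above the budget does.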

herron claimed this task.

Closing this as pilot onboarding has finished; wider onboarding will be tracked in the parent task!