
Pyrra calculations for the Initial error budget value of calendar windows
Open, Needs Triage, Public

Description

The Q2 SLO quarter started a few days ago, and I am investigating why a lot of SLOs show their error budgets starting from low values, very far from 100%.

As an example, I am taking the Citoid availability SLO's calendar view: dashboard

The error budget starts at around 40%, and this can indeed confuse a lot of folks who may expect something close to 100%. The same issue happens for the rolling window dashboard, so I inspected the Prometheus/Thanos link to dive deep into how the remaining error budget value is calculated.

At a high level, Pyrra does this: ((1 - SLO target) - error-requests/total-requests) / (1 - SLO target)
In the Citoid's use case, this translates to: ((1-0.995) - error-requests/total-requests) / (1 - 0.995) = (0.005 - error-requests/total-requests) / 0.005

The above makes sense, because we end up having a ratio of remaining error budget over total error budget, and the resulting value is what we are looking for.
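Plugging in the Citoid numbers makes the ~40% starting point easy to reproduce. A quick sanity check of the formula (the ~0.003 error ratio is taken from the description; the helper name is mine, not Pyrra's):

```python
# Pyrra's remaining-error-budget formula:
#   ((1 - target) - error_ratio) / (1 - target)
def remaining_error_budget(target: float, error_ratio: float) -> float:
    budget = 1 - target
    return (budget - error_ratio) / budget

# Citoid: target 0.995, observed error ratio ~0.003 over the 4w window
print(remaining_error_budget(0.995, 0.003))  # → ~0.4, i.e. the ~40% starting value

# A looser target over the same data starts higher
print(remaining_error_budget(0.99, 0.003))   # → ~0.7
```

The second call shows the effect discussed below: the same traffic against a lower target yields a higher starting budget.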

Pyrra uses recording rules, so the more precise formula is: ((1 - SLO target) - increase(error-requests[4w]) / increase(total-requests[4w])) / (1 - SLO target)
The first datapoint showing the starting error budget is calculated over the 4 weeks preceding it, so it is essentially the ratio of error requests to total requests over that window. The tight SLO target of 0.995 implies a very small error budget, 0.005, that is eroded straight away in the first displayed datapoint (since the error/total request ratio is already ~0.003).

The lower the SLO target, the higher the starting error budget. That seems good, but it hides something important: at the start of the quarter in a calendar window (and also in a rolling one, though it is less problematic there), we compute a value that refers to the weeks before, which we shouldn't really count.

Event Timeline

Current doubts:

  1. We could lower the window parameter set in Pyrra from 4w to something like 1d; after that we would be less dependent on the "past" weeks of datapoints. However, due to how increase() is implemented in Prometheus (IIUC, the calculation takes the current value of the counter (now) and subtracts the value at the beginning of the range, like a day or 4 weeks ago), we may not get a correct rendering of the remaining error budget. Every datapoint for the current remaining error budget would be calculated using data from a day ago, rather than from the beginning of the SLO time window (the beginning of the quarter for a calendar window, or the last month for a rolling one, for example).
  2. I didn't find any documentation/suggestion from Pyrra about how to set the time window, nor did I find questions/discussions on GitHub. I'd love to get the developers' opinion on it; if I don't find a good answer I'll cut a GitHub issue and link it to this task.
  3. Is there a better way to do it? Should we take a compromise/approximation instead?
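To illustrate doubt 1, here is a toy model of increase() over a monotonic counter (a sketch of the basic subtraction only; real Prometheus also extrapolates to the window boundaries and handles counter resets). With a 1d range, each datapoint only ever covers the last day, never the period since the quarter started:

```python
# Toy model of increase(counter[w]) evaluated at time t: counter(t) - counter(t - w).
# Real Prometheus extrapolates and handles resets; this keeps only the core idea.
def toy_increase(samples: dict, t: int, window: int) -> float:
    return samples[t] - samples[t - window]

# Cumulative error counter, one sample per day: 10 errors/day for days 0-27,
# then a perfectly clean new quarter starting at day 28 (0 errors/day).
errors = {d: 10 * min(d, 28) for d in range(60)}

# First datapoint of the new quarter (day 28):
print(toy_increase(errors, 28, 28))  # 4w window: 280 errors, all from before the quarter
print(toy_increase(errors, 28, 1))   # 1d window: 10 errors (still yesterday's data)
print(toy_increase(errors, 29, 1))   # a day later: 0 — but it only ever "sees" one day back
```

The 4w window drags the whole previous month into the quarter's first datapoint; the 1d window forgets the past quickly, but also forgets everything older than a day, so it can't represent "since the quarter started" either.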

I checked what Sloth does, and for the availability SLO these are the relevant resources: sloth recording rules.
The increase() function is not used; instead they use rate() with short time ranges. I then checked their Grafana dashboard's JSON; the error budget calculations are:

"description": "This month remaining error budget, starts the 1st of the month and ends  28th-31st (not rolling window)",
[..]
      "targets": [
        {
          "exemplar": false,
          "expr": "1-(\n  sum_over_time(\n    (\n       slo:sli_error:ratio_rate1h{sloth_service=\"${service}\",sloth_slo=\"${slo}\"}\n       * on() group_left() (\n         month() == bool vector(${__to:date:M})\n       )\n    )[32d:1h]\n  )\n  / on(sloth_id)\n  (\n    slo:error_budget:ratio{sloth_service=\"${service}\",sloth_slo=\"${slo}\"} *on() group_left() (24 * days_in_month())\n  )\n)",
          "instant": true,
          "interval": "1h",
          "legendFormat": "Remaining error budget (month)",
          "queryType": "randomWalk",
          "refId": "A"
        }

-----------

"description": "A rolling window of the total period (30d) error budget remaining.",
[..]
      "targets": [
        {
          "exemplar": false,
          "expr": "slo:period_error_budget_remaining:ratio{sloth_service=\"${service}\", sloth_slo=\"${slo}\"} or on() vector(1)",
          "instant": true,
          "interval": "",
          "legendFormat": "Remaining error budget (30d window)",
          "queryType": "randomWalk",
          "refId": "A"
        }

The above configs look like a calendar view and a rolling view, and they use sum_over_time instead. For the second case, these are the relevant recording rules:

- record: slo:period_burn_rate:ratio
  expr: |
    slo:sli_error:ratio_rate30d{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}
    / on(sloth_id, sloth_slo, sloth_service) group_left
    slo:error_budget:ratio{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}
  labels:
    cmd: examplesgen.sh
    owner: myteam
    repo: myorg/myservice
    sloth_id: myservice-requests-availability
    sloth_service: myservice
    sloth_slo: requests-availability
    tier: "2"

- record: slo:period_error_budget_remaining:ratio
  expr: 1 - slo:period_burn_rate:ratio{sloth_id="myservice-requests-availability",
    sloth_service="myservice", sloth_slo="requests-availability"}
  labels:
    cmd: examplesgen.sh
    owner: myteam
    repo: myorg/myservice
    sloth_id: myservice-requests-availability
    sloth_service: myservice
    sloth_slo: requests-availability
    tier: "2"

- record: slo:sli_error:ratio_rate30d
  expr: |
    sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}[30d])
    / ignoring (sloth_window)
    count_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}[30d])
  labels:
    cmd: examplesgen.sh
    owner: myteam
    repo: myorg/myservice
    sloth_id: myservice-requests-availability
    sloth_service: myservice
    sloth_slo: requests-availability
    sloth_window: 30d
    tier: "2"

- record: slo:sli_error:ratio_rate5m
  expr: |
    (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[5m])))
    /
    (sum(rate(http_request_duration_seconds_count{job="myservice"}[5m])))
  labels:
    cmd: examplesgen.sh
    owner: myteam
    repo: myorg/myservice
    sloth_id: myservice-requests-availability
    sloth_service: myservice
    sloth_slo: requests-availability
    sloth_window: 5m
    tier: "2"

There are a lot more rules, but the overall calculation seems a better approximation than Pyrra's (at least at a quick glance).
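Restating Sloth's chain as plain arithmetic (a sketch assuming evenly spaced 5m samples; the real rules use sum_over_time/count_over_time over 30d, which for equally spaced samples is the same as an average):

```python
# Sloth's pipeline, per the recording rules above:
#   ratio_rate5m  -> error ratio over each 5m window
#   ratio_rate30d -> average of the 5m ratios over 30d
#   burn_rate     -> 30d error ratio / error budget
#   remaining     -> 1 - burn_rate
def sloth_remaining(ratio_rate5m_samples: list, error_budget: float) -> float:
    ratio_rate30d = sum(ratio_rate5m_samples) / len(ratio_rate5m_samples)
    period_burn_rate = ratio_rate30d / error_budget
    return 1 - period_burn_rate

# e.g. a steady 0.3% error ratio against a 0.5% budget leaves ~40% of the budget
print(sloth_remaining([0.003] * 100, 0.005))
```

For steady traffic the end result matches Pyrra's formula; the difference is that Sloth averages short-window ratios (so each 5m slice weighs the same) instead of taking one increase() over the whole period (which weighs slices by traffic volume).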

This is an example of what could become confusing for an SLO owner:

Screenshot From 2025-09-19 16-49-04.png

The Tone Check service's SLO measures the proportion of HTTP 200 responses served under 1s, compared to all HTTP 200s. The window in Pyrra was set to 4w, and at first it is not really clear what's happening, due to the increase in the remaining error budget over time. The ML team improved the service a lot during the past weeks: without a GPU, they were not able to stay under the 1s limit for most of the HTTP 200 responses. And this is the effect of the 4w increase() recording rule: the error budget slowly recovers over time, because the service became more stable.

The latency dashboards seem to be the most affected; the availability ones I checked look more in line with what we expect. This is because the data is less sensitive and we haven't registered outages or similar; otherwise the increase() effect would be visible there as well.

After a lot of research, I think I found a good answer for the use case highlighted above.

In my head, every SLO window should theoretically start from 100% of remaining error budget, and then decrease wherever the SLO is not met. Pyrra doesn't support this kind of graph; it does something different: every datapoint can be seen as the remaining error budget at the end of a dynamic window that goes from one month before that datapoint up to it. So it shows a trend, not necessarily starting from 100%. And the calendar window in Grafana is just the same view that we have in the Pyrra UI, but over a longer period of time.
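A reset-to-100% calendar calculation would instead anchor the window at the quarter start. A sketch of what that would look like (hypothetical: this is not something Pyrra's recording rules currently produce):

```python
# Remaining error budget computed only from data since the quarter started.
# errors/total are cumulative counts since the quarter start (hypothetical inputs).
def calendar_remaining(errors: float, total: float, target: float) -> float:
    budget = 1 - target
    if total == 0:
        return 1.0  # nothing served yet: full budget
    return (budget - errors / total) / budget

print(calendar_remaining(0, 0, 0.995))       # quarter start: 100%
print(calendar_remaining(30, 10000, 0.995))  # 0.3% errors so far: ~40% remaining
```

The catch, as described above, is the increase() window: to compute "errors since quarter start" the range would have to grow over the quarter instead of staying fixed at 4w.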

We have some high level things to discuss:

  1. The "rolling" view is something we can probably get accustomed to; we can write docs and work with SLO owners to interpret their graphs, etc. What about the calendar view? Do we want it to reset to 100% at the beginning of the quarter?
  2. The Pyrra calendar view will never be able to correctly display/reset an error budget percentage starting from 100, because of the limitations described in this task (see the 4w increase() in the description, for example). This is not necessarily bad; on the contrary, it makes more sense when looking at the Pyrra UI (they are basically displaying the same data). This is an alternative view, in case in 1) we don't really want the reset-to-100% behavior in the calendar view.
  3. We may want to iterate through various solutions. For example, we could keep Pyrra and make clear in https://wikitech.wikimedia.org/wiki/SLO/Template_instructions/Dashboards_and_alerts that the calendar view has limitations, above all that the first datapoints of the quarter will not be independent of the immediate past. Then we could evaluate other tools like Sloth, try different dashboards, and ask users what they prefer.

I had a chat with Ilias (ML team), and one of their recent use cases is to be able to provide a "Proportion of all requests that return a response within 1000 milliseconds" for an A/B test that will involve the Tone Check service. They'd need to do this using custom fixed windows corresponding to the duration of the tests. I am pretty sure that even with Sloth we wouldn't be able to accommodate this use case, but we could do something like this:

  1. Use the current Pyrra SLO graphs to measure the variation of error budget while the experiment is ongoing.
  2. Create a custom dashboard calculating the proportion that they need via raw Istio metrics (namely, without extra aggregation coming from recording rules).
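For point 2, the proportion could be derived from raw histogram buckets. A sketch of the arithmetic (istio_request_duration_milliseconds is Istio's standard latency histogram, but the exact labels, bucket bounds, and sample values here are hypothetical):

```python
# Proportion of requests answered within a threshold, from Prometheus-style
# cumulative histogram buckets: each "le" bucket counts all requests <= its bound.
def proportion_under(buckets: dict, threshold_ms: float) -> float:
    total = buckets[float("inf")]  # the +Inf bucket counts all requests
    return buckets[threshold_ms] / total

# Hypothetical bucket counts accumulated over the experiment's fixed window
buckets = {100.0: 500.0, 500.0: 800.0, 1000.0: 950.0, float("inf"): 1000.0}
print(proportion_under(buckets, 1000.0))  # 0.95 → 95% of requests within 1s
```

In PromQL terms this would be the le="1000" bucket's increase over the experiment window divided by the +Inf bucket's, which maps directly onto a fixed test duration without any recording-rule window getting in the way.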

We could compare results, see which is more readable, and see whether the rolling SLO dashboard could be useful. I imagine that more and more use cases like this will become part of the standard SLO culture, and it would be great to define which ones shouldn't be. In any case, this could be an example of how we keep testing the Pyrra UI while working on something different and getting feedback about it.

The other side of the coin, namely when there are no previous outages: T395444#11202029

Screenshot From 2025-09-22 17-38-40.png

Screenshot From 2025-09-22 17-38-58.png