The Q2 SLO quarter started a few days ago, and I am investigating why many SLOs show their error budgets starting from low values, far from 100%.
As an example, I am taking the Citoid availability SLO's calendar view: dashboard
The error budget starts at around 40%, which can indeed confuse a lot of folks who may expect something close to 100%. The same issue happens on the rolling window dashboard, so I inspected the Prometheus/Thanos link to dig into how the remaining error budget value is calculated.
At a high level, Pyrra does this: ((1 - SLO target) - error-requests/total-requests) / (1 - SLO target)
In Citoid's case, this translates to: ((1 - 0.995) - error-requests/total-requests) / (1 - 0.995) = (0.005 - error-requests/total-requests) / 0.005
The above makes sense: we end up with a ratio of remaining error budget over total error budget, and the resulting value is what we are looking for.
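To make the Citoid numbers concrete, here is a minimal sketch of the formula above, assuming the ~0.003 error ratio observed on the dashboard (the function name is mine, for illustration only):

```python
# Sketch of Pyrra's remaining-error-budget computation:
# ((1 - target) - error_ratio) / (1 - target)

def remaining_error_budget(target: float, error_ratio: float) -> float:
    """Fraction of the error budget still available (1.0 == 100%)."""
    budget = 1 - target  # total error budget, e.g. 0.005 for a 0.995 target
    return (budget - error_ratio) / budget

# Citoid: 0.995 target, ~0.003 observed error ratio over the last 4 weeks.
print(remaining_error_budget(0.995, 0.003))  # → ~0.4, i.e. the ~40% shown
```

This reproduces the ~40% starting value seen on the calendar view.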
Pyrra uses recording rules, so the more precise formula is: ((1 - SLO target) - increase(error-requests[4w]) / increase(total-requests[4w])) / (1 - SLO target)
The first datapoint, which shows the starting error budget value, is calculated over the window from 4 weeks ago until now, so it is basically the ratio between the error requests that happened in the past 4 weeks and the total requests during the same time frame. The tight SLO target of 0.995 implies a very low error budget, 0.005, which is eroded right away in the first datapoint displayed (since the error/total request ratio ends up at ~0.003).
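A toy timeline, under entirely hypothetical request counts, may help show why the first datapoint of the quarter is dominated by pre-quarter traffic: the 4-week lookback of increase(...[4w]) reaches back before the quarter started.

```python
# Toy model: the first datapoint of the quarter uses a 4-week lookback,
# so errors from the weeks *before* the quarter start are still counted.
# Day 0 is the quarter's first datapoint; days -28..-1 are pre-quarter.
# Request/error counts below are made up to yield a ~0.003 error ratio.

WINDOW_DAYS = 28
REQUESTS_PER_DAY = 1000

# Hypothetical: every error happened before the quarter started.
errors_by_day = {day: 3 for day in range(-28, 0)}

def error_ratio_at(day: int) -> float:
    """Error/total ratio over the 4-week window ending at `day`."""
    window = range(day - WINDOW_DAYS, day)
    errors = sum(errors_by_day.get(d, 0) for d in window)
    total = REQUESTS_PER_DAY * WINDOW_DAYS
    return errors / total

# The window for day 0 covers days -28..-1: all pre-quarter.
print(error_ratio_at(0))  # → 0.003, despite zero errors in the quarter
```

Even with a flawless quarter so far, the first datapoint reports the pre-quarter error ratio in full.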
The lower the SLO target, the higher the starting error budget. That seems good, but in reality it hides something important: at the start of the quarter in a calendar window (and also in a rolling one, although it is less problematic there), we compute a value that refers to weeks before the quarter began, something that we shouldn't really count.
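To sketch how the target choice masks the issue, here is the same budget formula evaluated at a few illustrative targets (only 0.995 is Citoid's real target; the others, and the fixed ~0.003 error ratio, are assumptions for comparison):

```python
# With a fixed pre-quarter error ratio, a looser target makes the starting
# budget look healthier, even though the eroded absolute ratio is identical.

def remaining_error_budget(target: float, error_ratio: float) -> float:
    budget = 1 - target
    return (budget - error_ratio) / budget

for target in (0.995, 0.99, 0.95):  # 0.995 is real; others illustrative
    pct = remaining_error_budget(target, 0.003) * 100
    print(f"target={target}: starting budget ~{pct:.0f}%")
```

The same ~0.003 of pre-quarter errors yields starting budgets of roughly 40%, 70%, and 94% respectively, so a generous target simply hides the fact that pre-quarter data is being counted at all.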


