Sloth provides a monthly dashboard view by default; we'll attempt to extend this into a quarterly view.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T403729 Pyrra calculations for the Initial error budget value of calendar windows |
| Open | | None | T404171 Evaluate Sloth as a possible replacement for Pyrra |
| Resolved | | herron | T409312 Sloth: adapt default month view to quarter view (pilot) |
Event Timeline
Offhand, the Sloth detail dashboard's "month error budget burn chart" panel uses the Grafana built-in "relative time" and "time shift" options to pin the panel to the current month.
I'm not aware of a native quarterly variant in Grafana so we'll have to sort out a reliable workaround for this.
This is the current query for a month:
1-(
sum_over_time(
(
slo:sli_error:ratio_rate1h{sloth_service="${service}",sloth_slo="${slo}"}
* on() group_left() (
month() == bool vector(${__to:date:M})
)
)[32d:1h]
)
/ on(sloth_id)
(
slo:error_budget:ratio{sloth_service="${service}",sloth_slo="${slo}"} *on() group_left() (24 * days_in_month())
)
)
The horror query for the 3-month quarterly window would be:
1-(
sum_over_time(
(
slo:sli_error:ratio_rate1h{sloth_service="${service}",sloth_slo="${slo}"}
* on() group_left() (
(month() >= bool vector(${__from:date:M})) and
(month() <= bool vector(${__to:date:M} ))
)
)[94d:1h]
)
/ on(sloth_id)
(
slo:error_budget:ratio{sloth_service="${service}",sloth_slo="${slo}"}
*on() group_left()
(24 * (days_in_month(time() - 86400*62) + days_in_month(time() - 86400*31) + days_in_month()))
)
)
The slo:error_budget:ratio recording rule for edit-check is simply vector(1-0.99), so nothing dramatic.
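To make the arithmetic behind both queries easier to follow, here is a minimal Python sketch of what they compute: one minus the sum of the hourly error ratios, divided by the error budget times the number of hours in the window. The function name and the sample data are made up for illustration; only the 0.99 objective comes from the recording rule above.

```python
def budget_remaining(hourly_error_ratios, error_budget, window_hours):
    """Fraction of the error budget left for the window.

    Mirrors 1 - sum_over_time(errors) / (budget * hours), where each
    hourly sample is an error ratio between 0 and 1.
    """
    burned = sum(hourly_error_ratios) / (error_budget * window_hours)
    return 1 - burned


# Hypothetical data: a 91-day window with a 99% objective.
window_hours = 24 * 91
error_budget = 1 - 0.99

# A service failing 0.5% of requests every hour burns half of a 1% budget.
steady = [0.005] * window_hours
print(round(budget_remaining(steady, error_budget, window_hours), 6))
```

If the summed error ratios exceed the budget, the result goes negative, which is why the denominator (the number of hours in the window) has to match the window that is actually summed.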
The bit to work on would be:
(24 * (days_in_month(time() - 86400*62) + days_in_month(time() - 86400*31) + days_in_month()))
Because this one kind of assumes that the 3-month window ends on the current day, whereas we probably need to calculate the values from the selected time range, using something like:
24 * (days_in_month(__to:date) + days_in_month(__from:date) + days_in_month(__to:date - 86400*40))
Basically: to get the number of hours to use in the denominator, sum the number of days in the various months.
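As a sanity check of that denominator, the sketch below sums the days of every calendar month touched by a fixed range, and compares it with the rolling "now minus 31/62 days" variant. It is plain Python rather than PromQL, and the function names are hypothetical. Evaluated on 1 March 2025, the rolling variant lands on December, January and March (93 days) and skips February entirely, whereas the January to March quarter has 90 days.

```python
import calendar
from datetime import date, timedelta


def hours_in_months(start: date, end: date) -> int:
    """Hours in every calendar month touched by [start, end], inclusive."""
    total_days = 0
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        total_days += calendar.monthrange(year, month)[1]
        # Advance to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return 24 * total_days


def rolling_days(today: date) -> int:
    """Mimic days_in_month(now - 62d) + days_in_month(now - 31d) + days_in_month(now)."""
    return sum(
        calendar.monthrange(d.year, d.month)[1]
        for d in (today - timedelta(days=62), today - timedelta(days=31), today)
    )


print(hours_in_months(date(2025, 9, 1), date(2025, 11, 30)))  # Sep + Oct + Nov
print(rolling_days(date(2025, 11, 30)), rolling_days(date(2025, 3, 1)))
```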
New version of the two queries for the quarterly sloth panel:
1-(
sum_over_time(
(
slo:sli_error:ratio_rate1h{sloth_service="${service}",sloth_slo="${slo}"}
* on() group_left() (
(month() >= bool vector(${__from:date:M})) and
(month() <= bool vector(${__to:date:M} ))
)
)[94d:1h]
)
/ on(sloth_id)
(
slo:error_budget:ratio{sloth_service="${service}",sloth_slo="${slo}"}
*on() group_left()
(
24 * (
days_in_month(vector(${__from:date:seconds})) + days_in_month(vector(${__from:date:seconds}) + 86400 * 40) + days_in_month(vector(${__to:date:seconds}))
)
)
)
)
1 - sum_over_time(
(
(1 / ((days_in_month(vector(${__from:date:seconds})) + days_in_month(vector(${__from:date:seconds}) + 86400 * 40) + days_in_month(vector(${__to:date:seconds}))) * 24)) *
(month() >= bool vector(${__from:date:M})) and
(month() <= bool vector(${__to:date:M} ))
)[94d:1h]
)
Not fully working yet; I'll keep working on it. It needs to be applied with the time picker set from Sep 1st to Nov 30th (or any other similar range).
One thing that I cannot solve is that vector(${__to:date:seconds}) returns a Unix timestamp for Mon Dec 1 12:59:59 AM CET 2025, and 12 when selecting the month (${__to:date:M}), while the time picker in Grafana is set from Sep 1st to Nov 30th (the same happens in Grafana Explore). I have no idea if I am missing something stupid or not.
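One possible explanation, not confirmed anywhere in this thread, is a time zone mismatch: the same instant is still November 30th in UTC but already December 1st in CET. The sketch below only demonstrates that effect in Python. It assumes the picker range ends at 23:59:59 UTC and uses a fixed UTC+1 offset for CET; whether Grafana renders the month in the browser's zone is an assumption to verify.

```python
from datetime import datetime, timedelta, timezone

CET = timezone(timedelta(hours=1), "CET")  # fixed offset, ignoring DST

# Assumed end of the picker range: the last second of Nov 30th in UTC.
picker_end_utc = datetime(2025, 11, 30, 23, 59, 59, tzinfo=timezone.utc)
epoch_seconds = int(picker_end_utc.timestamp())

# Render the very same instant in both zones.
as_utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
as_cet = datetime.fromtimestamp(epoch_seconds, tz=CET)

print(as_utc.isoformat(), "month:", as_utc.month)
print(as_cet.isoformat(), "month:", as_cet.month)
```

If this is indeed the cause, comparing timestamps instead of month numbers would sidestep the problem, since the instant itself is unambiguous.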
I had to use sum without(recorder) since the backfill process for edit-check caused another label to be added, ending up in errors while evaluating the group_left() (many-to-many relationship).
1-(
sum_over_time(
(
sum without(recorder) (slo:sli_error:ratio_rate1h{sloth_service="${service}",sloth_slo="${slo}"})
* on() group_left() (
(month() >= bool vector(${__from:date:M})) and
(month() <= bool vector(${__to:date:M} ))
)
)[94d:1h]
)
/ on(sloth_id)
(
sum without(recorder) (slo:error_budget:ratio{sloth_service="${service}",sloth_slo="${slo}"})
*on() group_left()
(
24 * (
days_in_month(vector(${__from:date:seconds})) + days_in_month(vector(${__from:date:seconds}) + 86400 * 40) + days_in_month(vector(${__to:date:seconds}))
)
)
)
)
It still doesn't work with edit-check, since it shows data only for the first days of September (using a fixed Sept->Nov time window in Grafana).
@tappof and I spent quite a bit of time today trying to debug the above problem, namely that the graph showed only some days in September and nothing more. The issue seemed to be the sum_over_time applied to:
sum without(recorder) (slo:sli_error:ratio_rate1h{sloth_service="${service}",sloth_slo="${slo}"})
* on() group_left() (
(month() >= bool vector(${__from:date:M})) and
(month() <= bool vector(${__to:date:M} ))
)
Tiziano then figured out that the issue lay in the use of sum without(recorder): probably because of internal optimizations, it doesn't work well with sum_over_time. As a refresher, I had added the extra sum without(recorder) to prevent backfilled data from causing a many-to-many relationship error when evaluating the group_left(). Tiziano replaced it with max() or vector(0), and the results can be seen in the Grafana dashboard.
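For anyone unfamiliar with the label problem, here is a toy Python model of what aggregating away the recorder label does. It is not Prometheus code and says nothing about the sum_over_time interaction; the helper name, the sloth_id value and the sample values are all hypothetical. It only shows that two series differing solely by recorder collapse into one once that label is dropped, and that max keeps a single value rather than adding the two.

```python
from collections import defaultdict


def aggregate_without(series, drop_label, agg):
    """Group series by all labels except drop_label, then aggregate values."""
    groups = defaultdict(list)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != drop_label))
        groups[key].append(value)
    return {key: agg(values) for key, values in groups.items()}


# Hypothetical samples: one backfilled series and one live series.
series = [
    ({"sloth_id": "edit-check-availability", "recorder": "backfill"}, 0.002),
    ({"sloth_id": "edit-check-availability"}, 0.003),
]

collapsed = aggregate_without(series, "recorder", max)
print(len(series), "series before,", len(collapsed), "after")
print(collapsed)
```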
As far as I can see, we have a first working (or at least apparently consistent) way of displaying the error budget trend over a quarter! I updated the dashboard to use a fixed time range and tried to adjust other panels. Not everything is consistent or works yet, but I think this is a great first step.
We'll need to backfill other SLOs to be able to verify that everything works as expected.
Just updated the dashboard: https://grafana.wikimedia.org/goto/PzmXbiWvg?orgId=1
Quarter Error Budget Burn Rate:
- Use timestamps (which always increase and never reset) to define the time range instead of start/end months;
- The graph shows the calendar burn rate for the selected window; choosing a quarter-long window gives the "quarter error budget burn rate";
- Removed the "Time Shift" and "Relative Time" options from the query settings, as the query now relies entirely on the main time picker;
- Set the dashboard’s default time range to the current quarter.
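To illustrate why timestamps are a safer filter than month numbers, here is a small Python sketch (hypothetical function names, not the dashboard query itself). Month numbers reset every January, so a window that crosses the year boundary can never satisfy "month >= start and month <= end", while a timestamp comparison keeps working.

```python
from datetime import datetime, timezone

UTC = timezone.utc


def in_window_by_month(ts, start, end):
    """Month-number filter, as in the earlier month() >= / <= queries."""
    return start.month <= ts.month <= end.month


def in_window_by_timestamp(ts, start, end):
    """Timestamp filter: epoch seconds only ever increase."""
    return start.timestamp() <= ts.timestamp() <= end.timestamp()


# A window crossing the year boundary: Nov 1st 2025 to Jan 31st 2026.
start = datetime(2025, 11, 1, tzinfo=UTC)
end = datetime(2026, 1, 31, 23, 59, 59, tzinfo=UTC)
sample = datetime(2025, 12, 15, tzinfo=UTC)

print(in_window_by_month(sample, start, end))      # 11 <= 12 <= 1 fails
print(in_window_by_timestamp(sample, start, end))
```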
@herron this task should be good for the pilot's goals, in my opinion. We may need to tune it a little further if we decide to use Sloth, but I wouldn't spend a ton of time on it in Q2. Lemme know!
Made a couple more adjustments to https://grafana.wikimedia.org/d/slot-pilot-slo-detail/sloth-s-l-o-detail to clean up the rolling window portion:
* Updated fiscal year start month to July
* Rolling window:
  * Updated panel options to display the past 30 days leading up to "now" instead of the current month
  * Set clamp_min to display negative values as "0% budget remaining" instead of missing values (min value 0 displayed in panel)
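A rough Python sketch of the two adjustments, for reference: mapping a date to its fiscal quarter when the fiscal year starts in July, and clamping a negative "budget remaining" value to zero the way clamp_min does. The helper names are hypothetical and this is not how the dashboard itself is implemented.

```python
from datetime import date

FISCAL_YEAR_START_MONTH = 7  # July, per the dashboard setting above


def fiscal_quarter(day: date, start_month: int = FISCAL_YEAR_START_MONTH) -> int:
    """Quarter number (1 to 4) counted from the fiscal year start month."""
    return ((day.month - start_month) % 12) // 3 + 1


def clamp_min(value: float, floor: float = 0.0) -> float:
    """Raise any value below the floor up to the floor."""
    return max(value, floor)


for d in (date(2025, 7, 1), date(2025, 10, 15), date(2026, 1, 31), date(2026, 6, 30)):
    print(d.isoformat(), "-> fiscal Q%d" % fiscal_quarter(d))

# An overspent budget shows as 0% remaining instead of a negative value.
print(clamp_min(-0.25), clamp_min(0.4))
```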
I'm not sure if the rolling window would remain on the quarterly view in the long term but looks good for the pilot IMO
