Hardcode the SLO time windows in Grafana dashboards generated via Grizzly
Closed, Resolved (Public)

Description

This is a very nice chat that I had with @RLazarus and @herron over IRC about SLO dashboards (created via Grizzly) and their time ranges:

<elukey> 1) IIRC the time window for the SLO calculation is every quarter, meanwhile in the dashboards we have "last 90d". What is the correct 
         procedure? Setting the right time interval every beginning of a new quarter?
<elukey> like: first and last day of the quarter hardcoded in the top-right time setting in grafana I mean
[..]
<herron> re: 1) I'll defer to rzl for the current reporting approach but I think what is done is essentially what you describe but offset by -1 month
[..]
<rzl> right - we evaluate them on calendar-quarters-offset-by-a-month, e.g. June 1 to August 31, so the dashboard aligns with that only when the 
      date picker is set to e.g. 2023-06-01 00:00:00 to 2023-08-31 23:59:59, which presently you have to just put in by hand
<rzl> and, also right, the range should properly be 90d, 91d, or 92d depending on which quarter it is -- for the "official" SLO review at the end of 
      the quarter I do tweak it to the correct length, but for the dashboards we had to pick one, which isn't ideal
<rzl> my understanding is all of that will improve with Pyrra but I haven't actually synced with herron about that in a while :)
<herron> yes! we should, it got a bit of a setback with the cfssl thanos-fe issue but yeah would be good to talk about the path forward on it
<rzl> elukey: does that help? sorry about the warts in the meantime but your understanding is right
<elukey> rzl: yes definitely it helps! What I am trying to wrap my head around, at the moment, is how to use and review the SLO dashboards during the 
         quarter
<elukey> say that we are a month in, and I'd like to see how error budgets are doing etc...
<elukey> IIUC, in theory, I can set the correct dates in the time window and see data coming in as the week passes, right?
<elukey> (not 100% sure about the 90d range though, but with the correct time window it should pick end - 90d right?)
<elukey> if I am asking weird nonsense questions, it is because I am trying to dump my understanding and figure out what parts I am missing :)
<rzl> no, 100% the right question
<jbond> hi all, I came across this the other day and thought it might be interesting to the folks here https://github.com/ddosify/alaz
<rzl> during the quarter, you should use the SLO quarter that's in progress -- so the end date will be in the future
<rzl> right now, you'd set it to 2023-09-01 00:00:00 to 2023-11-30 23:59:59, so everything from the June-August period is no longer counted, and 
      there's only a week's worth of data
<elukey> rzl: ack, makes sense - just a question though - Sept IIRC is in Q1, so I'd have set 2023-07-01 -> 2023-09-30
<rzl> for SLOs we actually use different dates, offset earlier by a month! the brief reason is, when we have SLO excursions and need to do 
      engineering work to fix them, we still have time to fit it into the next quarter's goals
<elukey> ah right totally forgot about that
<elukey> okok now it makes sense
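
To make the date arithmetic from the chat concrete, here is a minimal Python sketch that computes the SLO quarter containing a given day. It assumes SLO quarters start on December 1, March 1, June 1, and September 1: only the June-August and September-November windows are named in the chat, and the other two are inferred from the "offset by a month" rule. The function name `slo_quarter` is hypothetical and is not taken from the grafana-grizzly repository.

```python
from datetime import date, timedelta


def slo_quarter(day: date) -> tuple[date, date]:
    """Return (first_day, last_day) of the SLO quarter containing `day`.

    Assumption: SLO quarters start on Dec 1, Mar 1, Jun 1 and Sep 1,
    i.e. calendar quarters shifted one month earlier.
    """
    # Quarter start months (3, 6, 9, 12) are all multiples of 3, so
    # `month % 3` is the number of months elapsed since the quarter began.
    start_month = day.month - (day.month % 3)
    start_year = day.year
    if start_month == 0:
        # January and February belong to the quarter that began last December.
        start_month, start_year = 12, day.year - 1
    start = date(start_year, start_month, 1)

    # The first day of the following quarter is three months later.
    next_month = start_month % 12 + 3
    next_year = start_year + 1 if start_month == 12 else start_year
    end = date(next_year, next_month, 1) - timedelta(days=1)
    return start, end


start, end = slo_quarter(date(2023, 9, 8))
print(start, end, (end - start).days + 1)  # 2023-09-01 2023-11-30 91
```

Under this assumption the window length varies between 90 and 92 days depending on the quarter (and on leap years), which matches the point made in the chat about a fixed "last 90d" being only an approximation.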

My proposal is https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/956878: we hardcode the right time ranges and regenerate all the dashboards once every 3 months with a single commit, rather than forcing people to update the dashboards manually.
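
As a rough illustration of what hardcoding the window means at the dashboard level: Grafana's dashboard JSON carries a `time` object with `from` and `to` fields, which sets the default range of the time picker. The Python sketch below builds such an object from two dates. It is not the code from the Gerrit change, which may well look different; the helper name `dashboard_time_range`, the dashboard title, and the choice of UTC timestamps are all assumptions made for the example.

```python
import json
from datetime import date


def dashboard_time_range(first_day: date, last_day: date) -> dict:
    """Build a Grafana-style `time` object covering whole days (UTC assumed)."""
    return {
        "from": f"{first_day.isoformat()}T00:00:00.000Z",
        "to": f"{last_day.isoformat()}T23:59:59.000Z",
    }


# Hypothetical dashboard stub; a real dashboard has many more fields.
dashboard = {
    "title": "Example SLO dashboard (hypothetical)",
    "time": dashboard_time_range(date(2023, 9, 1), date(2023, 11, 30)),
}
print(json.dumps(dashboard, indent=2))
```

With this approach the quarterly update reduces to changing two dates in one place and regenerating the dashboards.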

Please let me know your thoughts / doubts / suggestions / etc. :)

Event Timeline

Change 956878 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/grafana-grizzly@master] slo_template: hardcode time window for SLO dashboards

https://gerrit.wikimedia.org/r/956878

+1 for trying this. Thinking out loud:

  1. With something like this in place, should we worry about an alternate workflow to inspect/review a previous quarter's SLO dashboard? Or would manually "make editable" and adjust when needed be good enough?
  2. Since mostly empty panels (when rolling over to a new time window) might be understood as broken/missing data, let's include information in the dashboard header to help clarify what is being displayed

I am very in favor of this scheme.

As for implementation, maybe one option would be to have some automated change filed and a random_choice(relevant_people) reviewer/merger? I am not sure how much infra WMF already has for bot-filed Changes on Gerrit (and eventually GitLab).

This sounds right to me -- thanks @elukey for getting it rolling. Early on, we had talked about autogenerating links for different calendar quarters and adding them to the text panel on top, but my recollection is we decided to spend that energy on Pyrra instead.

  1. With something like this in place, should we worry about an alternate workflow to inspect/review a previous quarter's SLO dashboard? Or would manually "make editable" and adjust when needed be good enough?

I think it wouldn't even need to be "make editable": this only changes the default range for the time picker, right? So you can still use the time picker to punch in the dates of other quarters. Not as clean as having a list to pick from, but no worse than it is now.
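
For anyone who wants to jump to another quarter without typing the dates into the picker, Grafana also accepts `from` and `to` URL query parameters as epoch milliseconds. Here is a minimal Python sketch of building such a link; the base URL and the function name `quarter_link` are hypothetical placeholders, and UTC boundaries are assumed.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def quarter_link(base_url: str, start: datetime, end: datetime) -> str:
    """Return a dashboard URL whose time range is pinned to [start, end]."""
    params = {
        # Grafana reads absolute times in the URL as epoch milliseconds.
        "from": int(start.timestamp() * 1000),
        "to": int(end.timestamp() * 1000),
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical dashboard URL, used only for illustration.
link = quarter_link(
    "https://grafana.example.org/d/abc123/slo-dashboard",
    datetime(2023, 6, 1, 0, 0, 0, tzinfo=timezone.utc),
    datetime(2023, 8, 31, 23, 59, 59, tzinfo=timezone.utc),
)
print(link)
```

A handful of such links in the top text panel would give the "list to pick from" mentioned above, at the cost of having to regenerate them along with the dashboards.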

+1 for trying this. Thinking out loud:

  1. With something like this in place, should we worry about an alternate workflow to inspect/review a previous quarter's SLO dashboard? Or would manually "make editable" and adjust when needed be good enough?
  2. Since mostly empty panels (when rolling over to a new time window) might be understood as broken/missing data, let's include information in the dashboard header to help clarify what is being displayed

For both use cases I added some words to the top text panel; an example is in https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/956878

@herron lemme know your thoughts!

@RLazarus thanks for the review!

I think it wouldn't even need to be "make editable": this only changes the default range for the time picker, right? So you can still use the time picker to punch in the dates of other quarters. Not as clean as having a list to pick from, but no worse than it is now.

Ah! Yes you are right, this will be more straightforward than I had thought, great!

Change 956878 merged by Elukey:

[operations/grafana-grizzly@master] slo_template: hardcode time window for SLO dashboards

https://gerrit.wikimedia.org/r/956878

elukey claimed this task.

Change merged! Thanks to all for the feedback :)

I'd say that we can close this, and follow up if anything new comes up. Please re-open if you don't agree :)