Since the Traffic team started reviewing the Varnish SLO dashboard on a regular cadence, they've come back with feedback on how to improve the dashboard template for that use case. For reference, their dashboard is https://grafana.wikimedia.org/d/uIGz8Ak7k/varnish-slo
So far the audience has mostly been people from the SLO working group, who already have SLOs on the brain. This is our first example of a process where SREs not focused on SLO development are using the dashboard in the course of their work, so the feedback is really valuable: we'd eventually like to get all SRE teams into a process like this.
I'm reproducing @BBlack's comments below, verbatim:
All of these are basically clarity issues - we have a hard time parsing (or can imagine others having a hard time parsing - say upper mgmt or random new SREs not up to speed on the minutiae of WMF SLOs) the meaning of the information, and it can probably be fixed with more labels/information/etc around the data. There might be data issues too, but it's hard to tell when we don't clearly understand the data we're seeing.
- RE: First three panels "Request SLO", "Request SLO Error Budget Usage", and "Request SLO Error Budget Burndown (Budget Remaining)":
  - Don't seem to change with the time-picker very much (but sometimes do?), so I assume they always represent the reporting SLO-quarter? If so, can we get a label on them which makes it clear the time range they're reporting based on?
  - The second and third panels *seem* to just be the inverse of each other (X = 100-Y) and redundant. Is that true? Should we eliminate one or the other?
  - Is the first panel ("Request SLO") actually the average SLO so far in the period, or? Is it a percentage of success at meeting the SLO over the ... whole period, period-so-far? It's unclear what this really means to a casual(--ish) observer. Is this value "the averaged SLI over the whole period thus far, which should ideally be 99.9 or higher as stated by the SLO?" or something more like "The percentage of our success at meeting the 99.9% SLO objective?" (which would be more like two percentages multiplied?).
- Fourth panel - Budget Burndown over time-picker period. What does this mean? It's already lower than the SLO numerically, yet we're still green. Is it that when this reaches zero, we are no longer capable of meeting the SLO for the whole reporting quarter? Does it predict the future as being like the past, or does it look at the future through the rose-colored glass of potential perfection? (e.g. "as long as this is non-zero, you could make the quarter's target, but if the budget burndown number is very small, it might require near-perfection for the rest of the period to make it").
- Last two panels - "Request SLI" - I assume this is "instantaneous" SLI over small time windows. The graph below it is "Request SLO [not I] Error Rate", but seems to just be the inverse of "Request SLI"? Is it actually the SLO, or the SLI inverse, or?
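(The following is not part of the quoted comments.) As a reference point for the questions above, here is a minimal sketch of the conventional error-budget arithmetic, in Python. It is an illustration only and does not describe how the Grafana panels actually compute their values. The function name `error_budget_summary` and the request counts are hypothetical; the 99.9% target is the one mentioned in the comments.

```python
def error_budget_summary(good_requests: int, total_requests: int,
                         slo_target_pct: float) -> dict:
    """Summarize one reporting period from raw request counts.

    Hypothetical helper: uses the textbook error-budget definitions,
    which may differ from what the dashboard queries really do.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target_pct < 100.0:
        raise ValueError("slo_target_pct must be strictly between 0 and 100")

    bad_requests = total_requests - good_requests

    # SLI: the measured percentage of good requests over the period so far.
    sli_pct = 100.0 * good_requests / total_requests

    # Error budget: how many bad requests the target tolerates at this volume.
    allowed_bad = total_requests * (100.0 - slo_target_pct) / 100.0

    # Budget used and budget remaining are complements by construction.
    budget_used_pct = 100.0 * bad_requests / allowed_bad
    budget_remaining_pct = 100.0 - budget_used_pct

    return {
        "sli_pct": sli_pct,
        "budget_used_pct": budget_used_pct,
        "budget_remaining_pct": budget_remaining_pct,
    }


# Made-up numbers: 1,000,000 requests so far, 400 of them bad.
summary = error_budget_summary(999_600, 1_000_000, slo_target_pct=99.9)
print(f"SLI: {summary['sli_pct']:.2f}%")
print(f"Budget used: {summary['budget_used_pct']:.1f}%")
print(f"Budget remaining: {summary['budget_remaining_pct']:.1f}%")
```

Under these definitions the SLI (about 99.96% here) is what gets compared against the 99.9% target, while budget used (about 40%) and budget remaining (about 60%) are on a separate 0-100 scale and are exact complements of each other. If the dashboard follows the same definitions, that would explain why a remaining-budget figure can sit below 99.9 and still be green, and why the second and third panels look redundant; whether it does is something the panel queries would need to confirm.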
In addition, there was one other point that came up in the SLO meeting about this a few months back:
- There are multiple clusters, so the dashboard doesn’t give you a single-page view of everything -- as someone running a regular review meeting, you probably want an overview that gives you all the SLOs x all the clusters, without having to click through.
O11y folks, handing this off to you for triage -- thanks for taking a look! Let me know if I can help -- for questions about the feedback I'd refer you to @BBlack and/or @mark but I'm happy to support any way I can.