SLO dashboard refinements
Closed, Resolved (Public)

Description

Since the Traffic team started reviewing the Varnish SLO dashboard on a regular cadence, they've come back with feedback on how to improve the dashboard template for that use case. For reference, their dashboard is https://grafana.wikimedia.org/d/uIGz8Ak7k/varnish-slo

So far the audience has mostly been people from the SLO working group, who already have SLOs on the brain. This is our first example of a process where SREs not focused on SLO development use the dashboard in the course of their work, so the feedback is really valuable: we'd eventually like to get all SRE teams into a process like this.

I'm reproducing @BBlack's comments below, verbatim:

All of these are basically clarity issues - we have a hard time parsing (or can imagine others having a hard time parsing - say upper mgmt or random new SREs not up to speed on the minutiae of WMF SLOs) the meaning of the information, and it can probably be fixed with more labels/information/etc around the data. There might be data issues too, but it's hard to tell when we don't clearly understand the data we're seeing.

  • RE: First three panels "Request SLO", "Request SLO Error Budget Usage", and "Request SLO Error Budget Burndown (Budget Remaining)":
    • Don't seem to change with the time-picker very much (but sometimes do?), so I assume they always represent the reporting SLO-quarter? If so, can we get a label on them which makes it clear the time range they're reporting based on?
    • The second and third panels *seem* to just be the inverse of each other (X = 100-Y) and redundant. Is that true? Should we eliminate one or the other?
    • Is the first panel ("Request SLO") actually the average SLO so far in the period, or? Is it a percentage of success at meeting the SLO over the ... whole period, period-so-far? It's unclear what this really means to a casual(--ish) observer. Is this value "the averaged SLI over the whole period thus far, which should ideally be 99.9 or higher as stated by the SLO?" or something more like "The percentage of our success at meeting the 99.9% SLO objective?" (which would be more like two percentages multiplied?).
  • Fourth panel - Budget Burndown over time-picker period. What does this mean? It's already lower than the SLO numerically, yet we're still green. Is it that when this reaches zero, we are no longer capable of meeting the SLO for the whole reporting quarter? Does it predict the future as being like the past, or does it look at the future through the rose-colored glass of potential perfection? (e.g. "as long as this is non-zero, you could make the quarter's target, but if the budget burndown number is very small, it might require near-perfection for the rest of the period to make it").
  • Last two panels - "Request SLI" - I assume this is "instantaneous" SLI over small time windows. The graph below it is "Request SLO [not I] Error Rate", but seems to just be the inverse of "Request SLI"? Is it actually the SLO, or the SLI inverse, or?

In addition, there was one other point that came up in the SLO meeting about this a few months back:

  • There are multiple clusters, so the dashboard doesn’t give you a single-page view of everything -- as someone running a regular review meeting, you probably want an overview that gives you all the SLOs x all the clusters, without having to click through.

O11y folks, handing this off to you for triage -- thanks for taking a look! Let me know if I can help -- for questions about the feedback I'd refer you to @BBlack and/or @mark but I'm happy to support any way I can.

Event Timeline

lmata triaged this task as Medium priority.

Hi @RLazarus,

Will discuss with @herron and address the feedback with any notes. Thanks!

To take a step back, the varnish slo dashboard linked in the description didn't actually originate from a template. Presumably this one was a manual fork of the original etcd slo example dashboard that's been manually adjusted.

I think many of the points in the description are addressed by the template (and can of course be tuned further), so let's use the templated dashboard as the baseline for clarification and improvements. Here's that link for Varnish: https://grafana-rw.wikimedia.org/d/slo-varnish-tmpl/varnish-slos-grizzly-template

FWIW templated dashboards will be tagged as 'SLO' and 'Grizzly' e.g. https://grafana-rw.wikimedia.org/?orgId=1&search=open&query=&tag=SLO

In the case of the varnish-slo grizzly/grafonnet dashboard:

  • A Varnish latency SLO has not yet been defined afaik, these panels show 'no data' currently
  • Varnish request SLO queries don't currently utilize the site variable (we can of course change this), so it's an aggregate across all sites, which I think relates to this point below

There are multiple clusters, so the dashboard doesn’t give you a single-page view of everything -- as someone running a regular review meeting, you probably want an overview that gives you all the SLOs x all the clusters, without having to click through.

  • Currently the template has a pull down and shows one cluster at a time, but we can update this as desired (in addition to the bit about $site above). Should we display the status for all clusters/sites by default, and provide the option to filter more specifically? What would be the most clear there?
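
As a rough illustration of wiring the site variable into the request SLO query, here is a small Python sketch that builds a per-site ratio expression. The counter names come from this task; the `site` label name, the `5m` window, and the helper `sli_ratio_query` itself are assumptions made for illustration, not the actual template code.

```python
def sli_ratio_query(window: str = "5m", site_var: str = "$site") -> str:
    """Build a PromQL expression for the good/all request ratio, per site.

    Assumptions (not confirmed by the task): the counters carry a label
    named `site`, and the dashboard variable is a multi-value variable,
    which is why a regex matcher (=~) is used instead of equality.
    """
    selector = f'{{site=~"{site_var}"}}'
    good = f"sum by (site) (rate(varnish_sli_good{selector}[{window}]))"
    total = f"sum by (site) (rate(varnish_sli_all{selector}[{window}]))"
    # Dividing two vectors grouped by the same label yields one series per site.
    return f"{good} / {total}"


if __name__ == "__main__":
    print(sli_ratio_query())
```

Grouping by site keeps one series per site, so an "All" selection would show every site side by side while a single selection narrows the same panel.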

All of these are basically clarity issues - we have a hard time parsing (or can imagine others having a hard time parsing - say upper mgmt or random new SREs not up to speed on the minutiae of WMF SLOs) the meaning of the information, and it can probably be fixed with more labels/information/etc around the data. There might be data issues too, but it's hard to tell when we don't clearly understand the data we're seeing.

The templated dashboards aim to be straight to the point regarding current SLO standing, but I'm very open to ideas for improving clarity. In addition, I've created T302995 as a placeholder to explore whether and how layering a non-Grafana SLO interface might simplify how SLOs are presented for these types of cases.

To take a step back, the varnish slo dashboard linked in the description didn't actually originate from a template. Presumably this one was a manual fork of the original etcd slo example dashboard that's been manually adjusted.

Argh. :) Okay, thanks - let's indeed start from the templated version. If no objections, I think we should delete the nontemplated dashboard as a hazard to navigation.

  • A Varnish latency SLO has not yet been defined afaik, these panels show 'no data' currently

Varnish uses a single combined latency-availability SLO, defined here -- the metric to monitor is the fraction (requests that complete successfully and on time) / (total requests). So there's only one SLI, and any request that comes back with an error or latency due to Varnish counts against the SLI equally. varnish_sli_good and varnish_sli_all are the counters for the top and bottom of that fraction; see modules/mtail/files/programs/varnishsli.mtail for the definition of varnish_sli_good.

On the templated dashboard, it looks like that's what "request error budget" and "request error ratio" are already measuring, so we can just give it a clearer label and delete the unused separate latency graphs.
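
To make the arithmetic behind the combined SLI concrete, here is a minimal Python sketch. It assumes `good` and `total` are the increases of varnish_sli_good and varnish_sli_all over the reporting period so far, that the target is 99.9% (the figure mentioned in the feedback above), and that the budget is measured against requests seen so far. The `SloStatus` type and `slo_status` function are hypothetical names, and the dashboard's own queries may define these quantities differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloStatus:
    sli: float               # fraction of good requests, between 0 and 1
    budget_used: float       # fraction of the error budget consumed; can exceed 1
    budget_remaining: float  # 1 - budget_used; negative once the SLO is violated


def slo_status(good: float, total: float, target: float = 0.999) -> SloStatus:
    """Compute the SLI and error-budget figures from two counter increases."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    sli = good / total
    allowed_bad = (1 - target) * total  # bad requests the SLO tolerates
    actual_bad = total - good           # errors and slow requests combined
    used = actual_bad / allowed_bad
    return SloStatus(sli=sli, budget_used=used, budget_remaining=1 - used)


if __name__ == "__main__":
    print(slo_status(good=999_500, total=1_000_000))
```

With 999,500 good requests out of 1,000,000 and a 99.9% target, this gives an SLI of 99.95% and roughly half of the error budget consumed. Under this definition "budget used" and "budget remaining" are complements of each other, which is one reading of why two of the panels looked redundant.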

  • Varnish request SLO queries don't currently utilize the site variable (we can of course change this), so it's an aggregate across all sites, which I think relates to this point below
  • Currently the template has a pull down and shows one cluster at a time, but we can update this as desired (in addition to the bit about $site above). Should we display the status for all clusters/sites by default, and provide the option to filter more specifically? What would be the most clear there?

This should be clearer in the SLO document, I'll update it shortly, but we do monitor each site separately -- for example in a given quarter it's possible for Varnish to meet its SLO in codfw but violate it in esams.

That means the team, reviewing the dashboard each week, should ideally be able to see numbers for all the sites at a glance. I like your proposal of displaying everything with the option to filter -- we'd talked about having the site-specific view also provide more details, but for now it seems like we have plenty of room to show everything at once.
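
A small Python sketch of why per-site numbers matter: an aggregate across sites can meet the target while one site misses it. The request counts below are made up for illustration, the 99.9% target is assumed, and the helper names are hypothetical.

```python
# Made-up (good, total) request counts per site, for illustration only.
EXAMPLE_COUNTS: dict[str, tuple[float, float]] = {
    "codfw": (9_999_000, 10_000_000),
    "esams": (998_000, 1_000_000),
}


def sites_violating(counts: dict[str, tuple[float, float]],
                    target: float = 0.999) -> list[str]:
    """Return the sites whose own good/total ratio falls below the target."""
    return sorted(
        site
        for site, (good, total) in counts.items()
        if total > 0 and good / total < target
    )


def aggregate_sli(counts: dict[str, tuple[float, float]]) -> float:
    """Return the good/total ratio over all sites combined."""
    good = sum(g for g, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return good / total


if __name__ == "__main__":
    print("aggregate SLI:", aggregate_sli(EXAMPLE_COUNTS))
    print("violating sites:", sites_violating(EXAMPLE_COUNTS))
```

In this example the combined ratio stays above 99.9% because the larger site dominates the total, yet esams on its own is below target. That is the case an aggregate-only panel would hide.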


Overall though: I agree, let's take a step back and gather real-world feedback again using the correct page as a starting point. With any luck we'll be able to have a quicker feedback loop this time around.

Change 768108 had a related patch set uploaded (by Herron; author: Herron):

[operations/grafana-grizzly@master] varnish_slo: enable multi/all selectors and update query to include $site

https://gerrit.wikimedia.org/r/768108

If no objections, I think we should delete the nontemplated dashboard as a hazard to navigation.

Sounds good, I've moved the manual dashboards into the "to be deleted" folder, and added a deprecation notice to them with a link to the up-to-date dashboards.

Varnish uses a single combined latency-availability SLO, defined here -- the metric to monitor is the fraction (requests that complete successfully and on time) / (total requests). So there's only one SLI, and any request that comes back with an error or latency due to Varnish counts against the SLI equally. varnish_sli_good and varnish_sli_all are the counters for the top and bottom of that fraction; see modules/mtail/files/programs/varnishsli.mtail for the definition of varnish_sli_good.

Ah! Is a combined SLO preferred and/or something we expect to apply to other services? I ask because IIUC we could also update this to output individual latency and error SLO metrics for tracking independently, and could opt to create our own latency bucket metrics as well.

I like your proposal of displaying everything with the option to filter -- we'd talked about having the site-specific view also provide more details, but for now it seems like we have plenty of room to show everything at once.

Ok, just uploaded a patch for this, please lmk what you think.

Overall though: I agree, let's take a step back and gather real-world feedback again using the correct page as a starting point. With any luck we'll be able to have a quicker feedback loop this time around.

SGTM, and also appreciate the detailed task, I think that's a great approach, thank you for organizing it.

Change 768108 merged by Herron:

[operations/grafana-grizzly@master] varnish_slo: enable multi/all selectors and display all sites in panels

https://gerrit.wikimedia.org/r/768108

Change 772923 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/grafana-grizzly@master] slo: Move most of the text panel content to a description field, so it can be overridden

https://gerrit.wikimedia.org/r/772923

Change 772923 merged by Herron:

[operations/grafana-grizzly@master] slo: Move most of the text panel content to a description field, so it can be overridden

https://gerrit.wikimedia.org/r/772923

Change 776992 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/grafana-grizzly@master] slo: Set a custom description for the Varnish dashboard

https://gerrit.wikimedia.org/r/776992

Change 776992 merged by RLazarus:

[operations/grafana-grizzly@master] slo: Set a custom description for the Varnish dashboard

https://gerrit.wikimedia.org/r/776992

Change 802646 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/grafana-grizzly@master] slo: Correct queries for error budget remaining

https://gerrit.wikimedia.org/r/802646

Change 802646 merged by RLazarus:

[operations/grafana-grizzly@master] slo: Correct queries for error budget remaining

https://gerrit.wikimedia.org/r/802646

I think this is resolvable at this point. Please reopen if I am mistaken!