
Investigate why Traffic SLO Grafana dashboard has negative values on combined SLI
Closed, Resolved (Public)

Description

https://grafana-rw.wikimedia.org/d/slo-HAProxy/ and https://grafana-rw.wikimedia.org/d/slo-Varnish/varnish-slo-s have occasional negative values, which is unexpected:

Screenshot at 2023-07-11 11-36-35.png (681×1 px, 27 KB)

It appears this occurs most often on codfw but still happens in other DCs.

The logic for calculating these metrics can be found at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/grafana-grizzly/+/refs/heads/master/slo_definitions.libsonnet
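
For reference, the combined SLI panels compute something along these lines (a minimal PromQL sketch only; the exact expressions live in slo_definitions.libsonnet, the haproxy_sli_good / haproxy_sli_total names are taken from the discussion below, and the 5m window is illustrative):

  # error fraction: 1 - good/total
  # if increase(good) comes out slightly larger than increase(total)
  # for a given sample, this dips below zero
  1 - (
      sum(increase(haproxy_sli_good[5m]))
    /
      sum(increase(haproxy_sli_total[5m]))
  )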

Event Timeline

BCornwall changed the task status from Open to In Progress. Jul 11 2023, 6:42 PM
BCornwall triaged this task as Low priority.
BCornwall moved this task from Backlog to Traffic team actively servicing on the Traffic board.
BCornwall renamed this task from Investigate why HAProxy SLO Grafana dashboard has negative values on combined SLI to Investigate why Traffic SLO Grafana dashboard has negative values on combined SLI. Jul 17 2023, 9:31 PM
BCornwall updated the task description.
BCornwall updated the task description.

From discussion on IRC:

<brett> I'm noticing that the SLO calculations on grafana seem a little off but I may be missing something. Does anyone have a definitive calculation for remaining budget and measuring SLI? I'm suspicious of the use of increase() in all these since it's entirely possible for the increase of good logs to be larger than the total logs, causing a weird situation (e.g. 1.25/1)
<rzl> brett: can you point me at which SLO you're looking at?
<brett> rzl: Any of them, really (They all seem to follow the same basic formula). For instance, https://grafana-rw.wikimedia.org/d/slo-Etcd/etcd-slo-s
<rzl> there's some brief discussion in the comments at the top of https://gerrit.wikimedia.org/r/plugins/gitiles/operations/grafana-grizzly/+/refs/heads/master/slo_definitions.libsonnet but I'm happy to talk more -- in principle "good events" should always be between 0x and 1x "total events," but in some cases we're using a timeseries that doesn't map perfectly to that
<brett> Or https://grafana-rw.wikimedia.org/d/slo-HAProxy/haproxy-slo-s, which is where we notice the negative values on the combined SLI
<brett> > Increase() calculates the increase in the time series in the range vector
<rzl> so in the case of haproxy, that's using counters that are defined here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/mtail/files/programs/cache_haproxy.mtail
<brett> The counters seem pretty foolproof for making sure the goods don't ever go past the totals
<rzl> for any given time interval I think it's reasonable to say haproxy_sli_good <= haproxy_sli_total -- where it slips a tiny bit in either direction, I think that's likely artifacting due to timestamp skew or something along those lines
<rzl> notice also the scale is way, way smaller than anything we see during an actual outage -- we're just zooming in on it because the service is healthy
<rzl> and it definitely doesn't have any effect on the timespan of a quarter, which is where we actually evaluate the SLO, right
<brett> rzl: Did you mean "we evaluate the SLO correctly" or "we evaluate the SLO, yeah?"
<rzl> haha the second one, sorry
<rzl> "which is where we actually evaluate the SLO, do you see what I mean"
<rzl> "do you see what I mean, or additionally correct me if I've got something wrong"
<brett> I see what you're saying. Not comprehending the slip, though. By "slip" do you mean the goods sometimes exceeds the total?
<rzl> I mean, at the points in the graph where you've pointed out it becomes negative, even though I claimed that shouldn't happen
<brett> rzl: Re timestamp skew: Are you suggesting that perhaps, since the counters technically aren't incremented at the exact same time, they might not exactly line up when prometheus samples them?
<rzl> that's the shape of the idea I'm talking about, but I'm not sure where exactly in the pipeline something like that would creep in -- that part is definitely an o11y question
<brett> cool. Thanks again! I'll clarify tomorrow with them
<rzl> I do notice that the size of the spike changes as I zoom in and out - it's always exactly one data point as displayed on the graph, which suggests it's more likely a querying or visualization issue than a real-time sampling issue, but that's as far as I go
<rzl> sounds good, I'll be interested in what you find out :)
[…]
<_joe_> Counters are definitely a thing that can go out of sync
<_joe_> specifically, it doesn't even need to have been recorded to different times
<_joe_> it's enough for the prometheus exporter to look at data sequentially and compute one counter a few fractions of a second earlier than the next one
<_joe_> and for the increase to be fast enough
<_joe_> say you have two counters that are right now identical, one is count_good, the other is count_total
<_joe_> they increase by 100k/second
<_joe_> if we take the snapshots of the counts at a distance of a few milliseconds, we might have a skew of a few pretty easily, as each millisecond will see an increase of 100k
<_joe_> err, 100
<_joe_> now in this case, given how mtail works, I would be surprised of that
[…]
<herron> brett: when I graph good and total for eqiad alongside each other the numbers are nearly identical, and I do see good exceed total by small amounts at times like you mention. However I'm wondering why the sli good counter would be following the total counter so closely that we would be considering margin of error from scraping, sampling, etc. as problematic. I'm wondering if the mtail counters are working as expected
<herron> https://grafana-rw.wikimedia.org/d/Uk3M_9C4k
[…]
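
To put concrete numbers on the kind of skew _joe_ describes (figures invented purely for illustration; whether this is what's actually happening here is exactly what's in question):

  count_good and count_total both grow at ~100,000/s, i.e. ~100 per ms
  scrape N:   total read at t,  good read at t  + 1 ms  -> good reads ~100 high
  scrape N+1: total read at t', good read at t' + 3 ms  -> good reads ~300 high
  => over the window spanning those two scrapes, increase(count_good) exceeds
     increase(count_total) by ~200, so 1 - good/total dips just below zero,
     even though logically good <= total at every instant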

Thanks for creating a tracking task! Quickly adding my notes from looking into this:

I created dashboard https://grafana.wikimedia.org/d/Uk3M_9C4k/haproxy-sli-scratchpad?orgId=1 to look more closely at the metrics supporting the SLO, and can see there that the sli good and sli total values follow each other very closely. At times good does exceed total, presumably due to margin of error in scraping and aggregating metrics of very similar value.

In terms of next steps I can think of a few options:

a) Adjust the mtail rules to output an sli "bad" metric (as opposed to the current "good" metric) - with this we could calculate the slo using total - bad = good

b) Go a step further and adjust the mtail rules to output multiple metrics/labels accounting for each of the error/latency states that are currently wrapped up into a single metric - this would allow us to capture and write more specific queries, and allow more granular inspection of sli metrics alongside improved retention (e.g. a change in the slo query wouldn't be obscured in the same way as with the current mtail "good" metric)

c) Throw away negative values in the sli panel. An example patch to do that is https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/944287, it would look like https://grafana.wikimedia.org/dashboard/snapshot/Lp5wp0vhvTG9wvQ3UUqPl1LIC8xnoaVI?orgId=1 if merged.
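
For reference, option c amounts to clamping the plotted value at zero; roughly the following (a sketch only, the linked patch is authoritative and may express it differently; metric names and window as in the sketch above):

  clamp_min(
    1 - (
        sum(increase(haproxy_sli_good[5m]))
      /
        sum(increase(haproxy_sli_total[5m]))
    ),
    0
  )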

Personally I'd be inclined to apply a near term band-aid like option c, while also pursuing one of the longer term options.

Based on the IRC conversation, wouldn't A have the same issues that we currently have?

I'm not convinced that C is the best way forward since it would be sweeping the problem under the rug rather than band-aiding it. I'd rather have the "wrong" values than massaged values.

I need to wrap my head around B a little more before I can comment. :)

> Based on the IRC conversation, wouldn't A have the same issues that we currently have?

Today we use 1 - ( good / total ), where if good exceeds total we see a negative value. My thinking is that switching to bad / total will avoid that condition and simplify the config. Plus an sli_bad (or errors) metric should be useful by itself.
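
Side by side (hedged PromQL sketch; metric names per the proposal in this thread, window illustrative):

  # current: 1 - good/total, goes negative whenever increase(good) > increase(total)
  1 - (sum(increase(haproxy_sli_good[5m])) / sum(increase(haproxy_sli_total[5m])))

  # proposed: bad/total, cannot go negative since both increases are >= 0
  sum(increase(haproxy_sli_bad[5m])) / sum(increase(haproxy_sli_total[5m]))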

bad, good, and total are still not guaranteed to be synced up and are still suspect data, aren't they? It's also entirely possible that bad requests could exceed total (though much less likely), isn't it? I'm basing this off of @Joe's explanation. If that explanation is correct, is it even reasonably possible to fix this at all? And is it worth attempting to fix (perhaps instead adding a comment that explains the discrepancies)? I'm thinking it might just be, as you mentioned, a margin of error that is to be accepted as part of data collection.

Thanks for your patience. :)

BCornwall changed the task status from In Progress to Stalled. Aug 24 2023, 7:24 PM

Setting as stalled: Any opinions on how to go about this with my above comment in mind? Thanks!

@BCornwall fwiw switching from "sli good" to "sli bad" does have the above in mind, namely by working with the small margin of error (by switching the calculation to a bad and total metric) instead of against it (attempting to maintain identical good and total metric values). That'd be actionable in the near term and would avoid the negative sli, with the exception of edge cases where haproxy is serving 100% errors. With that said, looping in @colewhite and @fgiunchedi for their thoughts and potential alternatives.

> bad, good, and total are still not guaranteed to be synced up and are still suspect data, aren't they? It's also entirely possible that bad requests could exceed total (though much less likely), isn't it? I'm basing this off of @Joe's explanation. If that explanation is correct, is it even reasonably possible to fix this at all? And is it worth attempting to fix (perhaps instead adding a comment that explains the discrepancies)? I'm thinking it might just be, as you mentioned, a margin of error that is to be accepted as part of data collection.

Given that we're working within the design of an operational metrics system, as opposed to billing/metering (i.e. something that needs to be exact), discrepancies of this kind are to be expected. The values are still good for our purposes (e.g. the number of errors can be relied upon to signal problems, even within a small margin of error).

> @BCornwall fwiw switching from "sli good" to "sli bad" does have the above in mind, namely by working with the small margin of error (by switching the calculation to a bad and total metric) instead of against it (attempting to maintain identical good and total metric values). That'd be actionable in the near term and would avoid the negative sli, with the exception of edge cases where haproxy is serving 100% errors. With that said, looping in @colewhite and @fgiunchedi for their thoughts and potential alternatives.

I think moving to "sli bad" is worth a try (I'm assuming it isn't much work to implement) to see how we fare; under normal circumstances errors are far from total anyway.

FWIW bad / total is also how we report varnish / haproxy availability here https://grafana.wikimedia.org/d/000000479/cdn-frontend-traffic

- name: traffic
  rules:
    - record: site_cluster:haproxy_requests:avail2m
      expr: sum by(site, cluster) (cluster_code:haproxy_frontend_http_responses_total:rate2m{code="5xx"})
        / sum by(site, cluster) (cluster_code:haproxy_frontend_http_responses_total:rate2m{code=~"[12345]xx"})
    - record: cluster_code:haproxy_frontend_http_responses_total:rate2m
      expr: sum by(cluster, code) (rate(haproxy_frontend_http_responses_total[2m]))

HTH!

Change 953725 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] mtail: Use "bad" requests for varnish SLI metrics

https://gerrit.wikimedia.org/r/953725

Change 953725 merged by BCornwall:

[operations/puppet@production] mtail: Record bad requests for varnish SLI metrics

https://gerrit.wikimedia.org/r/953725

@herron Thanks for all of your help. We've implemented varnish_sli_bad. I followed the formulae presented at the top of grafana-grizzly's slo_definitions.libsonnet and got these results. They seem to differ from the current dashboard's values, however. Would you be kind enough to double-check to make sure I'm not missing anything? Thank you. :)
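
(For context, with the new counter the quarterly budget math is roughly the following; this is a hedged sketch of a standard error-budget formulation, with a hypothetical 99.9% target, an assumed varnish_sli_total counter, and an illustrative 90d window. The actual targets, windows, and expressions are the ones in slo_definitions.libsonnet and may differ.)

  # fraction of error budget remaining, for a hypothetical 99.9% target (0.1% budget)
  1 - (
      (
        sum(increase(varnish_sli_bad[90d]))
        /
        sum(increase(varnish_sli_total[90d]))
      )
      / 0.001
  )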

BCornwall changed the task status from Stalled to In Progress. Oct 12 2023, 10:36 PM

> @herron Thanks for all of your help. We've implemented varnish_sli_bad. I followed the formulae presented at the top of grafana-grizzly's slo_definitions.libsonnet and got these results. They seem to differ from the current dashboard's values, however. Would you be kind enough to double-check to make sure I'm not missing anything? Thank you. :)

Nice, thanks for working on it. Since we were working with inconsistent metrics before I'm not too surprised to see the budget burn improve overall. LGTM!

Change 965842 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/grafana-grizzly@master] slo_definitions: Switch to using varnish_sli_bad

https://gerrit.wikimedia.org/r/965842

Change 965842 merged by BCornwall:

[operations/grafana-grizzly@master] slo_definitions: Switch to using varnish_sli_bad

https://gerrit.wikimedia.org/r/965842

Change 966918 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] mtail: Record bad requests for HAProxy SLI metrics

https://gerrit.wikimedia.org/r/966918

Change 966930 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] mtail: Record bad requests for ATS SLI metrics

https://gerrit.wikimedia.org/r/966930

Change 966930 merged by BCornwall:

[operations/puppet@production] mtail: Record bad requests for ATS SLI metrics

https://gerrit.wikimedia.org/r/966930

Change 966918 merged by BCornwall:

[operations/puppet@production] mtail: Record bad requests for HAProxy SLI metrics

https://gerrit.wikimedia.org/r/966918

Change 973871 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/grafana-grizzly@master] slo_definitions: Switch to using haproxy_sli_bad

https://gerrit.wikimedia.org/r/973871

Change 973872 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/grafana-grizzly@master] slo_definitions: Use trafficserver_backend_sli_bad

https://gerrit.wikimedia.org/r/973872

Change 973871 merged by BCornwall:

[operations/grafana-grizzly@master] slo_definitions: Switch to using haproxy_sli_bad

https://gerrit.wikimedia.org/r/973871

Change 973872 merged by BCornwall:

[operations/grafana-grizzly@master] slo_definitions: Use trafficserver_backend_sli_bad

https://gerrit.wikimedia.org/r/973872