Page MenuHomePhabricator

SLO dashboards for Lift Wing showing unexpected values
Closed, ResolvedPublic5 Estimated Story Points

Description

Reported via IRC:

11:56 AM <elukey> I know that you are all at the Summit but I wanted to report something weird that I am seeing with Thanos
11:56 AM <elukey> The SLO dashboards for Lift Wing are showing some weird numbers, like SLO 1500%, so I tried to check the related metrics
11:56 AM <elukey> the first example is https://w.wiki/9S2m
11:56 AM <elukey> the first graph is the SLI metric (availability), the second is the numerator and the third the denominator
11:57 AM <elukey> In theory the third should report the same metrics as the second, plus others (the 5xx errors etc..)
11:57 AM <elukey> if I pick 
11:57 AM <elukey> istio_sli_availability_requests_total:increase5m{destination_canonical_service="revertrisk-language-agnostic-predictor-default", destination_service_namespace="revertrisk", prometheus="thanos-rule", response_code="200", site="codfw"}
11:57 AM <elukey> in the second is ~3000, in the third 0
11:58 AM <elukey> not sure if I am missing something obvious or not
12:32 PM <herron> elukey: I'm seeing gaps in the sli panels on the grafana dashboards, hmm I wonder if we're having issues with these underyling recording rules
12:32 PM <elukey> herron: o/
12:33 PM <herron> hey :)
12:33 PM <elukey> yeah that too, but it was something that we already had :(
12:33 PM <elukey> I noticed https://gerrit.wikimedia.org/r/c/operations/puppet/+/992415
12:33 PM <elukey> but not sure if it plays a role or not
12:38 PM <herron> it seems harmless but the timing lines up roughly, I'm having a look through thanos logs
12:45 PM <herron> hmm no I'm off by a month wrt that patch.  lines up somewhat with remediation steps in T356788 although still looking
1:01 PM <rzl> elukey: hm! that's weird, you get good data if you try it with `response_code=~"[234].*"` or even `response_code=~"..."`, but nothing for `.*` or if you leave it out
1:03 PM <elukey> rzl: o/ thanks for checking! So with the response_code=~... I get only eqiad with a value, not codfw
1:03 PM <elukey> even stranger :D
1:04 PM <rzl> oh you're right! I was just eyeballing the graph
1:05 PM <rzl> to me it sort of smells like a query issue and not a recording rule problem, but I don't have anything concrete to point at
1:05 PM <elukey> yeah same feeling, it is like hitting different endpoints

Event Timeline

Spent some time looking into this, and I'm also wondering why these queries seem to yield different results when ran over more than ~15d

For instance this query gives consistent results with sum_over_time[7d]

While same query over 90d I see results change each time it is executed with sum_over_time[90d]

The OOM remediation steps in T356788 come to mind as worth ruling out since its a fairly recent change in thanos, more specifically https://gerrit.wikimedia.org/r/c/operations/puppet/+/1006006. We shouldn't be hitting this 1 year max with our 90d queries, but maybe an unexpected side effect

@fgiunchedi what do you think?

Tried to compare the SLI availability thanos rule vs the actual istio metric, more or less using the same aggregation: https://w.wiki/9Tqd

It seems that the second panel (original istio metrics) is more consistent and it doesn't cause any weird value (like > 1.0 / 100%). The first graph is what comes from Thanos, and the result is way different. So in theory it seems a problem with Thanos queries themselves, afaics...

The definition of the Thanos rule is:

- record: istio_sli_availability_requests_total:increase5m
  expr: sum by (destination_canonical_service, destination_service_namespace, response_code, site, prometheus) (increase(istio_requests_total{kubernetes_namespace="istio-system"}[5m]))

The evaluation timings for the istio availability SLI seems in the order of one/two seconds, meanwhile the latency one got worse from the last time that I check (before the parternity leave, ~3 months ago):

https://thanos.wikimedia.org/rules#istio_slos

I think it may be due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/989458, probably we are now onboarding too many metrics (not sure if related to the current issue, but worth mentioning).

Change 1011146 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] profile::thanos: Add latency histogram buckets back for Istio

https://gerrit.wikimedia.org/r/1011146

Change 1011146 merged by Klausman:

[operations/puppet@production] profile::thanos: Add latency histogram buckets back for Istio

https://gerrit.wikimedia.org/r/1011146

calbon set the point value for this task to 5.
calbon moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.

@herron something really strange: https://w.wiki/9bMW

I compared the recording rule with the actual metric, trying to aggregate with the same labels, and the results of the recording rule are strange. I see both eqiad and codfw traffic reported, while the original metric shows mostly eqiad traffic. Anything that comes up to mind on Thanos that could cause this?

@herron something really strange: https://w.wiki/9bMW

I compared the recording rule with the actual metric, trying to aggregate with the same labels, and the results of the recording rule are strange. I see both eqiad and codfw traffic reported, while the original metric shows mostly eqiad traffic. Anything that comes up to mind on Thanos that could cause this?

Looking at this link I'm seeing gaps in the top panel (recording rule) which go away go away when deduplication is disabled. When I set both panels to 12h with deduplication off the graphs are more what I'd expect to see. Not sure off hand why that is, will keep digging

Change #1017873 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] thanos-rule: stop overriding site and prometheus labels

https://gerrit.wikimedia.org/r/1017873

Change #1017873 merged by Herron:

[operations/puppet@production] thanos-rule: stop overriding site and prometheus labels

https://gerrit.wikimedia.org/r/1017873

Change #1024790 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] istio_slos: add secondary recording rules

https://gerrit.wikimedia.org/r/1024790

Change #1024790 merged by Herron:

[operations/puppet@production] istio_slos: add secondary recording rules

https://gerrit.wikimedia.org/r/1024790

I rechecked our dashboards, and after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017873 I don't see anymore the weird values above 100%.

@herron let's double check, maybe we can drop the secondary rules and keep going with the "regular" ones?

Example of the fix: https://grafana.wikimedia.org/d/slo-Lift_Wing_Revert_Risk_LA/lift-wing-revert-risk-la-slo-s?orgId=1&from=2024-03-01%2000:00:00&to=2024-05-31%2023:59:59

Change #1046690 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] Revert "istio_slos: add secondary recording rules"

https://gerrit.wikimedia.org/r/1046690

@herron let's double check, maybe we can drop the secondary rules and keep going with the "regular" ones?

Example of the fix: https://grafana.wikimedia.org/d/slo-Lift_Wing_Revert_Risk_LA/lift-wing-revert-risk-la-slo-s?orgId=1&from=2024-03-01%2000:00:00&to=2024-05-31%2023:59:59

Sounds good to me, uploaded a revert patch

Change #1046690 merged by Herron:

[operations/puppet@production] Revert "istio_slos: add secondary recording rules"

https://gerrit.wikimedia.org/r/1046690

andrea.denisse subscribed.
elukey claimed this task.

Tentatively closing this, I'll reopen in case we resurface it while testing SLOs in Pyrra.