Reported via IRC:
11:56 AM <elukey> I know that you are all at the Summit but I wanted to report something weird that I am seeing with Thanos
11:56 AM <elukey> The SLO dashboards for Lift Wing are showing some weird numbers, like SLO 1500%, so I tried to check the related metrics
11:56 AM <elukey> the first example is https://w.wiki/9S2m
11:56 AM <elukey> the first graph is the SLI metric (availability), the second is the numerator and the third the denominator
11:57 AM <elukey> In theory the third should report the same metrics as the second, plus others (the 5xx errors etc..)
11:57 AM <elukey> if I pick
11:57 AM <elukey> istio_sli_availability_requests_total:increase5m{destination_canonical_service="revertrisk-language-agnostic-predictor-default", destination_service_namespace="revertrisk", prometheus="thanos-rule", response_code="200", site="codfw"}
11:57 AM <elukey> in the second is ~3000, in the third 0
11:58 AM <elukey> not sure if I am missing something obvious or not
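
For context on the "SLO 1500%" symptom: the availability SLI is the numerator divided by the denominator, so it can only exceed 100% if the denominator is missing series that the numerator has. Here is a minimal sketch of that arithmetic. The sample values and the helper names (`availability`, `missing_from_denominator`) are hypothetical, chosen for illustration and not taken from Thanos or the recording rules.

```python
# Hypothetical samples keyed by (site, response_code); the values are
# illustrative only and were not read from Thanos.
numerator = {("codfw", "200"): 3000.0, ("eqiad", "200"): 2800.0}
# The denominator should hold every numerator series plus the error codes.
# The codfw series is left out here to mimic the report above.
denominator = {("eqiad", "200"): 2800.0, ("eqiad", "503"): 12.0}


def availability(num, den):
    """Return sum(num) / sum(den), or None when the denominator sums to zero."""
    total = sum(den.values())
    return sum(num.values()) / total if total else None


def missing_from_denominator(num, den):
    """List nonzero numerator series with no nonzero value in the denominator."""
    return sorted(k for k, v in num.items() if v > 0 and den.get(k, 0.0) == 0.0)


ratio = availability(numerator, denominator)
gaps = missing_from_denominator(numerator, denominator)
print(f"availability = {ratio:.1%}, series missing from denominator = {gaps}")
```

A ratio above 1.0 together with a non-empty list of missing series would point at the denominator query or rule rather than at the service itself.
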
12:32 PM <herron> elukey: I'm seeing gaps in the sli panels on the grafana dashboards, hmm I wonder if we're having issues with these underlying recording rules
12:32 PM <elukey> herron: o/
12:33 PM <herron> hey :)
12:33 PM <elukey> yeah that too, but it was something that we already had :(
12:33 PM <elukey> I noticed https://gerrit.wikimedia.org/r/c/operations/puppet/+/992415
12:33 PM <elukey> but not sure if it plays a role or not
12:38 PM <herron> it seems harmless but the timing lines up roughly, I'm having a look through thanos logs
12:45 PM <herron> hmm no I'm off by a month wrt that patch. lines up somewhat with remediation steps in T356788 although still looking
1:01 PM <rzl> elukey: hm! that's weird, you get good data if you try it with `response_code=~"[234].*"` or even `response_code=~"..."`, but nothing for `.*` or if you leave it out
1:03 PM <elukey> rzl: o/ thanks for checking! So with the response_code=~... I get only eqiad with a value, not codfw
1:03 PM <elukey> even stranger :D
1:04 PM <rzl> oh you're right! I was just eyeballing the graph
1:05 PM <rzl> to me it sort of smells like a query issue and not a recording rule problem, but I don't have anything concrete to point at
1:05 PM <elukey> yeah same feeling, it is like hitting different endpoints
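
A note on why the matcher behaviour rzl describes is surprising: Prometheus regex label matchers are fully anchored, so `...` matches any three-character value and `.*` matches any value at all, including an empty one. Anything matched by `...` should therefore also be matched by `.*`. The sketch below checks this with Python's `re.fullmatch`, used here as a stand-in for the anchored RE2 matching in Prometheus. The `matches` helper and the list of codes are illustrative assumptions.

```python
import re


def matches(pattern, value):
    """Mimic an anchored label matcher: the whole value must match the pattern."""
    return re.fullmatch(pattern, value) is not None


# Illustrative response_code values; the empty string stands for a missing label.
codes = ["200", "404", "503", ""]

for pattern in ("[234].*", "...", ".*"):
    matched = [c for c in codes if matches(pattern, c)]
    print(f"response_code=~{pattern!r} matches {matched}")

# Every value matched by "..." is also matched by ".*".
superset_holds = all(matches(".*", c) for c in codes if matches("...", c))
print("'.*' covers everything '...' matches:", superset_holds)
```

Since `.*` is a superset of `...`, getting data for `...` but nothing for `.*` is not explained by matcher semantics alone, which is consistent with the suspicion that the two queries are answered from different data.
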