The SLOMetricAbsent alert defined by Pyrra seems to fire occasionally for SLO metrics.
The most recent instance was:
This relates to the metric haproxy_sli_total{cluster=~"cache_upload",site=~"magru"}
herron, Jul 11 2024, 5:54 PM

| Attachment | Date |
|---|---|
| F59974582: 2025-05-14-175245_3659x1022_scrot.png | May 14 2025, 3:53 PM |
| F56359040: Screenshot 2024-07-11 at 1.49.50 PM.png | Jul 11 2024, 5:54 PM |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | fgiunchedi | T383570 thanos query/store OOM on titan hosts |
| Open | | herron | T369854 Occasional SLOMetricAbsent alerts |
Reviewing the thanos-rule logs, I'm seeing related discards with err="out of order sample":
Jul 11 14:33:03 titan1001 thanos-rule[952]: level=warn ts=2024-07-11T14:33:03.340415365Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/haproxy-combined-esams-cache_upload.yaml group=haproxy-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"SLOMetricAbsent\", service=\"haproxy\", severity=\"critical\", slo=\"haproxy-combined\", team=\"traffic\"} => 1720707743 @[1720707743760]"
Jul 11 14:33:03 titan1001 thanos-rule[952]: level=warn ts=2024-07-11T14:33:03.342204239Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/trafficserver-combined-eqiad-cache_upload.yaml group=trafficserver-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS\", alertname=\"SLOMetricAbsent\", alertstate=\"pending\", service=\"haproxy\", severity=\"critical\", slo=\"trafficserver-combined\", team=\"traffic\"} => 1 @[1720707723570]"
Jul 11 14:33:03 titan1001 thanos-rule[952]: level=warn ts=2024-07-11T14:33:03.342570299Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/trafficserver-combined-eqiad-cache_upload.yaml group=trafficserver-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"SLOMetricAbsent\", service=\"haproxy\", severity=\"critical\", slo=\"trafficserver-combined\", team=\"traffic\"} => 1720707723 @[1720707723570]"
Jul 11 14:33:03 titan1001 thanos-rule[952]: level=warn ts=2024-07-11T14:33:03.343704738Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/haproxy-combined-eqiad-cache_text.yaml group=haproxy-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS\", alertname=\"SLOMetricAbsent\", alertstate=\"pending\", service=\"haproxy\", severity=\"critical\", slo=\"haproxy-combined\", team=\"traffic\"} => 1 @[1720707726460]"
Jul 11 14:33:03 titan1001 thanos-rule[952]: level=warn ts=2024-07-11T14:33:03.343756154Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/haproxy-combined-eqiad-cache_text.yaml group=haproxy-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"SLOMetricAbsent\", service=\"haproxy\", severity=\"critical\", slo=\"haproxy-combined\", team=\"traffic\"} => 1720707726 @[1720707726460]"
More in P66323.
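To see which rule files and error types dominate, the discard lines above can be tallied. This is a minimal sketch, assuming the journal lines are available as plain text. `parse_fields` and `tally_discards` are hypothetical helper names, and the regex only covers the key=value layout visible in these excerpts.

```python
import re
from collections import Counter

# Matches key=value and key="quoted value" pairs, allowing backslash-escaped
# quotes inside a quoted value (as in the sample="..." field).
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_fields(line):
    """Return the key/value fields found in one log line as a dict."""
    fields = {}
    for key, raw in PAIR.findall(line):
        if raw.startswith('"'):
            raw = raw[1:-1]  # drop the surrounding quotes
        fields[key] = raw
    return fields


def tally_discards(lines):
    """Count discarded rule results per (rule file basename, error string)."""
    counts = Counter()
    for line in lines:
        fields = parse_fields(line)
        if fields.get("msg") != "Rule evaluation result discarded":
            continue
        rule_file = fields.get("file", "?").rsplit("/", 1)[-1]
        counts[(rule_file, fields.get("err", "?"))] += 1
    return counts
```

Feeding it the lines from a saved log file (for example `tally_discards(open(path))`, where `path` is wherever the excerpt was saved) gives a quick per-file, per-error count without reading each line by eye.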
SLOMetricAbsent for dead_letters_hits, varnish_sli_bad, trafficserver_backend_sli_bad and haproxy_sli_bad occurred today, at two different times.
Jul 18 05:00:42 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:00:42.511200613Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/trafficserver-combined-magru-cache_text.yaml group=trafficserver-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS\", alertname=\"SLOMetricAbsent\", alertstate=\"pending\", service=\"haproxy\", severity=\"critical\", slo=\"trafficserver-combined\", team=\"traffic\"} => 1 @[1721278780985]"
Jul 18 05:00:42 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:00:42.511317641Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/trafficserver-combined-magru-cache_text.yaml group=trafficserver-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"SLOMetricAbsent\", service=\"haproxy\", severity=\"critical\", slo=\"trafficserver-combined\", team=\"traffic\"} => 1721278780 @[1721278780985]"
Jul 18 05:04:34 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:04:34.794895888Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/trafficserver-combined-magru-cache_text.yaml group=trafficserver-combined-increase name=SLOMetricAbsent index=1 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS\", alertname=\"SLOMetricAbsent\", alertstate=\"pending\", service=\"haproxy\", severity=\"critical\", slo=\"trafficserver-combined\", team=\"traffic\"} => 1 @[1721279020985]"
Jul 18 05:04:34 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:04:34.795017365Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/trafficserver-combined-magru-cache_text.yaml group=trafficserver-combined-increase name=SLOMetricAbsent index=1 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"SLOMetricAbsent\", service=\"haproxy\", severity=\"critical\", slo=\"trafficserver-combined\", team=\"traffic\"} => 1721278780 @[1721279020985]"
Jul 18 05:04:39 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:04:39.070666066Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/trafficserver-combined-magru-cache_text.yaml group=trafficserver-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS\", alertname=\"SLOMetricAbsent\", alertstate=\"pending\", service=\"haproxy\", severity=\"critical\", slo=\"trafficserver-combined\", team=\"traffic\"} => 1 @[1721279020985]"
Jul 18 05:04:39 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:04:39.070773988Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/trafficserver-combined-magru-cache_text.yaml group=trafficserver-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"SLOMetricAbsent\", service=\"haproxy\", severity=\"critical\", slo=\"trafficserver-combined\", team=\"traffic\"} => 1721278780 @[1721279020985]"
Jul 18 05:15:47 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:15:47.460502445Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/haproxy-combined-magru-cache_upload.yaml group=haproxy-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="duplicate sample for timestamp" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"SLOMetricAbsent\", service=\"haproxy\", severity=\"critical\", slo=\"haproxy-combined\", team=\"traffic\"} => 1721279742 @[1721279742347]"
Jul 18 05:15:50 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:15:50.093091886Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/haproxy-combined-magru-cache_text.yaml group=haproxy-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS\", alertname=\"SLOMetricAbsent\", alertstate=\"pending\", service=\"haproxy\", severity=\"critical\", slo=\"haproxy-combined\", team=\"traffic\"} => 1 @[1721279641921]"
Jul 18 05:15:50 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:15:50.093159189Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/haproxy-combined-magru-cache_text.yaml group=haproxy-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"SLOMetricAbsent\", service=\"haproxy\", severity=\"critical\", slo=\"haproxy-combined\", team=\"traffic\"} => 1721279641 @[1721279641921]"
Jul 18 05:15:50 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:15:50.556020389Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/trafficserver-combined-magru-cache_upload.yaml group=trafficserver-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="duplicate sample for timestamp" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"SLOMetricAbsent\", service=\"haproxy\", severity=\"critical\", slo=\"trafficserver-combined\", team=\"traffic\"} => 1721279746 @[1721279746972]"
Jul 18 05:19:48 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:19:48.864314724Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/haproxy-combined-magru-cache_upload.yaml group=haproxy-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="duplicate sample for timestamp" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"SLOMetricAbsent\", service=\"haproxy\", severity=\"critical\", slo=\"haproxy-combined\", team=\"traffic\"} => 1721279742 @[1721279982347]"
Jul 18 05:19:48 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:19:48.865670967Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/trafficserver-combined-magru-cache_text.yaml group=trafficserver-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS\", alertname=\"SLOMetricAbsent\", alertstate=\"firing\", service=\"haproxy\", severity=\"critical\", slo=\"trafficserver-combined\", team=\"traffic\"} => 1 @[1721279980985]"
Jul 18 05:19:48 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:19:48.865713521Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/trafficserver-combined-magru-cache_text.yaml group=trafficserver-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="out of order sample" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"SLOMetricAbsent\", service=\"haproxy\", severity=\"critical\", slo=\"trafficserver-combined\", team=\"traffic\"} => 1721278780 @[1721279980985]"
Jul 18 05:19:49 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:19:49.54384128Z caller=manager.go:684 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/trafficserver-combined-magru-cache_upload.yaml group=trafficserver-combined-increase name=SLOMetricAbsent index=3 msg="Rule evaluation result discarded" err="duplicate sample for timestamp" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"SLOMetricAbsent\", service=\"haproxy\", severity=\"critical\", slo=\"trafficserver-combined\", team=\"traffic\"} => 1721279746 @[1721279986972]"
Looks like prometheus7001 wasn't reachable around 05:00:
Jul 18 04:58:31 titan2001 thanos-query[955]: level=warn ts=2024-07-18T04:58:31.352785469Z caller=endpointset.go:446 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from prometheus7001:29900: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=prometheus7001:29900
Jul 18 05:05:43 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:05:43.033792247Z caller=manager.go:639 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/varnish-combined-magru-cache_text.yaml group=varnish-combined name=varnish_sli_all:burnrate3d index=5 msg="Evaluating rule failed" rule="record: varnish_sli_all:burnrate3d\nexpr: sum(rate(varnish_sli_bad{cluster=\"cache_text\",site=\"magru\"}[3d])) / sum(rate(varnish_sli_all{cluster=\"cache_text\",site=\"magru\"}[3d]))\nlabels:\n cluster: cache_text\n service: varnish\n site: magru\n slo: varnish-combined\n team: traffic\n" err="no query API server reachable"
Jul 18 05:09:15 titan2001 thanos-rule[1147057]: level=error ts=2024-07-18T05:09:15.869922618Z caller=rule.go:833 component=rules err="read query instant response: expected 2xx response, got 422. Body: {\"status\":\"error\",\"errorType\":\"execution\",\"error\":\"expanding series: proxy Series(): rpc error: code = Aborted desc = receive series from Addr: prometheus7001:29900 LabelSets: {prometheus=\\\"ops\\\", replica=\\\"a\\\", site=\\\"magru\\\"} Mint: 1715092697261 Maxt: 9223372036854775807: rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}\n" query="sum(rate(trafficserver_backend_sli_bad{cluster=~\"cache_upload\",site=~\"magru\"}[3h])) / sum(rate(trafficserver_backend_sli_total{cluster=~\"cache_upload\",site=~\"magru\"}[3h]))"
Jul 18 05:09:15 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T05:09:15.870026402Z caller=manager.go:639 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/trafficserver-combined-magru-cache_upload.yaml group=trafficserver-combined name=trafficserver_backend_sli:burnrate3h index=2 msg="Evaluating rule failed" rule="record: trafficserver_backend_sli:burnrate3h\nexpr: sum(rate(trafficserver_backend_sli_bad{cluster=~\"cache_upload\",site=~\"magru\"}[3h]))\n / sum(rate(trafficserver_backend_sli_total{cluster=~\"cache_upload\",site=~\"magru\"}[3h]))\nlabels:\n cluster: cache_upload\n service: haproxy\n site: magru\n slo: trafficserver-combined\n team: traffic\n" err="no query API server reachable"
And prometheus2005 wasn't reachable around 12:00:
Jul 18 11:47:51 titan2001 thanos-query[955]: level=warn ts=2024-07-18T11:47:51.353067338Z caller=endpointset.go:446 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from prometheus2005:29906: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=prometheus2005:29906
Jul 18 11:49:48 titan2001 thanos-rule[1147057]: level=error ts=2024-07-18T11:49:48.812050756Z caller=rule.go:833 component=rules err="read query instant response: expected 2xx response, got 422. Body: {\"status\":\"error\",\"errorType\":\"execution\",\"error\":\"expanding series: proxy Series(): rpc error: code = Aborted desc = receive series from Addr: prometheus2005:29907 LabelSets: {prometheus=\\\"k8s-staging\\\", replica=\\\"a\\\", site=\\\"codfw\\\"} Mint: 1720007248066 Maxt: 9223372036854775807: rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}\n" query="sum(rate(varnish_sli_bad{cluster=\"cache_text\",site=\"codfw\"}[15m])) / sum(rate(varnish_sli_all{cluster=\"cache_text\",site=\"codfw\"}[15m]))"
Jul 18 11:49:48 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T11:49:48.812145855Z caller=manager.go:639 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/varnish-combined-codfw-cache_text.yaml group=varnish-combined name=varnish_sli_all:burnrate15m index=0 msg="Evaluating rule failed" rule="record: varnish_sli_all:burnrate15m\nexpr: sum(rate(varnish_sli_bad{cluster=\"cache_text\",site=\"codfw\"}[15m])) / sum(rate(varnish_sli_all{cluster=\"cache_text\",site=\"codfw\"}[15m]))\nlabels:\n cluster: cache_text\n service: varnish\n site: codfw\n slo: varnish-combined\n team: traffic\n" err="no query API server reachable"
Jul 18 11:49:49 titan2001 thanos-rule[1147057]: level=error ts=2024-07-18T11:49:49.375134207Z caller=rule.go:833 component=rules err="read query instant response: expected 2xx response, got 422. Body: {\"status\":\"error\",\"errorType\":\"execution\",\"error\":\"expanding series: proxy Series(): rpc error: code = Aborted desc = receive series from Addr: prometheus2005:29906 LabelSets: {prometheus=\\\"k8s\\\", replica=\\\"a\\\", site=\\\"codfw\\\"} Mint: 1720007244037 Maxt: 9223372036854775807: rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}\n" query="1 - sum(log_dead_letters_hits:increase12w{site=\"codfw\",slo=\"logstash-requests-pilot\"} or vector(0)) / sum(logstash_node_plugin_events_out:increase12w{plugin_id=\"output/opensearch/logstash\",site=\"codfw\",slo=\"logstash-requests-pilot\"})"
Jul 18 11:49:49 titan2001 thanos-rule[1147057]: level=warn ts=2024-07-18T11:49:49.37521895Z caller=manager.go:639 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/logstash-requests-codfw.yaml group=logstash-requests-pilot-generic name=pyrra_availability index=2 msg="Evaluating rule failed" rule="record: pyrra_availability\nexpr: 1 - sum(log_dead_letters_hits:increase12w{site=\"codfw\",slo=\"logstash-requests-pilot\"}\n or vector(0)) / sum(logstash_node_plugin_events_out:increase12w{plugin_id=\"output/opensearch/logstash\",site=\"codfw\",slo=\"logstash-requests-pilot\"})\nlabels:\n service: logging\n site: codfw\n slo: logstash-requests-pilot\n team: o11y\n" err="no query API server reachable"
Jul 18 11:51:09 prometheus2005 prometheus@ops[1084]: ts=2024-07-18T11:51:09.564Z caller=notifier.go:530 level=error component=notifier alertmanager=http://alert2001.wikimedia.org:9093/api/v2/alerts count=36 msg="Error sending alert" err="Post \"http://alert2001.wikimedia.org:9093/api/v2/alerts\": context deadline exceeded"
Jul 18 11:51:19 prometheus2005 prometheus@ops[1084]: ts=2024-07-18T11:51:19.566Z caller=notifier.go:530 level=error component=notifier alertmanager=http://alert2001.wikimedia.org:9093/api/v2/alerts count=11 msg="Error sending alert" err="Post \"http://alert2001.wikimedia.org:9093/api/v2/alerts\": dial tcp [2620:0:860:3:208:80:153:84]:9093: connect: no route to host"
These seem to fire during connectivity issues between thanos-rule and the Prometheus instances. So they are not necessarily false positives as originally thought, but they are also not particularly actionable from an SLO perspective. As a next step, I'm going to look into increasing the time threshold for SLOMetricAbsent.
I had a closer look at the threshold tunables and I don't see a way to change this natively within Pyrra. As-is, the "for" duration of SLOMetricAbsent alerts is 6m. Pyrra has options to enable or disable the absent alert, but maybe we can configure something to put this alert in a silence/inhibit waiting room for 30-60m before it alerts (or recovers on its own).
Of course it'd be even better to prevent the error condition leading to this as well, although to some extent that is limited by the architecture, e.g. single instances in the PoPs.
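Since Pyrra does not seem to expose the "for" duration, one conceivable workaround is to post-process the generated rule files before thanos-rule loads them. The sketch below is only an illustration under assumptions: `raise_absent_for` is a hypothetical helper, the 30m default is just the lower end of the range mentioned above, and it assumes the generated YAML uses the usual Prometheus rule layout where each rule starts with `- alert:` or `- record:` and `for:` appears after the `alert:` key. Whether such a step fits into the deployment at all is untested.

```python
def raise_absent_for(rule_text, new_for="30m", alert_name="SLOMetricAbsent"):
    """Return rule_text with the `for:` of every `alert_name` rule replaced.

    Works line by line on the YAML text, so it relies on the layout
    assumptions described above rather than on a YAML parser.
    """
    out = []
    in_target = False
    for line in rule_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):
            # A new list item starts a new rule (or group); re-check the name.
            in_target = stripped == f"- alert: {alert_name}"
        elif in_target and stripped.startswith("for:"):
            # Keep the original indentation, swap only the duration.
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}for: {new_for}"
        out.append(line)
    trailing = "\n" if rule_text.endswith("\n") else ""
    return "\n".join(out) + trailing
```

Other alerts in the same file keep their own `for:` values, because the flag is reset at every new list item.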
I've looked a little into this issue. My hunch for the "out of order sample" errors is that whenever rule evaluation gets slow (e.g. due to network conditions), a race condition can be introduced. It's also unclear to me at the moment why the affected metrics with out-of-order samples are ALERTS and ALERTS_FOR_STATE. At any rate, it seems to me the simplest option is to ask Pyrra not to generate the absent() rules, and possibly re-evaluate (hah!) later. HTH
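To make the two error strings in the logs concrete, here is a toy model of how a single series might reject samples. It is not Thanos or Prometheus code. `ToySeries` is a hypothetical class, and the rules it encodes (older timestamp rejected as out of order, same timestamp with a different value rejected as duplicate) reflect my understanding of the Prometheus TSDB append checks when out-of-order ingestion is disabled.

```python
class ToySeries:
    """One time series that only accepts samples in timestamp order."""

    def __init__(self):
        self.last_ts = None
        self.last_value = None

    def append(self, ts_ms, value):
        """Store the sample and return None, or return the rejection reason."""
        if self.last_ts is not None:
            if ts_ms < self.last_ts:
                return "out of order sample"
            if ts_ms == self.last_ts:
                if value != self.last_value:
                    return "duplicate sample for timestamp"
                return None  # identical repeat, nothing new to store
        self.last_ts, self.last_value = ts_ms, value
        return None


# Hypothetical timestamps: a delayed evaluation arrives after a newer one.
series = ToySeries()
results = [
    series.append(1000, 1.0),  # on-time evaluation is stored
    series.append(900, 1.0),   # late write with an older timestamp
    series.append(1000, 2.0),  # same timestamp, different value
    series.append(1000, 1.0),  # exact repeat of the stored sample
]
print(results)
```

In this model any second writer to the same series, or a slow evaluation that lands after a newer one, is enough to produce both errors. Whether that is what happens for ALERTS and ALERTS_FOR_STATE here remains open.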
Sadly, this is still happening:
08:20 <+jinxer-wm> FIRING: SLOMetricAbsent: varnish-combined ulsfo - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
08:21 <+jinxer-wm> FIRING: SLOMetricAbsent: varnish-combined drmrs - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
08:23 <+jinxer-wm> FIRING: SLOMetricAbsent: haproxy-combined <no value> - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
08:26 <+jinxer-wm> FIRING: [2x] SLOMetricAbsent: varnish-combined drmrs - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
08:28 <+jinxer-wm> FIRING: [2x] SLOMetricAbsent: haproxy-combined <no value> - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
08:30 <+jinxer-wm> RESOLVED: SLOMetricAbsent: varnish-combined ulsfo - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
08:31 <+jinxer-wm> RESOLVED: [2x] SLOMetricAbsent: varnish-combined drmrs - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
08:33 <+jinxer-wm> RESOLVED: [2x] SLOMetricAbsent: haproxy-combined <no value> - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
I'm seeing this occur with the search team as well.
Linking with T383570, since these alerts are evaluated by thanos-rule, which depends on thanos-query.
While SLOMetricAbsent can fire when thanos-query is in trouble, I doubt thanos-query troubles account for periodic failures, e.g. https://logstash.wikimedia.org/goto/f1cd810606254d69fb57ab369179612a