
Search update lag SLOs rule evaluation failure
Closed, ResolvedPublic

Description

From root@titan1001:~# journalctl -u thanos-rule --since -1h | less

Mar 04 08:02:02 titan1001 thanos-rule[2415701]: level=warn ts=2025-03-04T08:02:02.313620275Z caller=manager.go:639 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/search-update-lag-codfw.yaml group=search-update-lag name=ErrorBudgetBurn index=10 msg="Evaluating rule failed" rule="alert: ErrorBudgetBurn\nexpr: search_sli_update_lag:bool:burnrate18h{prometheus=\"k8s\",slo=\"search-update-lag\"}\n  > (1 * (1 - 0.95)) and search_sli_update_lag:bool:burnrate12d{prometheus=\"k8s\",slo=\"search-update-lag\"}\n  > (1 * (1 - 0.95))\nfor: 9h\nlabels:\n  exhaustion: 12w\n  long: 12d\n  prometheus: k8s\n  service: search\n  severity: warning\n  short: 18h\n  site: codfw\n  slo: search-update-lag\n  team: search\n" err="vector contains metrics with the same labelset after applying alert labels"

This is the alert/rule from the file:

- alert: ErrorBudgetBurn
  expr: search_sli_update_lag:bool:burnrate18h{prometheus="k8s",slo="search-update-lag"} > (1 * (1-0.95)) and search_sli_update_lag:bool:burnrate12d{prometheus="k8s",slo="search-update-lag"} > (1 * (1-0.95))
  for: 9h
  labels:
    exhaustion: 12w
    long: 12d
    prometheus: k8s
    service: search
    severity: warning
    short: 18h
    site: eqiad
    slo: search-update-lag
    team: search

And the history of these rules failing (from https://grafana.wikimedia.org/goto/EfmQXxtHg?orgId=1):

[Screenshot attachment: 2025-03-04-091353_3753x1572_scrot.png (158 KB) — history of the rule evaluation failures]

Indeed, running the rule's expression against thanos.w.o returns two results:

search_sli_update_lag:bool:burnrate18h{prometheus="k8s", recorder="thanos-rule", service="search", site="codfw", slo="search-update-lag", team="search"}
0.14912009879592467
search_sli_update_lag:bool:burnrate18h{prometheus="k8s", recorder="thanos-rule", service="search", site="eqiad", slo="search-update-lag", team="search"}
0.13833951380304904
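
In other words, because the burnrate expression has no site matcher, the codfw rule file matches both the eqiad and codfw series; once the alert's own site: codfw label is applied, it overwrites the differing site label and the two series end up with identical labelsets, which the rule evaluator rejects. A sketch of what the codfw expression would look like with a site matcher added (hypothetical; Pyrra generates these rules from the SLO definition, so the real fix belongs in the Puppet config below rather than in the rendered rule):

```
search_sli_update_lag:bool:burnrate18h{prometheus="k8s",slo="search-update-lag",site="codfw"}
  > (1 * (1 - 0.95))
and
search_sli_update_lag:bool:burnrate12d{prometheus="k8s",slo="search-update-lag",site="codfw"}
  > (1 * (1 - 0.95))
```

With the matcher in place each per-datacenter rule file selects exactly one series, so applying the alert labels can no longer produce duplicates.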

From reading the Puppet code, my understanding is that we should be filtering by site here too:

if $datacenter in ['eqiad', 'codfw'] {
    pyrra::filesystem::config { "search-update-lag-${datacenter}.yaml":
      content => to_yaml({
        'apiVersion' => 'pyrra.dev/v1alpha1',
        'kind'       => 'ServiceLevelObjective',
        'metadata'   => {
            'name'      => 'search-update-lag',
            'namespace' => 'pyrra-o11y',
            'labels'    => {
                'pyrra.dev/team'    => 'search',
                'pyrra.dev/service' => 'search',
                'pyrra.dev/site'    => "${datacenter}", #lint:ignore:only_variable_string
            },
        },
        'spec' => {
            'target'    => '95',
            'window'    => '12w',
            'indicator' => {
                'bool_gauge' => {
                    'metric' => "search_sli_update_lag:bool{job_name=~\"cirrus_streaming_updater_consumer_search_${datacenter}\", prometheus=\"k8s\"}",
                },
            },
        },
      }),
    }
}
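
For reference, a minimal sketch of the kind of change that would scope each SLO to its own datacenter — adding a site matcher to the indicator metric, so the burnrate rules Pyrra derives only match one site's series. This is an assumption based on the site label visible in the query results above; the actual change is in the Gerrit patch below:

```
'indicator' => {
    'bool_gauge' => {
        # Hypothetical fix: the extra site matcher keeps the derived
        # burnrate rules from selecting the other datacenter's series.
        'metric' => "search_sli_update_lag:bool{job_name=~\"cirrus_streaming_updater_consumer_search_${datacenter}\", prometheus=\"k8s\", site=\"${datacenter}\"}",
    },
},
```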

Event Timeline

Change #1124365 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] pyrra: limit search-update-lag to the correct site

https://gerrit.wikimedia.org/r/1124365

Change #1124365 merged by Herron:

[operations/puppet@production] pyrra: limit search-update-lag to the correct site

https://gerrit.wikimedia.org/r/1124365

fgiunchedi claimed this task.