From titan1001:
root@titan1001:~# journalctl -u thanos-rule --since -1h | less
Mar 04 08:02:02 titan1001 thanos-rule[2415701]: level=warn ts=2025-03-04T08:02:02.313620275Z caller=manager.go:639 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/pyrra/output-rules/search-update-lag-codfw.yaml group=search-update-lag name=ErrorBudgetBurn index=10 msg="Evaluating rule failed" rule="alert: ErrorBudgetBurn\nexpr: search_sli_update_lag:bool:burnrate18h{prometheus=\"k8s\",slo=\"search-update-lag\"}\n > (1 * (1 - 0.95)) and search_sli_update_lag:bool:burnrate12d{prometheus=\"k8s\",slo=\"search-update-lag\"}\n > (1 * (1 - 0.95))\nfor: 9h\nlabels:\n exhaustion: 12w\n long: 12d\n prometheus: k8s\n service: search\n severity: warning\n short: 18h\n site: codfw\n slo: search-update-lag\n team: search\n" err="vector contains metrics with the same labelset after applying alert labels"

This is the alert/rule from the file:
- alert: ErrorBudgetBurn
  expr: search_sli_update_lag:bool:burnrate18h{prometheus="k8s",slo="search-update-lag"} > (1 * (1-0.95)) and search_sli_update_lag:bool:burnrate12d{prometheus="k8s",slo="search-update-lag"} > (1 * (1-0.95))
  for: 9h
  labels:
    exhaustion: 12w
    long: 12d
    prometheus: k8s
    service: search
    severity: warning
    short: 18h
    site: eqiad
    slo: search-update-lag
    team: search

The history of these rules failing is visible at https://grafana.wikimedia.org/goto/EfmQXxtHg?orgId=1.
Indeed, running the rule from thanos.w.o returns two results:
search_sli_update_lag:bool:burnrate18h{prometheus="k8s", recorder="thanos-rule", service="search", site="codfw", slo="search-update-lag", team="search"}
0.14912009879592467
search_sli_update_lag:bool:burnrate18h{prometheus="k8s", recorder="thanos-rule", service="search", site="eqiad", slo="search-update-lag", team="search"}
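0.13833951380304904

The two series differ only in their site label, which the alert's own site label overwrites, hence the duplicate labelset. From reading the puppet code, my understanding is that we should be filtering by site here too: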
if $datacenter in ['eqiad', 'codfw'] {
    pyrra::filesystem::config { "search-update-lag-${datacenter}.yaml":
        content => to_yaml({
            'apiVersion' => 'pyrra.dev/v1alpha1',
            'kind'       => 'ServiceLevelObjective',
            'metadata'   => {
                'name'      => 'search-update-lag',
                'namespace' => 'pyrra-o11y',
                'labels'    => {
                    'pyrra.dev/team'    => 'search',
                    'pyrra.dev/service' => 'search',
                    'pyrra.dev/site'    => "${datacenter}", #lint:ignore:only_variable_string
                },
            },
            'spec'       => {
                'target'    => '95',
                'window'    => '12w',
                'indicator' => {
                    'bool_gauge' => {
                        'metric' => "search_sli_update_lag:bool{job_name=~\"cirrus_streaming_updater_consumer_search_${datacenter}\", prometheus=\"k8s\"}",
                    },
                },
            },
        })
    }
}
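
To illustrate why two results lead to the error above, here is a minimal sketch in Python. It is a toy model, not Thanos or Prometheus code: plain dicts stand in for series labels, and the helper names (apply_alert_labels, has_duplicate_labelsets, select) are hypothetical. It assumes that alert labels overwrite conflicting series labels, as the Prometheus alerting rules documentation describes, and uses the label values from the log line and query results above.

```python
# Illustrative model only: plain dicts stand in for Prometheus series labels.

# The two series returned by the burnrate18h query above.
series = [
    {"prometheus": "k8s", "recorder": "thanos-rule", "service": "search",
     "site": "codfw", "slo": "search-update-lag", "team": "search"},
    {"prometheus": "k8s", "recorder": "thanos-rule", "service": "search",
     "site": "eqiad", "slo": "search-update-lag", "team": "search"},
]

# Static alert labels as they appear in the failing rule from the log line.
alert_labels = {
    "exhaustion": "12w", "long": "12d", "prometheus": "k8s",
    "service": "search", "severity": "warning", "short": "18h",
    "site": "codfw", "slo": "search-update-lag", "team": "search",
}


def apply_alert_labels(series_labels, extra):
    """Return a copy of the series labels with alert labels overriding conflicts."""
    merged = dict(series_labels)
    merged.update(extra)
    return merged


def has_duplicate_labelsets(labelsets):
    """Return True if any two labelsets are identical."""
    seen = set()
    for labels in labelsets:
        key = frozenset(labels.items())
        if key in seen:
            return True
        seen.add(key)
    return False


def select(series_list, **matchers):
    """Hypothetical equality-only label matcher, e.g. select(series, site="codfw")."""
    return [s for s in series_list
            if all(s.get(k) == v for k, v in matchers.items())]


# Without a site matcher, both series end up with site="codfw" and collide.
unfiltered = [apply_alert_labels(s, alert_labels) for s in series]
# With a site matcher, only one series remains, so no collision is possible.
filtered = [apply_alert_labels(s, alert_labels)
            for s in select(series, site="codfw")]

print(has_duplicate_labelsets(unfiltered))  # True
print(has_duplicate_labelsets(filtered))    # False
```

In this model, restricting the input to a single site before the alert labels are applied is enough to avoid the duplicate labelset. Whether that matcher belongs in the Pyrra indicator metric or elsewhere in the generated rules is left open here.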