Prometheus rule evaluation failure
Closed, Resolved, Public

Description

Today titan2001 briefly alerted for rule evaluation failures (the alert cleared by itself after a few minutes):

{cluster="titan", instance="titan2001:17902", job="thanos-rule", prometheus="ops", rule_group="/srv/thanos-rule/.tmp-rules/ABORT/etc/thanos-rule/rules/recording_rules.yaml;service_slis", site="codfw", strategy="abort"}

Looking into the logs, thanos-rule on titan2001 was reporting "too many open files" errors:

dial tcp: lookup thanos-swift.discovery.wmnet on 10.3.0.1:53: dial udp 10.3.0.1:53: socket: too many open files

Full log messages:

Sep 20 17:02:43 titan2001 thanos-rule[210150]: level=error ts=2023-09-20T17:02:43.840482127Z caller=rule.go:833 component=rules err="read query instant response: expected 2xx response, got 422. Body: {\"status\":\"error\",\"errorType\":\"execution\",\"error\":\"expanding series: proxy Series(): rpc error: code = Aborted desc = receive series from Addr: localhost:11901 LabelSets: {prometheus=\\\"analytics\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"analytics\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"analytics\\\", replica=\\\"b\\\", site=\\\"codfw\\\"},{prometheus=\\\"analytics\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"aux-k8s\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"aux-k8s\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"ext\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"ext\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"ext\\\", replica=\\\"b\\\", site=\\\"codfw\\\"},{prometheus=\\\"ext\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"k8s\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s\\\", replica=\\\"b\\\", site=\\\"codfw\\\"},{prometheus=\\\"k8s\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-aux\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-aux\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-dse\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-dse\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-mlserve\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"k8s-mlserve\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-mlserve\\\", replica=\\\"b\\\", site=\\\"codfw\\\"},{prometheus=\\\"k8s-mlserve\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-mlstaging\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"k8s-mlstaging\\\", replica=\\\"b\\\", 
site=\\\"codfw\\\"},{prometheus=\\\"k8s-staging\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"k8s-staging\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-staging\\\", replica=\\\"b\\\", site=\\\"codfw\\\"},{prometheus=\\\"k8s-staging\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"ops\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"ops\\\", replica=\\\"a\\\", site=\\\"drmrs\\\"},{prometheus=\\\"ops\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"ops\\\", replica=\\\"a\\\", site=\\\"eqsin\\\"},{prometheus=\\\"ops\\\", replica=\\\"a\\\", site=\\\"esams\\\"},{prometheus=\\\"ops\\\", replica=\\\"a\\\", site=\\\"ulsfo\\\"},{prometheus=\\\"ops\\\", replica=\\\"b\\\", site=\\\"codfw\\\"},{prometheus=\\\"ops\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"ops\\\", replica=\\\"b\\\", site=\\\"eqsin\\\"},{prometheus=\\\"ops\\\", replica=\\\"b\\\", site=\\\"esams\\\"},{prometheus=\\\"ops\\\", replica=\\\"b\\\", site=\\\"ulsfo\\\"},{prometheus=\\\"ops\\\", replica=\\\"unset\\\", site=\\\"drmrs\\\"},{prometheus=\\\"ops\\\", replica=\\\"unset\\\", site=\\\"eqsin\\\"},{prometheus=\\\"ops\\\", replica=\\\"unset\\\", site=\\\"esams\\\"},{prometheus=\\\"ops\\\", replica=\\\"unset\\\", site=\\\"ulsfo\\\"},{prometheus=\\\"services\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"services\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"services\\\", replica=\\\"b\\\", site=\\\"codfw\\\"},{prometheus=\\\"services\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{replica=\\\"a\\\"},{replica=\\\"b\\\"} Mint: 1591833600000 Maxt: 1693933343962: rpc error: code = Unknown desc = receive series from 01H3HCD9JHDMW8JNZ4DQVV0KQT: load chunks: get range reader: Get \\\"https://thanos-swift.discovery.wmnet/thanos/01H3HCD9JHDMW8JNZ4DQVV0KQT/chunks/000106\\\": dial tcp: lookup thanos-swift.discovery.wmnet on 10.3.0.1:53: dial udp 10.3.0.1:53: socket: too many open files\"}\n" query="100 * sum by (cluster, 
site) (increase(etcd_http_failed_total{code=~\"5..\"}[92d])) / sum by (cluster, site) (increase(etcd_http_received_total[92d]))"
Sep 20 17:02:43 titan2001 thanos-rule[210150]: level=warn ts=2023-09-20T17:02:43.840578249Z caller=manager.go:639 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/thanos-rule/rules/recording_rules.yaml group=service_slis name=cluster_site:sli_etcd_http_error_ratio:increase92d index=2 msg="Evaluating rule failed" rule="record: cluster_site:sli_etcd_http_error_ratio:increase92d\nexpr: 100 * sum by (cluster, site) (increase(etcd_http_failed_total{code=~\"5..\"}[92d]))\n  / sum by (cluster, site) (increase(etcd_http_received_total[92d]))\n" err="no query API server reachable"
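
The log lines above are in logfmt (`key=value` pairs, with quoted values where needed), so the failing rule group, rule name and error can be pulled out mechanically. Below is a minimal sketch using Python's standard `shlex` module as a stand-in for a real logfmt parser; that substitution is an assumption, and escape sequences such as `\n` inside values are left as-is. The helper name `parse_logfmt` is hypothetical, and the sample is the second log line with its `file=` and `rule=` fields omitted for brevity.

```python
import shlex


def parse_logfmt(line: str) -> dict[str, str]:
    """Split a logfmt-style line into a dict; tokens without '=' are skipped."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# Shortened copy of the "Evaluating rule failed" line (file= and rule= omitted).
sample = (
    "Sep 20 17:02:43 titan2001 thanos-rule[210150]: "
    "level=warn ts=2023-09-20T17:02:43.840578249Z caller=manager.go:639 "
    "component=rules group=service_slis "
    "name=cluster_site:sli_etcd_http_error_ratio:increase92d index=2 "
    'msg="Evaluating rule failed" err="no query API server reachable"'
)

fields = parse_logfmt(sample)
print(fields["group"], fields["name"], fields["err"], sep=" | ")
```

The syslog prefix (timestamp, host, unit) contains no `=` and is therefore ignored by the sketch.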

Event Timeline

Change 959674 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: bump max open files for query/rule/compact

https://gerrit.wikimedia.org/r/959674

I had a quick look at the current file limits for thanos-rule on titan2001; the current limit is 524288 (roughly 524k):

titan2001:~# cat /proc/$(pgrep -f "thanos rule")/limits | egrep '(Limit|files)'
Limit                     Soft Limit           Hard Limit           Units
Max open files            524288               524288               files

Current usage is low:

titan2001:~# ls /proc/$(pgrep -f "thanos rule")/fd | wc -l
82

Unfortunately I didn't capture utilization while the alert was firing. If it fires again I'll try to capture it.
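
To make that capture easier next time, the two checks above (the limit from `/proc/<pid>/limits` and the descriptor count from `/proc/<pid>/fd`) could be combined into one helper and sampled periodically. This is only a sketch under assumptions: the function names are hypothetical, it is Linux-only, and it needs enough privileges to read the target process's `/proc` entries.

```python
import os
import re


def parse_nofile_limits(limits_text: str) -> tuple[int, int]:
    """Return (soft, hard) 'Max open files' from /proc/<pid>/limits content."""
    match = re.search(r"^Max open files\s+(\d+)\s+(\d+)", limits_text, re.MULTILINE)
    if match is None:
        # Also raised when the limit is reported as "unlimited".
        raise ValueError("no numeric 'Max open files' line found")
    return int(match.group(1)), int(match.group(2))


def fd_usage(pid: int) -> tuple[int, int]:
    """Return (open_fds, soft_limit) for a PID by reading /proc."""
    with open(f"/proc/{pid}/limits") as f:
        soft, _hard = parse_nofile_limits(f.read())
    return len(os.listdir(f"/proc/{pid}/fd")), soft


# The limits output shown above, used here as sample input.
sample = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            524288               524288               files\n"
)
soft, hard = parse_nofile_limits(sample)
print(f"soft={soft} hard={hard}")
```

Calling `fd_usage(pid)` in a loop with a timestamp while the alert is firing would show whether the descriptor count actually approaches the soft limit.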

Change 959674 abandoned by Filippo Giunchedi:

[operations/puppet@production] thanos: bump max open files for query/rule/compact

Reason:

I don't think this is the issue

root@titan1001:~# systemctl show thanos-rule | grep -i limitnofile
LimitNOFILE=524288
LimitNOFILESoft=1024
root@titan1001:~# systemctl show thanos-query | grep -i limitnofile
LimitNOFILE=524288
LimitNOFILESoft=1024
root@titan1001:~# systemctl show thanos-compact | grep -i limitnofile
LimitNOFILE=524288
LimitNOFILESoft=1024

https://gerrit.wikimedia.org/r/959674

Change 960008 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: bump store max open files

https://gerrit.wikimedia.org/r/960008

The error message isn't reported by thanos-rule itself but by thanos-store, which thanos-rule talks to; in other words, the error message is nested.
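
The nesting is easier to see when the error string is split at each `rpc error: code = ... desc = ` wrapper. The sketch below does that for an abbreviated copy of the error from the description (label sets, time range and the Swift URL are elided); the helper name `split_rpc_layers` is hypothetical, and which service produced each layer is inferred from the comment above rather than stated in the log.

```python
import re

# Each gRPC hop wraps the underlying error with this prefix.
RPC_LAYER = re.compile(r"rpc error: code = \w+ desc = ")


def split_rpc_layers(err: str) -> list[str]:
    """Split an error string at each 'rpc error' wrapper, outermost first."""
    parts = (part.strip(" :") for part in RPC_LAYER.split(err))
    return [part for part in parts if part]


# Abbreviated from the logged error; elided parts are not reproduced.
err = (
    "expanding series: proxy Series(): rpc error: code = Aborted desc = "
    "receive series from Addr: localhost:11901: rpc error: code = Unknown desc = "
    "receive series from 01H3HCD9JHDMW8JNZ4DQVV0KQT: load chunks: get range reader: "
    "dial udp 10.3.0.1:53: socket: too many open files"
)

layers = split_rpc_layers(err)
for depth, layer in enumerate(layers):
    print("  " * depth + layer)
```

The innermost layer is where "too many open files" originates, which is why the limit that matters is the one on the process serving that innermost request rather than on thanos-rule.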

Change 960008 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: don't manage limitnofile for thanos-store

https://gerrit.wikimedia.org/r/960008

Change deployed. We'll stand by and see whether thanos still reports evaluation failures. Note that prometheus itself has experienced some as well, but that is a completely different issue, tracked in T347167: Temporary prometheus alert evaluation failures on host role change

fgiunchedi claimed this task.

No recurrence, resolving.