Today titan2001 alerted briefly for rule evaluation failures (the alert cleared by itself after a few minutes):
{cluster="titan", instance="titan2001:17902", job="thanos-rule", prometheus="ops", rule_group="/srv/thanos-rule/.tmp-rules/ABORT/etc/thanos-rule/rules/recording_rules.yaml;service_slis", site="codfw", strategy="abort"}
Looking at the logs, thanos-rule on titan2001 was reporting "too many open files" errors:
dial tcp: lookup thanos-swift.discovery.wmnet on 10.3.0.1:53: dial udp 10.3.0.1:53: socket: too many open files
Full log messages:
Sep 20 17:02:43 titan2001 thanos-rule[210150]: level=error ts=2023-09-20T17:02:43.840482127Z caller=rule.go:833 component=rules err="read query instant response: expected 2xx response, got 422. Body: {\"status\":\"error\",\"errorType\":\"execution\",\"error\":\"expanding series: proxy Series(): rpc error: code = Aborted desc = receive series from Addr: localhost:11901 LabelSets: {prometheus=\\\"analytics\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"analytics\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"analytics\\\", replica=\\\"b\\\", site=\\\"codfw\\\"},{prometheus=\\\"analytics\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"aux-k8s\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"aux-k8s\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"ext\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"ext\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"ext\\\", replica=\\\"b\\\", site=\\\"codfw\\\"},{prometheus=\\\"ext\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"k8s\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s\\\", replica=\\\"b\\\", site=\\\"codfw\\\"},{prometheus=\\\"k8s\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-aux\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-aux\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-dse\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-dse\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-mlserve\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"k8s-mlserve\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-mlserve\\\", replica=\\\"b\\\", site=\\\"codfw\\\"},{prometheus=\\\"k8s-mlserve\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-mlstaging\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"k8s-mlstaging\\\", replica=\\\"b\\\", 
site=\\\"codfw\\\"},{prometheus=\\\"k8s-staging\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"k8s-staging\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"k8s-staging\\\", replica=\\\"b\\\", site=\\\"codfw\\\"},{prometheus=\\\"k8s-staging\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"ops\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"ops\\\", replica=\\\"a\\\", site=\\\"drmrs\\\"},{prometheus=\\\"ops\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"ops\\\", replica=\\\"a\\\", site=\\\"eqsin\\\"},{prometheus=\\\"ops\\\", replica=\\\"a\\\", site=\\\"esams\\\"},{prometheus=\\\"ops\\\", replica=\\\"a\\\", site=\\\"ulsfo\\\"},{prometheus=\\\"ops\\\", replica=\\\"b\\\", site=\\\"codfw\\\"},{prometheus=\\\"ops\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{prometheus=\\\"ops\\\", replica=\\\"b\\\", site=\\\"eqsin\\\"},{prometheus=\\\"ops\\\", replica=\\\"b\\\", site=\\\"esams\\\"},{prometheus=\\\"ops\\\", replica=\\\"b\\\", site=\\\"ulsfo\\\"},{prometheus=\\\"ops\\\", replica=\\\"unset\\\", site=\\\"drmrs\\\"},{prometheus=\\\"ops\\\", replica=\\\"unset\\\", site=\\\"eqsin\\\"},{prometheus=\\\"ops\\\", replica=\\\"unset\\\", site=\\\"esams\\\"},{prometheus=\\\"ops\\\", replica=\\\"unset\\\", site=\\\"ulsfo\\\"},{prometheus=\\\"services\\\", replica=\\\"a\\\", site=\\\"codfw\\\"},{prometheus=\\\"services\\\", replica=\\\"a\\\", site=\\\"eqiad\\\"},{prometheus=\\\"services\\\", replica=\\\"b\\\", site=\\\"codfw\\\"},{prometheus=\\\"services\\\", replica=\\\"b\\\", site=\\\"eqiad\\\"},{replica=\\\"a\\\"},{replica=\\\"b\\\"} Mint: 1591833600000 Maxt: 1693933343962: rpc error: code = Unknown desc = receive series from 01H3HCD9JHDMW8JNZ4DQVV0KQT: load chunks: get range reader: Get \\\"https://thanos-swift.discovery.wmnet/thanos/01H3HCD9JHDMW8JNZ4DQVV0KQT/chunks/000106\\\": dial tcp: lookup thanos-swift.discovery.wmnet on 10.3.0.1:53: dial udp 10.3.0.1:53: socket: too many open files\"}\n" query="100 * sum by (cluster, 
site) (increase(etcd_http_failed_total{code=~\"5..\"}[92d])) / sum by (cluster, site) (increase(etcd_http_received_total[92d]))"
Sep 20 17:02:43 titan2001 thanos-rule[210150]: level=warn ts=2023-09-20T17:02:43.840578249Z caller=manager.go:639 component=rules file=/srv/thanos-rule/.tmp-rules/ABORT/etc/thanos-rule/rules/recording_rules.yaml group=service_slis name=cluster_site:sli_etcd_http_error_ratio:increase92d index=2 msg="Evaluating rule failed" rule="record: cluster_site:sli_etcd_http_error_ratio:increase92d\nexpr: 100 * sum by (cluster, site) (increase(etcd_http_failed_total{code=~\"5..\"}[92d]))\n / sum by (cluster, site) (increase(etcd_http_received_total[92d]))\n" err="no query API server reachable"
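
To check how close a process is to its file descriptor limit, something along these lines could be run on the host. This is a minimal sketch, assuming a Linux /proc filesystem; the helper names parse_max_open_files and fd_usage are hypothetical, and it has not been run against the Thanos processes on titan2001.

```python
import os


def _to_int(value):
    # /proc/<pid>/limits prints "unlimited" when no limit is set.
    return None if value == "unlimited" else int(value)


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Remaining columns are: soft limit, hard limit, units.
            soft, hard = line[len("Max open files"):].split()[:2]
            return (_to_int(soft), _to_int(hard))
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid):
    """Return (open_fd_count, soft_limit) for the given PID.

    Assumption: the caller may read /proc/<pid>/fd, which usually
    means running as root or as the user that owns the process.
    """
    with open(f"/proc/{pid}/limits") as f:
        soft, _hard = parse_max_open_files(f.read())
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    return open_fds, soft
```

For example, fd_usage(210150) would report the figures for the thanos-rule PID seen in the log lines above, while that process is still running. The same check can be pointed at any other process suspected of exhausting its descriptors.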