
some Prometheis not scraping the full set of targets
Closed, Resolved · Public

Description

I noticed this because Icinga was showing an UNKNOWN for the excessive RX traffic check on lvs2007, where the Prometheus query was evaluating to NaN. This was because the check query happened to hit prometheus2004 several times in a row.

✔️ cdanis@icinga1001.wikimedia.org ~ 🕦☕ /usr/lib/nagios/plugins/check_prometheus_metric.py --url 'http://prometheus2003.codfw.wmnet/ops' -w '1600' -c '3200' -m 'ge' '(sum by (instance) (rate(node_network_receive_bytes_total{instance=~"lvs.*",device!~"lo"}[5m]))) * 8 / 1024 / 1024' --debug
DEBUG:__main__:Running '(sum by (instance) (rate(node_network_receive_bytes_total{instance=~"lvs.*",device!~"lo"}[5m]))) * 8 / 1024 / 1024' on 'http://prometheus2003.codfw.wmnet/ops/api/v1/query'
DEBUG:__main__:Checking vector data for [{'metric': {'instance': 'lvs2007:9100'}, 'value': [1583292619.18, '54.86336437861125']}, {'metric': {'instance': 'lvs2008:9100'}, 'value': [1583292619.18, '49.30169935226441']}, {'metric': {'instance': 'lvs2009:9100'}, 'value': [1583292619.18, '162.0370532353719']}, {'metric': {'instance': 'lvs2010:9100'}, 'value': [1583292619.18, '18.21561772028605']}]
All metrics within thresholds.

vs

❌ cdanis@icinga1001.wikimedia.org ~ 🕦☕ /usr/lib/nagios/plugins/check_prometheus_metric.py --url 'http://prometheus2004.codfw.wmnet/ops' -w '1600' -c '3200' -m 'ge' '(sum by (instance) (rate(node_network_receive_bytes_total{instance=~"lvs.*",device!~"lo"}[5m]))) * 8 / 1024 / 1024' --debug
DEBUG:__main__:Running '(sum by (instance) (rate(node_network_receive_bytes_total{instance=~"lvs.*",device!~"lo"}[5m]))) * 8 / 1024 / 1024' on 'http://prometheus2004.codfw.wmnet/ops/api/v1/query'
DEBUG:__main__:Checking vector data for [{'metric': {'instance': 'lvs2009:9100'}, 'value': [1583292646.276, '162.03705183664957']}, {'metric': {'instance': 'lvs2010:9100'}, 'value': [1583292646.276, '18.30215384165446']}]
All metrics within thresholds.

I tried curling lvs2007:9100 directly from both prom servers and that worked fine. Nothing obvious in the logs either.
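
For reference, this is roughly the manual check I mean (a minimal sketch, run from each Prometheus host; node-exporter serves its metrics on :9100/metrics):

curl -s --max-time 5 http://lvs2007.codfw.wmnet:9100/metrics | head -n 3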

Event Timeline

CDanis triaged this task as High priority. Mar 4 2020, 3:39 AM

Target definitions as expanded by puppet are identical on both servers too. Very weird.

✔️ root@prometheus2004.codfw.wmnet /srv/prometheus/ops/targets 🕦⁉️ grep lvs *
node_site_codfw.yaml: cluster: lvs
node_site_codfw.yaml: - lvs2007:9100
node_site_codfw.yaml: - lvs2008:9100
node_site_codfw.yaml: - lvs2009:9100
node_site_codfw.yaml: - lvs2010:9100
pybal_codfw.yaml: cluster: lvs
pybal_codfw.yaml: - lvs2007:9090
pybal_codfw.yaml: - lvs2008:9090
pybal_codfw.yaml: - lvs2009:9090
pybal_codfw.yaml: - lvs2010:9090
rsyslog_codfw.yaml: cluster: lvs
rsyslog_codfw.yaml: - lvs2007:9105
rsyslog_codfw.yaml: - lvs2008:9105
rsyslog_codfw.yaml: - lvs2009:9105
rsyslog_codfw.yaml: - lvs2010:9105

CDanis claimed this task.

netstat confirmed that prom2004 wasn't even trying to open connections to the missing targets:

✔️ cdanis@prometheus2003.codfw.wmnet ~ 🕝☕ sudo netstat -tp | grep lvs | sort -k5
tcp 0 0 prometheus2003.co:56420 lvs2007.codfw.wmne:9090 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:40660 lvs2007.codfw.wmne:9100 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:43714 lvs2007.codfw.wmne:9105 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:56678 lvs2008.codfw.wmne:9090 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:40930 lvs2008.codfw.wmne:9100 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:60888 lvs2008.codfw.wmne:9105 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:43072 lvs2009.codfw.wmne:9090 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:59828 lvs2009.codfw.wmne:9100 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:37342 lvs2009.codfw.wmne:9105 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:58838 lvs2010.codfw.wmne:9090 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:36782 lvs2010.codfw.wmne:9100 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:45408 lvs2010.codfw.wmne:9105 ESTABLISHED 16543/prometheus


✔️ cdanis@prometheus2004.codfw.wmnet ~ 🕝☕ sudo netstat -tp | grep lvs | sort -k5
tcp 0 0 prometheus2004.co:44206 lvs2007.codfw.wmne:9090 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:53556 lvs2007.codfw.wmne:9105 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:56142 lvs2008.codfw.wmne:9090 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:33608 lvs2008.codfw.wmne:9105 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:54054 lvs2009.codfw.wmne:9090 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:54408 lvs2009.codfw.wmne:9100 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:56596 lvs2009.codfw.wmne:9105 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:56504 lvs2010.codfw.wmne:9090 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:37396 lvs2010.codfw.wmne:9100 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:55024 lvs2010.codfw.wmne:9105 ESTABLISHED 1877/prometheus

A restart fixed it.

I have no idea.

tcpdump & netstat confirmed that prometheus2004 wasn't even trying to connect to lvs2007:9100 or lvs2008:9100; a restart has fixed it.

Oh, also NB that a simple reload was not sufficient to fix it (I tried that first).

Curious indeed, thanks for investigating! I took a quick look and it seems prometheus2004 didn't even know about lvs2007 before today at ~06:23, so I suspect that watching the targets files failed in some way, for some reason. I can't find any smoking gun in Prometheus metrics at the moment that we could check for alerting, although we should be updating to newer Prometheus releases (e.g. 2.15).

I am investigating a similar problem with some new elasticsearch nodes (like elastic2055). If I create an SSH tunnel to prometheus2003 and 2004, I can see the metrics showing up in the web UI of 2004 but not of 2003.

CDanis removed CDanis as the assignee of this task.

I'll take a look at this tomorrow during my day, unless @fgiunchedi beats me to it.

The issue went away by itself after mentioning it to Chris and Filippo :)

I took a look at this on both prometheus200[34] for up{instance=~"elastic2055.*9108"}: the metric appears yesterday on 2004 at 9:44 and on 2003 at 17:48, whereas both target files (/srv/prometheus/ops/targets/elasticsearch_codfw.yaml) were last modified at 10:10. So 2003 has definitely been lagging behind and only eventually discovered the target, for reasons yet to be determined.

There was a reload of prometheus@ops on 2003 at 17:41.

CDanis renamed this task from prometheus2004 not scraping lvs2007 & lvs2008 to some Prometheis not scraping the full set of targets. Mar 10 2020, 1:46 PM

I think the following shell one-liner could be turned into an Icinga check for this condition pretty easily:

✔️ cdanis@bast2002.wikimedia.org ~ 🕤☕ diff -u <(curl -s 'http://prometheus2003/ops/api/v1/query?query=up' | jq -r '.data.result[].metric.instance' | sort) <(curl -s 'http://prometheus2004/ops/api/v1/query?query=up' | jq -r '.data.result[].metric.instance' | sort)
✔️ cdanis@bast2002.wikimedia.org ~ 🕤☕ diff -u <(curl -s 'http://prometheus1003/ops/api/v1/query?query=up' | jq -r '.data.result[].metric.instance' | sort) <(curl -s 'http://prometheus1004/ops/api/v1/query?query=up' | jq -r '.data.result[].metric.instance' | sort)
--- /proc/self/fd/11 2020-03-10 13:41:24.488662498 +0000
+++ /proc/self/fd/12 2020-03-10 13:41:24.488662498 +0000
@@ -1464,9 +1464,13 @@
 logstash1025:9198
 logstash1025:9631
 logstash1026:9100
+logstash1026:9105
 logstash1027:9100
+logstash1027:9105
 logstash1028:9100
+logstash1028:9105
 logstash1029:9100
+logstash1029:9105
 lvs1013:9090
 lvs1013:9100
 lvs1013:9105
@@ -3825,8 +3829,11 @@
 wdqs1010:9193
 wdqs1010:9194
 wdqs1011:9100
+wdqs1011:9105
 wdqs1012:9100
+wdqs1012:9105
 wdqs1013:9100
+wdqs1013:9105
 weblog1001:9100
 weblog1001:9105
 webperf1001:9100
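
A rough sketch of what that check wrapper could look like (a hypothetical script, not an existing plugin; it assumes the /ops path and takes the two Prometheus hosts as arguments):

#!/bin/bash
# Hypothetical sketch: compare the set of 'up' instances between two Prometheus
# replicas and exit with Nagios-style status codes.
set -euo pipefail
A="$1"   # e.g. prometheus1003
B="$2"   # e.g. prometheus1004

instances() {
    curl -s "http://$1/ops/api/v1/query?query=up" \
        | jq -r '.data.result[].metric.instance' | sort
}

if diff_out=$(diff -u <(instances "$A") <(instances "$B")); then
    echo "OK: $A and $B scrape identical target sets"
    exit 0
else
    # count added/removed target lines, skipping the ---/+++ headers
    skew=$(grep -c '^[+-][^+-]' <<<"$diff_out" || true)
    echo "CRITICAL: $skew target(s) differ between $A and $B"
    exit 2
fi

In practice it would also want UNKNOWN (exit 3) when either query fails, and maybe some tolerance for hosts that are mid-provisioning, but that's the idea.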

25h later, there's additional skew between 1003 and 1004:

--- /proc/self/fd/13 2020-03-11 14:36:54.988930894 +0000
+++ /proc/self/fd/14 2020-03-11 14:36:54.988930894 +0000
@@ -1210,7 +1210,6 @@
 ganeti1007:9105
 ganeti1008:9100
 ganeti1008:9105
-ganeti1009:9105
 ganeti1010:9100
 ganeti1010:9105
 ganeti1011:9100
@@ -1463,9 +1462,13 @@
 logstash1025:9198
 logstash1025:9631
 logstash1026:9100
+logstash1026:9105
 logstash1027:9100
+logstash1027:9105
 logstash1028:9100
+logstash1028:9105
 logstash1029:9100
+logstash1029:9105
 lvs1013:9090
 lvs1013:9100
 lvs1013:9105
@@ -3735,7 +3738,6 @@
 stat1006:9105
 stat1007:9100
 stat1007:9105
-stat1008:9100
 thorium:9100
 thorium:9105
 thumbor1001:3903
@@ -3825,8 +3827,11 @@
 wdqs1010:9193
 wdqs1010:9194
 wdqs1011:9100
+wdqs1011:9105
 wdqs1012:9100
+wdqs1012:9105
 wdqs1013:9100
+wdqs1013:9105
 weblog1001:9100
 weblog1001:9105
 webperf1001:9100

2003/2004 are both presently identical.

As of this morning: 2003/2004 presently identical.

Skew has increased further on 1003/1004:

--- /proc/self/fd/13 2020-03-12 13:24:49.899324906 +0000
+++ /proc/self/fd/14 2020-03-12 13:24:49.899324906 +0000
@@ -1210,7 +1210,6 @@
 ganeti1007:9105
 ganeti1008:9100
 ganeti1008:9105
-ganeti1009:9105
 ganeti1010:9100
 ganeti1010:9105
 ganeti1011:9100
@@ -1242,7 +1241,6 @@
 gerrit1002:9100
 gerrit1002:9105
 gerrit.wikimedia.org:443
-grafana1001:9105
 grafana1002:9100
 grafana1002:9105
 grafana1002:9631
@@ -1462,9 +1460,13 @@
 logstash1025:9198
 logstash1025:9631
 logstash1026:9100
+logstash1026:9105
 logstash1027:9100
+logstash1027:9105
 logstash1028:9100
+logstash1028:9105
 logstash1029:9100
+logstash1029:9105
 lvs1013:9090
 lvs1013:9100
 lvs1013:9105
@@ -3727,7 +3729,9 @@
 sodium:9100
 sodium:9105
 sretest1001:9100
+sretest1001:9105
 sretest1002:9100
+sretest1002:9105
 stat1004:9100
 stat1004:9105
 stat1005:9100
@@ -3737,6 +3741,7 @@
 stat1007:9100
 stat1007:9105
 stat1008:9100
+stat1008:9105
 thorium:9100
 thorium:9105
 thumbor1001:3903
@@ -3826,8 +3831,11 @@
 wdqs1010:9193
 wdqs1010:9194
 wdqs1011:9100
+wdqs1011:9105
 wdqs1012:9100
+wdqs1012:9105
 wdqs1013:9100
+wdqs1013:9105
 weblog1001:9100
 weblog1001:9105
 webperf1001:9100

Aklapper renamed this task from some Prometheis not scraping the full set of targets to some Prometheus not scraping the full set of targets. Mar 12 2020, 2:46 PM
CDanis renamed this task from some Prometheus not scraping the full set of targets to some Prometheis not scraping the full set of targets. Mar 12 2020, 2:47 PM

@Aklapper: "After extensive research, it has been determined that the correct plural of 'Prometheus' is 'Prometheis'."
https://prometheus.io/docs/introduction/faq/#what-is-the-plural-of-prometheus

It appears a reload does resolve the issue, but it takes some time for Prometheus to fetch and store an update. I used kill -HUP <PID> to reload.
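
(For the record, the reload amounts to something like the following, assuming the prometheus@ops systemd unit:

sudo kill -HUP "$(systemctl show -p MainPID --value prometheus@ops)"

or, if the unit defines an ExecReload, simply sudo systemctl reload prometheus@ops.)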

The logs immediately following a reload seem to indicate the process cannot watch the config files for changes. Puppet does not issue a reload when target files are modified in any way.

# Issued HUP here
Mar 13 00:20:34 prometheus1003 prometheus@ops[20969]: level=info ts=2020-03-13T00:20:34.167231405Z caller=main.go:698 msg="Loading configuration file" filename=/srv/prometheus/ops/prometheus.yml
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=info ts=2020-03-13T00:20:38.818681106Z caller=main.go:725 msg="Completed loading of configuration file" filename=/srv/prometheus/ops/prometheus.yml
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.852335966Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860305321Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860310584Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860314889Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860316497Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860315009Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860302454Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860310984Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860309546Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860309539Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"

Nice catch in the logs!

I'm guessing we need to increase some of the inotify tunables. Probably this one:

✔️ cdanis@prometheus1003.eqiad.wmnet ~ 🕣🍺 cat /proc/sys/fs/inotify/max_user_watches
8192
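
For illustration, bumping it would look something like this (placeholder value and file name, not necessarily what we'll end up shipping via Puppet):

# inspect the current limits
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
# raise the watch limit at runtime
sudo sysctl -w fs.inotify.max_user_watches=32768
# and persist it across reboots
echo 'fs.inotify.max_user_watches = 32768' | sudo tee /etc/sysctl.d/70-prometheus-inotify.conf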

There are only so many files to be watched, but judging from lsof output, each prometheus thread holds an inotify handle on all of them, so it adds up quickly.

Less likely (but possible) is that we also need to increase fs.inotify.max_user_instances (currently 128).
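
A rough way to sanity-check that (hypothetical one-liner; counts the inotify descriptors, i.e. instances, currently held by the prometheus@ops main process):

sudo lsof -p "$(systemctl show -p MainPID --value prometheus@ops)" 2>/dev/null | grep -c inotify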

Change 579453 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] prom@ops: increase inotify watches in core DCs

https://gerrit.wikimedia.org/r/579453

Change 579453 merged by Cwhite:
[operations/puppet@production] prom@ops: increase inotify watches in core DCs

https://gerrit.wikimedia.org/r/579453

Change 579613 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: increase inotify instances in core DCs

https://gerrit.wikimedia.org/r/579613

Change 579613 merged by Cwhite:
[operations/puppet@production] profile: increase inotify instances in core DCs

https://gerrit.wikimedia.org/r/579613

Found a related issue.

We've upped max_user_instances and max_user_watches and the error logs haven't come back.

Let's check back on this in a week or so.

Krinkle renamed this task from some Prometheis not scraping the full set of targets to some Prometheus not scraping the full set of targets. Mar 13 2020, 7:55 PM
CDanis renamed this task from some Prometheus not scraping the full set of targets to some Prometheis not scraping the full set of targets. Mar 13 2020, 7:57 PM
CDanis added a subscriber: Krinkle.

Thanks... I guess that's bound to cause confusion, but I will try to be one of the people who knows and remembers.

No current differences in targets scraped between the proms in each cluster, and no hits for "too many open files" in their recent logs either.