
some Prometheis not scraping the full set of targets
Closed, Resolved · Public

Description

I noticed this because Icinga was showing an UNKNOWN for the excessive RX traffic check on lvs2007, where the Prometheus query was evaluating to NaN. This was because the check query happened to hit prometheus2004 several times in a row.

✔️ cdanis@icinga1001.wikimedia.org ~ 🕦☕ /usr/lib/nagios/plugins/check_prometheus_metric.py --url 'http://prometheus2003.codfw.wmnet/ops' -w '1600' -c '3200' -m 'ge' '(sum by (instance) (rate(node_network_receive_bytes_total{instance=~"lvs.*",device!~"lo"}[5m]))) * 8 / 1024 / 1024' --debug
DEBUG:__main__:Running '(sum by (instance) (rate(node_network_receive_bytes_total{instance=~"lvs.*",device!~"lo"}[5m]))) * 8 / 1024 / 1024' on 'http://prometheus2003.codfw.wmnet/ops/api/v1/query'
DEBUG:__main__:Checking vector data for [{'metric': {'instance': 'lvs2007:9100'}, 'value': [1583292619.18, '54.86336437861125']}, {'metric': {'instance': 'lvs2008:9100'}, 'value': [1583292619.18, '49.30169935226441']}, {'metric': {'instance': 'lvs2009:9100'}, 'value': [1583292619.18, '162.0370532353719']}, {'metric': {'instance': 'lvs2010:9100'}, 'value': [1583292619.18, '18.21561772028605']}]
All metrics within thresholds.

vs

❌ cdanis@icinga1001.wikimedia.org ~ 🕦☕ /usr/lib/nagios/plugins/check_prometheus_metric.py --url 'http://prometheus2004.codfw.wmnet/ops' -w '1600' -c '3200' -m 'ge' '(sum by (instance) (rate(node_network_receive_bytes_total{instance=~"lvs.*",device!~"lo"}[5m]))) * 8 / 1024 / 1024' --debug
DEBUG:__main__:Running '(sum by (instance) (rate(node_network_receive_bytes_total{instance=~"lvs.*",device!~"lo"}[5m]))) * 8 / 1024 / 1024' on 'http://prometheus2004.codfw.wmnet/ops/api/v1/query'
DEBUG:__main__:Checking vector data for [{'metric': {'instance': 'lvs2009:9100'}, 'value': [1583292646.276, '162.03705183664957']}, {'metric': {'instance': 'lvs2010:9100'}, 'value': [1583292646.276, '18.30215384165446']}]
All metrics within thresholds.

I tried curling lvs2007:9100 directly from both prom servers and that worked fine. Nothing obvious in the logs either.
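
For reference, this is roughly the manual check I mean (a minimal sketch, run from each Prometheus host; node-exporter serves its metrics on :9100/metrics):

curl -s --max-time 5 http://lvs2007.codfw.wmnet:9100/metrics | head -n 3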

Event Timeline

CDanis triaged this task as High priority. Mar 4 2020, 3:39 AM

Target definitions as expanded by puppet are identical on both servers too. Very weird.

✔️ root@prometheus2004.codfw.wmnet /srv/prometheus/ops/targets 🕦⁉️ grep lvs *
node_site_codfw.yaml: cluster: lvs
node_site_codfw.yaml: - lvs2007:9100
node_site_codfw.yaml: - lvs2008:9100
node_site_codfw.yaml: - lvs2009:9100
node_site_codfw.yaml: - lvs2010:9100
pybal_codfw.yaml: cluster: lvs
pybal_codfw.yaml: - lvs2007:9090
pybal_codfw.yaml: - lvs2008:9090
pybal_codfw.yaml: - lvs2009:9090
pybal_codfw.yaml: - lvs2010:9090
rsyslog_codfw.yaml: cluster: lvs
rsyslog_codfw.yaml: - lvs2007:9105
rsyslog_codfw.yaml: - lvs2008:9105
rsyslog_codfw.yaml: - lvs2009:9105
rsyslog_codfw.yaml: - lvs2010:9105

CDanis claimed this task.

netstat confirmed that prom2004 wasn't even trying to open connections to the missing targets:

✔️ cdanis@prometheus2003.codfw.wmnet ~ 🕝☕ sudo netstat -tp | grep lvs | sort -k5
tcp 0 0 prometheus2003.co:56420 lvs2007.codfw.wmne:9090 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:40660 lvs2007.codfw.wmne:9100 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:43714 lvs2007.codfw.wmne:9105 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:56678 lvs2008.codfw.wmne:9090 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:40930 lvs2008.codfw.wmne:9100 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:60888 lvs2008.codfw.wmne:9105 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:43072 lvs2009.codfw.wmne:9090 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:59828 lvs2009.codfw.wmne:9100 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:37342 lvs2009.codfw.wmne:9105 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:58838 lvs2010.codfw.wmne:9090 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:36782 lvs2010.codfw.wmne:9100 ESTABLISHED 16543/prometheus
tcp 0 0 prometheus2003.co:45408 lvs2010.codfw.wmne:9105 ESTABLISHED 16543/prometheus


✔️ cdanis@prometheus2004.codfw.wmnet ~ 🕝☕ sudo netstat -tp | grep lvs | sort -k5
tcp 0 0 prometheus2004.co:44206 lvs2007.codfw.wmne:9090 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:53556 lvs2007.codfw.wmne:9105 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:56142 lvs2008.codfw.wmne:9090 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:33608 lvs2008.codfw.wmne:9105 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:54054 lvs2009.codfw.wmne:9090 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:54408 lvs2009.codfw.wmne:9100 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:56596 lvs2009.codfw.wmne:9105 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:56504 lvs2010.codfw.wmne:9090 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:37396 lvs2010.codfw.wmne:9100 ESTABLISHED 1877/prometheus
tcp 0 0 prometheus2004.co:55024 lvs2010.codfw.wmne:9105 ESTABLISHED 1877/prometheus

A restart fixed it.

I have no idea.

tcpdump & netstat confirmed that prometheus2004 wasn't even trying to connect to lvs2007:9100 or lvs2008:9100; a restart has fixed it.

Oh, also NB that a simple reload was not sufficient to fix it (I tried that first).

Curious indeed, thanks for investigating! I took a quick look and it seems prometheus2004 didn't even know about lvs2007 before today at ~06:23, so I suspect that watching the targets files failed in some way, for some reason. I can't find any smoking gun in Prometheus metrics at the moment that we could check for alerting, although we should be updating to newer Prometheus releases (e.g. 2.15).

I am investigating a similar problem with some new elasticsearch nodes (like elastic2055). If I create an SSH tunnel to prometheus2003 and 2004, I can see the metrics showing up in the web UI of 2004 but not of 2003.

CDanis removed CDanis as the assignee of this task.

I'll take a look at this tomorrow during my day, unless @fgiunchedi beats me to it.

The issue went away by itself after mentioning it to Chris and Filippo :)

I took a look at this on both prometheus200[34] for up{instance=~"elastic2055.*9108"}: the metric appears yesterday on 2004 at 9:44 and on 2003 at 17:48, whereas both target files (/srv/prometheus/ops/targets/elasticsearch_codfw.yaml) were last modified at 10:10. So 2003 has definitely been lagging behind and only eventually discovered the target, for reasons yet to be determined.

There was a reload of prometheus@ops on 2003 at 17:41.

CDanis renamed this task from prometheus2004 not scraping lvs2007 & lvs2008 to some Prometheis not scraping the full set of targets. Mar 10 2020, 1:46 PM

I think the following shell one-liner could be turned into an Icinga check for this condition pretty easily:

✔️ cdanis@bast2002.wikimedia.org ~ 🕤☕ diff -u <(curl -s 'http://prometheus2003/ops/api/v1/query?query=up' | jq -r '.data.result[].metric.instance' | sort) <(curl -s 'http://prometheus2004/ops/api/v1/query?query=up' | jq -r '.data.result[].metric.instance' | sort)
✔️ cdanis@bast2002.wikimedia.org ~ 🕤☕ diff -u <(curl -s 'http://prometheus1003/ops/api/v1/query?query=up' | jq -r '.data.result[].metric.instance' | sort) <(curl -s 'http://prometheus1004/ops/api/v1/query?query=up' | jq -r '.data.result[].metric.instance' | sort)
--- /proc/self/fd/11 2020-03-10 13:41:24.488662498 +0000
+++ /proc/self/fd/12 2020-03-10 13:41:24.488662498 +0000
@@ -1464,9 +1464,13 @@
 logstash1025:9198
 logstash1025:9631
 logstash1026:9100
+logstash1026:9105
 logstash1027:9100
+logstash1027:9105
 logstash1028:9100
+logstash1028:9105
 logstash1029:9100
+logstash1029:9105
 lvs1013:9090
 lvs1013:9100
 lvs1013:9105
@@ -3825,8 +3829,11 @@
 wdqs1010:9193
 wdqs1010:9194
 wdqs1011:9100
+wdqs1011:9105
 wdqs1012:9100
+wdqs1012:9105
 wdqs1013:9100
+wdqs1013:9105
 weblog1001:9100
 weblog1001:9105
 webperf1001:9100
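
A rough sketch of what that check wrapper could look like (a hypothetical script, not an existing plugin; it assumes the /ops path and takes the two Prometheus hosts as arguments):

#!/bin/bash
# Hypothetical sketch: compare the set of 'up' instances between two Prometheus
# replicas and exit with Nagios-style status codes.
set -euo pipefail
A="$1"   # e.g. prometheus1003
B="$2"   # e.g. prometheus1004

instances() {
    curl -s "http://$1/ops/api/v1/query?query=up" \
        | jq -r '.data.result[].metric.instance' | sort
}

if diff_out=$(diff -u <(instances "$A") <(instances "$B")); then
    echo "OK: $A and $B scrape identical target sets"
    exit 0
else
    # count added/removed target lines, skipping the ---/+++ headers
    skew=$(grep -c '^[+-][^+-]' <<<"$diff_out" || true)
    echo "CRITICAL: $skew target(s) differ between $A and $B"
    exit 2
fi

In practice it would also want UNKNOWN (exit 3) when either query fails, and maybe some tolerance for hosts that are mid-provisioning, but that's the idea.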

25h later, there's additional skew between 1003 and 1004:

--- /proc/self/fd/13 2020-03-11 14:36:54.988930894 +0000
+++ /proc/self/fd/14 2020-03-11 14:36:54.988930894 +0000
@@ -1210,7 +1210,6 @@
 ganeti1007:9105
 ganeti1008:9100
 ganeti1008:9105
-ganeti1009:9105
 ganeti1010:9100
 ganeti1010:9105
 ganeti1011:9100
@@ -1463,9 +1462,13 @@
 logstash1025:9198
 logstash1025:9631
 logstash1026:9100
+logstash1026:9105
 logstash1027:9100
+logstash1027:9105
 logstash1028:9100
+logstash1028:9105
 logstash1029:9100
+logstash1029:9105
 lvs1013:9090
 lvs1013:9100
 lvs1013:9105
@@ -3735,7 +3738,6 @@
 stat1006:9105
 stat1007:9100
 stat1007:9105
-stat1008:9100
 thorium:9100
 thorium:9105
 thumbor1001:3903
@@ -3825,8 +3827,11 @@
 wdqs1010:9193
 wdqs1010:9194
 wdqs1011:9100
+wdqs1011:9105
 wdqs1012:9100
+wdqs1012:9105
 wdqs1013:9100
+wdqs1013:9105
 weblog1001:9100
 weblog1001:9105
 webperf1001:9100

2003/2004 are both presently identical.

As of this morning: 2003/2004 presently identical.

Skew has increased further on 1003/1004:

--- /proc/self/fd/13 2020-03-12 13:24:49.899324906 +0000
+++ /proc/self/fd/14 2020-03-12 13:24:49.899324906 +0000
@@ -1210,7 +1210,6 @@
 ganeti1007:9105
 ganeti1008:9100
 ganeti1008:9105
-ganeti1009:9105
 ganeti1010:9100
 ganeti1010:9105
 ganeti1011:9100
@@ -1242,7 +1241,6 @@
 gerrit1002:9100
 gerrit1002:9105
 gerrit.wikimedia.org:443
-grafana1001:9105
 grafana1002:9100
 grafana1002:9105
 grafana1002:9631
@@ -1462,9 +1460,13 @@
 logstash1025:9198
 logstash1025:9631
 logstash1026:9100
+logstash1026:9105
 logstash1027:9100
+logstash1027:9105
 logstash1028:9100
+logstash1028:9105
 logstash1029:9100
+logstash1029:9105
 lvs1013:9090
 lvs1013:9100
 lvs1013:9105
@@ -3727,7 +3729,9 @@
 sodium:9100
 sodium:9105
 sretest1001:9100
+sretest1001:9105
 sretest1002:9100
+sretest1002:9105
 stat1004:9100
 stat1004:9105
 stat1005:9100
@@ -3737,6 +3741,7 @@
 stat1007:9100
 stat1007:9105
 stat1008:9100
+stat1008:9105
 thorium:9100
 thorium:9105
 thumbor1001:3903
@@ -3826,8 +3831,11 @@
 wdqs1010:9193
 wdqs1010:9194
 wdqs1011:9100
+wdqs1011:9105
 wdqs1012:9100
+wdqs1012:9105
 wdqs1013:9100
+wdqs1013:9105
 weblog1001:9100
 weblog1001:9105
 webperf1001:9100

Aklapper renamed this task from some Prometheis not scraping the full set of targets to some Prometheus not scraping the full set of targets. Mar 12 2020, 2:46 PM
CDanis renamed this task from some Prometheus not scraping the full set of targets to some Prometheis not scraping the full set of targets. Mar 12 2020, 2:47 PM

@Aklapper: "After extensive research, it has been determined that the correct plural of 'Prometheus' is 'Prometheis'."
https://prometheus.io/docs/introduction/faq/#what-is-the-plural-of-prometheus

It appears a reload does resolve the issue, but it takes some time for Prometheus to fetch and store an update. I used kill -HUP <PID> to reload.
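
(For the record, the reload amounts to something like the following, assuming the prometheus@ops systemd unit:

sudo kill -HUP "$(systemctl show -p MainPID --value prometheus@ops)"

or, if the unit defines an ExecReload, simply sudo systemctl reload prometheus@ops.)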

The logs immediately following a reload seem to indicate the process cannot watch the config files for changes. Puppet does not issue a reload when target files are modified in any way.

# Issued HUP here
Mar 13 00:20:34 prometheus1003 prometheus@ops[20969]: level=info ts=2020-03-13T00:20:34.167231405Z caller=main.go:698 msg="Loading configuration file" filename=/srv/prometheus/ops/prometheus.yml
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=info ts=2020-03-13T00:20:38.818681106Z caller=main.go:725 msg="Completed loading of configuration file" filename=/srv/prometheus/ops/prometheus.yml
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.852335966Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860305321Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860310584Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860314889Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860316497Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860315009Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860302454Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860310984Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860309546Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Mar 13 00:20:38 prometheus1003 prometheus@ops[20969]: level=error ts=2020-03-13T00:20:38.860309539Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"

Nice catch in the logs!

I'm guessing we need to increase some of the inotify tunables. Probably this one:

✔️ cdanis@prometheus1003.eqiad.wmnet ~ 🕣🍺 cat /proc/sys/fs/inotify/max_user_watches
8192
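
For illustration, bumping it would look something like this (placeholder value and file name, not necessarily what we'll end up shipping via Puppet):

# inspect the current limits
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
# raise the watch limit at runtime
sudo sysctl -w fs.inotify.max_user_watches=32768
# and persist it across reboots
echo 'fs.inotify.max_user_watches = 32768' | sudo tee /etc/sysctl.d/70-prometheus-inotify.conf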

There are only so many files to be watched, but judging from lsof output, each prometheus thread holds an inotify handle on all of them, so it adds up quickly.

Less likely (but possible) is that we also need to increase fs.inotify.max_user_instances (currently 128).
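
A rough way to sanity-check that (hypothetical one-liner; counts the inotify descriptors, i.e. instances, currently held by the prometheus@ops main process):

sudo lsof -p "$(systemctl show -p MainPID --value prometheus@ops)" 2>/dev/null | grep -c inotify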

Change 579453 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] prom@ops: increase inotify watches in core DCs

https://gerrit.wikimedia.org/r/579453

Change 579453 merged by Cwhite:
[operations/puppet@production] prom@ops: increase inotify watches in core DCs

https://gerrit.wikimedia.org/r/579453

Change 579613 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: increase inotify instances in core DCs

https://gerrit.wikimedia.org/r/579613

Change 579613 merged by Cwhite:
[operations/puppet@production] profile: increase inotify instances in core DCs

https://gerrit.wikimedia.org/r/579613

Found a related issue.

We've upped max_user_instances and max_user_watches and the error logs haven't come back.

Let's check back on this in a week or so.

Krinkle renamed this task from some Prometheis not scraping the full set of targets to some Prometheus not scraping the full set of targets. Mar 13 2020, 7:55 PM
CDanis renamed this task from some Prometheus not scraping the full set of targets to some Prometheis not scraping the full set of targets. Mar 13 2020, 7:57 PM
CDanis added a subscriber: Krinkle.

Thanks... I guess that's bound to cause confusion, but I will try to be one of the people who knows and remembers.

No current differences in targets scraped between the proms in each cluster, and no hits for "too many open files" in their recent logs either.