
Frequent puppet failures
Open, MediumPublic

Description

Over the last few days we have been getting a lot of icinga alerts related to catalog fetching:

2019-04-21 00:43:52	<+icinga-wm>	PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 01:28:29	<+icinga-wm>	PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 03:35:21	<+icinga-wm>	PROBLEM - puppet last run on db1117 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 03:52:01	<+icinga-wm>	PROBLEM - puppet last run on cloudelastic1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 04:50:05	<+icinga-wm>	PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 06:01:33	<+icinga-wm>	PROBLEM - puppet last run on db1118 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 07:17:49	<+icinga-wm>	PROBLEM - puppet last run on mw1223 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 08:33:27	<+icinga-wm>	PROBLEM - puppet last run on aluminium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 09:32:09	<+icinga-wm>	PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 16:59:07	<+icinga-wm>	PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 17:18:55	<+icinga-wm>	PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 17:50:01	<+icinga-wm>	PROBLEM - puppet last run on logstash1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 21:04:15     <+icinga-wm>    PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues

Apologies, the times above are in GMT+3.

We had a total of 82 puppet failures, of which about 50 occurred within the last 3 days. I just ran grep on my own logs, so those numbers may not be exactly accurate.

Related Objects

Status    Assigned
Resolved  Andrew
Open      None
Event Timeline

A bit out of date, but

~ % cat .weechat/logs/irc.znc.\#wikimedia-operations.weechatlog | grep 'Catalog fetch fail' | cut -f1 -d' ' | uniq -c
  1 2018-11-08
  2 2018-11-09
  2 2018-11-10
  2 2018-11-11
  2 2018-11-13
  8 2018-11-14
  5 2018-11-15
  1 2018-11-16
  3 2018-11-17
  1 2018-11-18
  5 2018-11-19
  1 2018-11-20
  2 2018-11-21
  5 2018-11-22
  3 2018-11-23
  1 2018-11-24
  2 2018-11-25
  3 2018-11-26
  5 2018-11-27
  9 2018-11-28
  8 2018-11-29
  4 2018-11-30
  1 2018-12-01
  4 2018-12-02
 42 2018-12-03
  5 2018-12-04
  6 2018-12-05
 10 2018-12-06
  2 2018-12-07
  1 2018-12-08
  2 2018-12-09
  2 2018-12-11
  1 2018-12-12
  2 2018-12-13
  2 2018-12-14
  1 2018-12-15
  2 2018-12-16
  1 2018-12-17
  1 2018-12-18
  4 2018-12-19
  1 2018-12-23
  1 2018-12-27
  1 2018-12-30
  4 2019-01-02
 10 2019-01-03
  3 2019-01-04
  2 2019-01-06
  1 2019-01-07
  1 2019-01-08
  1 2019-01-09
  3 2019-01-10
  2 2019-01-13
  2 2019-01-14
  1 2019-01-15
  1 2019-01-16
  1 2019-01-18
  2 2019-01-19
  3 2019-01-22
 19 2019-01-23
  4 2019-01-24
115 2019-01-25
  1 2019-01-26
  2 2019-01-27
  1 2019-01-28
  1 2019-01-29
  1 2019-01-30
  1 2019-01-31
  1 2019-02-01
  2 2019-02-02
 43 2019-02-04
  8 2019-02-05
  2 2019-02-06
  2 2019-02-07
  1 2019-02-08
  1 2019-02-10
 11 2019-02-11
  2 2019-02-12
 16 2019-02-13
  1 2019-02-14
  4 2019-02-16
  1 2019-02-19
  1 2019-02-21
  1 2019-02-23
  2 2019-02-24
  1 2019-02-26
  2 2019-02-27
  4 2019-02-28
188 2019-03-01
  2 2019-03-02
  2 2019-03-05
  4 2019-03-07
  2 2019-03-08
  1 2019-03-09
  1 2019-03-10
  2 2019-03-12
  3 2019-03-14
 48 2019-03-15
  1 2019-03-16
  2 2019-03-17
  1 2019-03-19
  4 2019-03-20
 14 2019-03-21
 17 2019-03-22
 13 2019-03-23
 11 2019-03-24
  6 2019-03-25
 15 2019-03-26
  6 2019-03-27
 12 2019-03-28
 19 2019-03-29
 16 2019-03-30
 13 2019-03-31
 12 2019-04-01
 17 2019-04-02
  9 2019-04-03
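For what it's worth, the spike days in the listing above stand out more if the same counts are piped through sort. A minimal self-contained example (fed from an inline snippet of the data; the real input is the uniq -c output of the full log):

```shell
# Sort per-day failure counts descending so the worst days come first.
printf '%s\n' '188 2019-03-01' '115 2019-01-25' '48 2019-03-15' '2 2019-03-02' |
    sort -k1,1 -rn | head -n 3
```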

Here are related puppet agent and puppetmaster1001 apache logs from a sampling of hosts

Apr 20 21:38:22 planet1001 puppet-agent[7083]: Using configured environment 'production'
Apr 20 21:38:22 planet1001 puppet-agent[7083]: Retrieving pluginfacts
Apr 20 21:38:22 planet1001 puppet-agent[7083]: Retrieving plugin
Apr 20 21:38:22 planet1001 puppet-agent[7083]: Loading facts
Apr 20 21:38:29 planet1001 puppet-agent[7083]: Could not retrieve catalog from remote server: Error 503 on SERVER: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Apr 20 21:38:29 planet1001 puppet-agent[7083]: <html><head>
Apr 20 21:38:29 planet1001 puppet-agent[7083]: <title>503 Proxy Error</title>
Apr 20 21:38:29 planet1001 puppet-agent[7083]: </head><body>
Apr 20 21:38:29 planet1001 puppet-agent[7083]: <h1>Proxy Error</h1>
Apr 20 21:38:29 planet1001 puppet-agent[7083]: <p>The server is temporarily unable to service your
Apr 20 21:38:29 planet1001 puppet-agent[7083]: request due to maintenance downtime or capacity
Apr 20 21:38:29 planet1001 puppet-agent[7083]: problems. Please try again later.</p>
Apr 20 21:38:29 planet1001 puppet-agent[7083]: </body></html>
Apr 20 21:38:29 planet1001 puppet-agent[7083]: Not using cache on failed catalog
Apr 20 21:38:29 planet1001 puppet-agent[7083]: Could not retrieve catalog; skipping run
/var/log/apache2/error.log.3.gz:[Sat Apr 20 21:38:27.263867 2019] [proxy_http:error] [pid 19553] (70007)The timeout specified has expired: [client 2620:0:861:101:10:64:0:50:41980] AH01102: error reading status line from remote server puppetmaster1002.eqiad.wmnet:8141
/var/log/apache2/error.log.3.gz:[Sat Apr 20 21:38:27.263928 2019] [proxy:error] [pid 19553] [client 2620:0:861:101:10:64:0:50:41980] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/planet1001.eqiad.wmnet
/var/log/apache2/error.log.3.gz:[Sat Apr 20 21:38:28.302818 2019] [proxy_http:error] [pid 19553] (70007)The timeout specified has expired: [client 2620:0:861:101:10:64:0:50:41980] AH01102: error reading status line from remote server rhodium.eqiad.wmnet:8141
/var/log/apache2/error.log.3.gz:[Sat Apr 20 21:38:28.302864 2019] [proxy:error] [pid 19553] [client 2620:0:861:101:10:64:0:50:41980] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/planet1001.eqiad.wmnet
/var/log/apache2/error.log.3.gz:[Sat Apr 20 21:38:29.319062 2019] [proxy_http:error] [pid 19553] (70007)The timeout specified has expired: [client 2620:0:861:101:10:64:0:50:41980] AH01102: error reading status line from remote server puppetmaster1001.eqiad.wmnet:8141
/var/log/apache2/error.log.3.gz:[Sat Apr 20 21:38:29.319096 2019] [proxy:error] [pid 19553] [client 2620:0:861:101:10:64:0:50:41980] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/planet1001.eqiad.wmnet
Apr 21 06:24:48 mw1307 puppet-agent[40270]: Using configured environment 'production'
Apr 21 06:24:48 mw1307 puppet-agent[40270]: Retrieving pluginfacts
Apr 21 06:24:48 mw1307 puppet-agent[40270]: Retrieving plugin
Apr 21 06:24:49 mw1307 puppet-agent[40270]: Loading facts
Apr 21 06:24:56 mw1307 puppet-agent[40270]: Could not retrieve catalog from remote server: Error 503 on SERVER: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Apr 21 06:24:56 mw1307 puppet-agent[40270]: <html><head>
Apr 21 06:24:56 mw1307 puppet-agent[40270]: <title>503 Proxy Error</title>
Apr 21 06:24:56 mw1307 puppet-agent[40270]: </head><body>
Apr 21 06:24:56 mw1307 puppet-agent[40270]: <h1>Proxy Error</h1>
Apr 21 06:24:56 mw1307 puppet-agent[40270]: <p>The server is temporarily unable to service your
Apr 21 06:24:56 mw1307 puppet-agent[40270]: request due to maintenance downtime or capacity
Apr 21 06:24:56 mw1307 puppet-agent[40270]: problems. Please try again later.</p>
Apr 21 06:24:56 mw1307 puppet-agent[40270]: </body></html>
Apr 21 06:24:56 mw1307 puppet-agent[40270]: Not using cache on failed catalog
Apr 21 06:24:56 mw1307 puppet-agent[40270]: Could not retrieve catalog; skipping run
/var/log/apache2/error.log.3.gz:[Sun Apr 21 06:24:54.633279 2019] [proxy_http:error] [pid 22523] (70007)The timeout specified has expired: [client 2620:0:861:101:1a66:daff:fe99:4975:52544] AH01102: error reading status line from remote server rhodium.eqiad.wmnet:8141
/var/log/apache2/error.log.3.gz:[Sun Apr 21 06:24:54.633337 2019] [proxy:error] [pid 22523] [client 2620:0:861:101:1a66:daff:fe99:4975:52544] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/mw1307.eqiad.wmnet
/var/log/apache2/error.log.3.gz:[Sun Apr 21 06:24:55.649631 2019] [proxy_http:error] [pid 22523] (70007)The timeout specified has expired: [client 2620:0:861:101:1a66:daff:fe99:4975:52544] AH01102: error reading status line from remote server puppetmaster1002.eqiad.wmnet:8141
/var/log/apache2/error.log.3.gz:[Sun Apr 21 06:24:55.649662 2019] [proxy:error] [pid 22523] [client 2620:0:861:101:1a66:daff:fe99:4975:52544] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/mw1307.eqiad.wmnet
/var/log/apache2/error.log.3.gz:[Sun Apr 21 06:24:56.663715 2019] [proxy_http:error] [pid 22523] (70007)The timeout specified has expired: [client 2620:0:861:101:1a66:daff:fe99:4975:52544] AH01102: error reading status line from remote server puppetmaster1001.eqiad.wmnet:8141
/var/log/apache2/error.log.3.gz:[Sun Apr 21 06:24:56.663791 2019] [proxy:error] [pid 22523] [client 2620:0:861:101:1a66:daff:fe99:4975:52544] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/mw1307.eqiad.wmnet
Apr 21 18:00:23 gerrit2001 puppet-agent[9368]: Using configured environment 'production'
Apr 21 18:00:23 gerrit2001 puppet-agent[9368]: Retrieving pluginfacts
Apr 21 18:00:24 gerrit2001 puppet-agent[9368]: Retrieving plugin
Apr 21 18:00:24 gerrit2001 puppet-agent[9368]: Loading facts
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: Could not retrieve catalog from remote server: Error 503 on SERVER: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: <html><head>
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: <title>503 Proxy Error</title>
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: </head><body>
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: <h1>Proxy Error</h1>
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: <p>The server is temporarily unable to service your
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: request due to maintenance downtime or capacity
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: problems. Please try again later.</p>
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: </body></html>
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: Not using cache on failed catalog
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: Could not retrieve catalog; skipping run
/var/log/apache2/error.log.2.gz:[Sun Apr 21 18:00:28.256441 2019] [proxy_http:error] [pid 20452] (70007)The timeout specified has expired: [client 2620:0:860:4:208:80:153:106:46686] AH01102: error reading status line from remote server rhodium.eqiad.wmnet:8141
/var/log/apache2/error.log.2.gz:[Sun Apr 21 18:00:28.256516 2019] [proxy:error] [pid 20452] [client 2620:0:860:4:208:80:153:106:46686] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/gerrit2001.wikimedia.org
/var/log/apache2/error.log.2.gz:[Sun Apr 21 18:00:29.271845 2019] [proxy_http:error] [pid 20452] (70007)The timeout specified has expired: [client 2620:0:860:4:208:80:153:106:46686] AH01102: error reading status line from remote server puppetmaster1002.eqiad.wmnet:8141
/var/log/apache2/error.log.2.gz:[Sun Apr 21 18:00:29.271888 2019] [proxy:error] [pid 20452] [client 2620:0:860:4:208:80:153:106:46686] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/gerrit2001.wikimedia.org
/var/log/apache2/error.log.2.gz:[Sun Apr 21 18:00:30.286368 2019] [proxy_http:error] [pid 20452] (70007)The timeout specified has expired: [client 2620:0:860:4:208:80:153:106:46686] AH01102: error reading status line from remote server puppetmaster1001.eqiad.wmnet:8141
/var/log/apache2/error.log.2.gz:[Sun Apr 21 18:00:30.286403 2019] [proxy:error] [pid 20452] [client 2620:0:860:4:208:80:153:106:46686] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/gerrit2001.wikimedia.org

I forget where I read it, but in digging into this it seems that Puppet will return a 503 if it is too busy; there are numerous reports of this. (To be clear, I don't know whether it's Puppet itself or an intermediary that returns the 503, but from the client's perspective the result is the same.)
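The Apache logs above (AH01102 followed by a 503) suggest the 503 comes from the mod_proxy frontend when every backend times out, not from Puppet itself. A hypothetical sketch of the kind of setup those logs imply; the directive values here are illustrative assumptions, not the actual WMF configuration:

```apache
# Hypothetical frontend proxy for the puppetmasters seen in the logs.
# When a member does not return a status line within its timeout,
# mod_proxy logs AH01102; if all members fail, the client gets the
# generic "503 Proxy Error" page the agents quoted above.
<Proxy balancer://puppet_backends>
    BalancerMember https://puppetmaster1001.eqiad.wmnet:8141 timeout=60 retry=0
    BalancerMember https://puppetmaster1002.eqiad.wmnet:8141 timeout=60 retry=0
    BalancerMember https://rhodium.eqiad.wmnet:8141 timeout=60 retry=0
</Proxy>
ProxyPass /puppet balancer://puppet_backends/puppet
```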

herron triaged this task as Medium priority.Apr 26 2019, 10:44 PM

I have not had time to look at this in depth yet, however I did just notice an issue while applying a refactor change[1].

While applying the change set I got the following error on a few hosts (mw1286 and cp2006, for example).

Apr 29 15:10:31 mw1286 puppet-agent[17486]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Class[Standard] is already declared; cannot re declare at /etc/puppet/modules/profile/manifests/standard.pp:5 at /etc/puppet/modules/profile/manifests/standard.pp:5:5 on node mw1286.eqiad.wmnet

The error happened as puppet-merge was rolling out changes. I have not looked at how puppet-merge works, but this looks like it is caused by a non-atomic update: if puppet-merge is updating the code repository at the same time as a puppet compile is being performed, the catalog could be compiled using some files from the production branch pre-merge and some post-merge.

[1]https://gerrit.wikimedia.org/r/c/operations/puppet/+/506990
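For reference, a common way to make such a rollout atomic is to build the new tree beside the old one and switch a symlink with a single rename. This is only a sketch of the technique (demonstrated in a temp dir), not how puppet-merge actually works:

```shell
#!/bin/sh
# Sketch of an atomic "swap the tree" rollout.
set -e
base=$(mktemp -d)

# Two fully-built release trees (stand-ins for the pre- and post-merge checkouts).
mkdir -p "$base/releases/old" "$base/releases/new"
echo v1 > "$base/releases/old/site.pp"
echo v2 > "$base/releases/new/site.pp"

# "current" is the symlink the compiler would read its code directory through.
ln -s "$base/releases/old" "$base/current"

# To roll out: create the new symlink beside the old one, then rename it
# over the old one. rename(2) is atomic, so a compile in progress sees
# either the whole old tree or the whole new one, never a mix.
# (mv -T is GNU coreutils: treat the destination as a file, not a directory.)
ln -s "$base/releases/new" "$base/current.tmp"
mv -T "$base/current.tmp" "$base/current"

cat "$base/current/site.pp"
```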

Has anyone checked if the 5xx errors happen to coincide with puppet-merge happening?

The error happened as puppet-merge was rolling out changes. I have not looked at how puppet-merge works, but this looks like it is caused by a non-atomic update: if puppet-merge is updating the code repository at the same time as a puppet compile is being performed, the catalog could be compiled using some files from the production branch pre-merge and some post-merge.

The update is definitely not atomic. It's effectively a loop over all puppetmasters doing a git pull. The puppet protocol is essentially first an HTTP POST, which returns the compiled catalog as a response, then a number of GETs to fetch the various file resources that have a source parameter (note that resources with a content parameter are already embedded in the compiled catalog). This definitely allows for the following scenario:

  1. The agent POSTs the facts to puppet and gets loadbalanced to puppetmaster X. It awaits the response for the compiled catalog.
  2. puppetmaster X does the compilation and sends a compiled catalog back with a number of file resources with source parameters
  3. A puppet-merge starts updating puppetmasters
  4. The agent reaches out via GETs to obtain the various file resources with source parameters. These are numerous (at some point years ago we counted ~1k IIRC for some agents) and all get load balanced.
  5. Some GETs reach puppetmasters where puppet-merge has happened, some do not.
  6. If the merged change removed or added files, fetching those will fail, as the compiled catalog being served refers to files that do not yet exist or no longer exist.
  7. icinga will scream.
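The race in steps 1-6 can be reproduced in miniature with plain files standing in for the puppetmasters. Everything here (paths, file names) is made up purely for illustration:

```shell
#!/bin/sh
# Toy reproduction: a catalog compiled against the pre-merge tree, with
# file fetches load-balanced across backends updated at different times.
base=$(mktemp -d)
mkdir -p "$base/backend1" "$base/backend2"

# Both backends start with the pre-merge tree.
for b in backend1 backend2; do
    echo old > "$base/$b/motd"
    echo old > "$base/$b/sshd_config"
done

# Step 2: the catalog is compiled against the pre-merge tree and
# references both files by source path.
catalog="motd sshd_config"

# Step 3: puppet-merge reaches backend2 first; the change removes
# sshd_config there, but backend1 has not been updated yet.
rm "$base/backend2/sshd_config"

# Steps 4-5: the agent GETs each file, round-robining across backends.
backend=backend1
for f in $catalog; do
    if [ -e "$base/$backend/$f" ]; then
        echo "fetched $f from $backend"
    else
        echo "FAILED to fetch $f from $backend"   # step 6
    fi
    [ "$backend" = backend1 ] && backend=backend2 || backend=backend1
done
```

The second fetch fails because the catalog was compiled against a tree that one backend no longer serves.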

This is to illustrate that the issue exists; it does not, however, explain the error posted above, which references the exact same file.

! In T221529#5144319, @crusnov wrote:
Has anyone checked if the 5xx errors happen to coincide with puppet-merge happening?

I just checked and as far as I can see puppet-merge does not log, so it's hard to correlate.
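Until it does, a minimal way to make merges correlatable would be to bracket each run with timestamped log lines. A sketch only; the log path and commit variable here are hypothetical stand-ins:

```shell
#!/bin/sh
# Hypothetical sketch: bracket each merge with timestamped lines so
# icinga 5xx spikes can later be grepped against merge windows.
log=$(mktemp)   # stand-in path; real tooling would log to syslog
sha=abc123      # stand-in for the commit being merged

printf '%s start %s\n' "$(date -u +%FT%TZ)" "$sha" >> "$log"
# ... the per-puppetmaster git pulls would happen here ...
printf '%s end %s\n'   "$(date -u +%FT%TZ)" "$sha" >> "$log"
```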

Change 524331 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] puppet-merge: possible idea to add some atomic behaviour to puppet-merge

https://gerrit.wikimedia.org/r/524331

Change 524331 abandoned by Jbond:
puppet-merge: possible idea to add some atomic behavior to puppet-merge

Reason:
puppet-merge has changed significantly since this CR

https://gerrit.wikimedia.org/r/524331

jbond renamed this task from Frequent puppet failures to Frequent puppet failures.Thu, Nov 4, 3:16 PM
jbond removed a project: puppet-compiler.