
Frequent puppet failures
Open, MediumPublic

Description

For the last few days we have been getting a lot of icinga alerts related to catalog fetching:

2019-04-21 00:43:52	<+icinga-wm>	PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 01:28:29	<+icinga-wm>	PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 03:35:21	<+icinga-wm>	PROBLEM - puppet last run on db1117 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 03:52:01	<+icinga-wm>	PROBLEM - puppet last run on cloudelastic1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 04:50:05	<+icinga-wm>	PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 06:01:33	<+icinga-wm>	PROBLEM - puppet last run on db1118 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 07:17:49	<+icinga-wm>	PROBLEM - puppet last run on mw1223 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 08:33:27	<+icinga-wm>	PROBLEM - puppet last run on aluminium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 09:32:09	<+icinga-wm>	PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 16:59:07	<+icinga-wm>	PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 17:18:55	<+icinga-wm>	PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 17:50:01	<+icinga-wm>	PROBLEM - puppet last run on logstash1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
2019-04-21 21:04:15     <+icinga-wm>    PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues

I am very sorry, times are in GMT+3.

We had a total of 82 puppet failures, of which about 50 happened within the last 3 days. I just ran grep on my own logs, so those numbers may not be exactly accurate.
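
For reference, a rough per-host breakdown can be pulled from the same IRC logs. This is only a sketch; the weechat log path and message format are assumptions based on the log excerpt quoted in the comments below:

grep 'Catalog fetch fail' ~/.weechat/logs/irc.znc.\#wikimedia-operations.weechatlog \
  | grep -oP 'puppet last run on \K\S+' \
  | sort | uniq -c | sort -rn | head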

Related Objects

Status     Assigned
Resolved   Andrew
Open       None

Event Timeline

A bit out of date, but

~ % cat .weechat/logs/irc.znc.\#wikimedia-operations.weechatlog | grep 'Catalog fetch fail' | cut -f1 -d' ' | uniq -c
  1 2018-11-08
  2 2018-11-09
  2 2018-11-10
  2 2018-11-11
  2 2018-11-13
  8 2018-11-14
  5 2018-11-15
  1 2018-11-16
  3 2018-11-17
  1 2018-11-18
  5 2018-11-19
  1 2018-11-20
  2 2018-11-21
  5 2018-11-22
  3 2018-11-23
  1 2018-11-24
  2 2018-11-25
  3 2018-11-26
  5 2018-11-27
  9 2018-11-28
  8 2018-11-29
  4 2018-11-30
  1 2018-12-01
  4 2018-12-02
 42 2018-12-03
  5 2018-12-04
  6 2018-12-05
 10 2018-12-06
  2 2018-12-07
  1 2018-12-08
  2 2018-12-09
  2 2018-12-11
  1 2018-12-12
  2 2018-12-13
  2 2018-12-14
  1 2018-12-15
  2 2018-12-16
  1 2018-12-17
  1 2018-12-18
  4 2018-12-19
  1 2018-12-23
  1 2018-12-27
  1 2018-12-30
  4 2019-01-02
 10 2019-01-03
  3 2019-01-04
  2 2019-01-06
  1 2019-01-07
  1 2019-01-08
  1 2019-01-09
  3 2019-01-10
  2 2019-01-13
  2 2019-01-14
  1 2019-01-15
  1 2019-01-16
  1 2019-01-18
  2 2019-01-19
  3 2019-01-22
 19 2019-01-23
  4 2019-01-24
115 2019-01-25
  1 2019-01-26
  2 2019-01-27
  1 2019-01-28
  1 2019-01-29
  1 2019-01-30
  1 2019-01-31
  1 2019-02-01
  2 2019-02-02
 43 2019-02-04
  8 2019-02-05
  2 2019-02-06
  2 2019-02-07
  1 2019-02-08
  1 2019-02-10
 11 2019-02-11
  2 2019-02-12
 16 2019-02-13
  1 2019-02-14
  4 2019-02-16
  1 2019-02-19
  1 2019-02-21
  1 2019-02-23
  2 2019-02-24
  1 2019-02-26
  2 2019-02-27
  4 2019-02-28
188 2019-03-01
  2 2019-03-02
  2 2019-03-05
  4 2019-03-07
  2 2019-03-08
  1 2019-03-09
  1 2019-03-10
  2 2019-03-12
  3 2019-03-14
 48 2019-03-15
  1 2019-03-16
  2 2019-03-17
  1 2019-03-19
  4 2019-03-20
 14 2019-03-21
 17 2019-03-22
 13 2019-03-23
 11 2019-03-24
  6 2019-03-25
 15 2019-03-26
  6 2019-03-27
 12 2019-03-28
 19 2019-03-29
 16 2019-03-30
 13 2019-03-31
 12 2019-04-01
 17 2019-04-02
  9 2019-04-03
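
Sorting the same per-day counts makes the spike days stand out (e.g. 2019-03-01 and 2019-01-25 above); a quick one-liner reusing the pipeline above, with the same caveats about the log path:

cat .weechat/logs/irc.znc.\#wikimedia-operations.weechatlog | grep 'Catalog fetch fail' | cut -f1 -d' ' | uniq -c | sort -rn | head -5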

Here are the related puppet agent and puppetmaster1001 apache logs from a sampling of hosts:

Apr 20 21:38:22 planet1001 puppet-agent[7083]: Using configured environment 'production'
Apr 20 21:38:22 planet1001 puppet-agent[7083]: Retrieving pluginfacts
Apr 20 21:38:22 planet1001 puppet-agent[7083]: Retrieving plugin
Apr 20 21:38:22 planet1001 puppet-agent[7083]: Loading facts
Apr 20 21:38:29 planet1001 puppet-agent[7083]: Could not retrieve catalog from remote server: Error 503 on SERVER: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Apr 20 21:38:29 planet1001 puppet-agent[7083]: <html><head>
Apr 20 21:38:29 planet1001 puppet-agent[7083]: <title>503 Proxy Error</title>
Apr 20 21:38:29 planet1001 puppet-agent[7083]: </head><body>
Apr 20 21:38:29 planet1001 puppet-agent[7083]: <h1>Proxy Error</h1>
Apr 20 21:38:29 planet1001 puppet-agent[7083]: <p>The server is temporarily unable to service your
Apr 20 21:38:29 planet1001 puppet-agent[7083]: request due to maintenance downtime or capacity
Apr 20 21:38:29 planet1001 puppet-agent[7083]: problems. Please try again later.</p>
Apr 20 21:38:29 planet1001 puppet-agent[7083]: </body></html>
Apr 20 21:38:29 planet1001 puppet-agent[7083]: Not using cache on failed catalog
Apr 20 21:38:29 planet1001 puppet-agent[7083]: Could not retrieve catalog; skipping run
/var/log/apache2/error.log.3.gz:[Sat Apr 20 21:38:27.263867 2019] [proxy_http:error] [pid 19553] (70007)The timeout specified has expired: [client 2620:0:861:101:10:64:0:50:41980] AH01102: error reading status line from remote server puppetmaster1002.eqiad.wmnet:8141
/var/log/apache2/error.log.3.gz:[Sat Apr 20 21:38:27.263928 2019] [proxy:error] [pid 19553] [client 2620:0:861:101:10:64:0:50:41980] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/planet1001.eqiad.wmnet
/var/log/apache2/error.log.3.gz:[Sat Apr 20 21:38:28.302818 2019] [proxy_http:error] [pid 19553] (70007)The timeout specified has expired: [client 2620:0:861:101:10:64:0:50:41980] AH01102: error reading status line from remote server rhodium.eqiad.wmnet:8141
/var/log/apache2/error.log.3.gz:[Sat Apr 20 21:38:28.302864 2019] [proxy:error] [pid 19553] [client 2620:0:861:101:10:64:0:50:41980] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/planet1001.eqiad.wmnet
/var/log/apache2/error.log.3.gz:[Sat Apr 20 21:38:29.319062 2019] [proxy_http:error] [pid 19553] (70007)The timeout specified has expired: [client 2620:0:861:101:10:64:0:50:41980] AH01102: error reading status line from remote server puppetmaster1001.eqiad.wmnet:8141
/var/log/apache2/error.log.3.gz:[Sat Apr 20 21:38:29.319096 2019] [proxy:error] [pid 19553] [client 2620:0:861:101:10:64:0:50:41980] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/planet1001.eqiad.wmnet
Apr 21 06:24:48 mw1307 puppet-agent[40270]: Using configured environment 'production'
Apr 21 06:24:48 mw1307 puppet-agent[40270]: Retrieving pluginfacts
Apr 21 06:24:48 mw1307 puppet-agent[40270]: Retrieving plugin
Apr 21 06:24:49 mw1307 puppet-agent[40270]: Loading facts
Apr 21 06:24:56 mw1307 puppet-agent[40270]: Could not retrieve catalog from remote server: Error 503 on SERVER: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Apr 21 06:24:56 mw1307 puppet-agent[40270]: <html><head>
Apr 21 06:24:56 mw1307 puppet-agent[40270]: <title>503 Proxy Error</title>
Apr 21 06:24:56 mw1307 puppet-agent[40270]: </head><body>
Apr 21 06:24:56 mw1307 puppet-agent[40270]: <h1>Proxy Error</h1>
Apr 21 06:24:56 mw1307 puppet-agent[40270]: <p>The server is temporarily unable to service your
Apr 21 06:24:56 mw1307 puppet-agent[40270]: request due to maintenance downtime or capacity
Apr 21 06:24:56 mw1307 puppet-agent[40270]: problems. Please try again later.</p>
Apr 21 06:24:56 mw1307 puppet-agent[40270]: </body></html>
Apr 21 06:24:56 mw1307 puppet-agent[40270]: Not using cache on failed catalog
Apr 21 06:24:56 mw1307 puppet-agent[40270]: Could not retrieve catalog; skipping run
/var/log/apache2/error.log.3.gz:[Sun Apr 21 06:24:54.633279 2019] [proxy_http:error] [pid 22523] (70007)The timeout specified has expired: [client 2620:0:861:101:1a66:daff:fe99:4975:52544] AH01102: error reading status line from remote server rhodium.eqiad.wmnet:8141
/var/log/apache2/error.log.3.gz:[Sun Apr 21 06:24:54.633337 2019] [proxy:error] [pid 22523] [client 2620:0:861:101:1a66:daff:fe99:4975:52544] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/mw1307.eqiad.wmnet
/var/log/apache2/error.log.3.gz:[Sun Apr 21 06:24:55.649631 2019] [proxy_http:error] [pid 22523] (70007)The timeout specified has expired: [client 2620:0:861:101:1a66:daff:fe99:4975:52544] AH01102: error reading status line from remote server puppetmaster1002.eqiad.wmnet:8141
/var/log/apache2/error.log.3.gz:[Sun Apr 21 06:24:55.649662 2019] [proxy:error] [pid 22523] [client 2620:0:861:101:1a66:daff:fe99:4975:52544] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/mw1307.eqiad.wmnet
/var/log/apache2/error.log.3.gz:[Sun Apr 21 06:24:56.663715 2019] [proxy_http:error] [pid 22523] (70007)The timeout specified has expired: [client 2620:0:861:101:1a66:daff:fe99:4975:52544] AH01102: error reading status line from remote server puppetmaster1001.eqiad.wmnet:8141
/var/log/apache2/error.log.3.gz:[Sun Apr 21 06:24:56.663791 2019] [proxy:error] [pid 22523] [client 2620:0:861:101:1a66:daff:fe99:4975:52544] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/mw1307.eqiad.wmnet
Apr 21 18:00:23 gerrit2001 puppet-agent[9368]: Using configured environment 'production'
Apr 21 18:00:23 gerrit2001 puppet-agent[9368]: Retrieving pluginfacts
Apr 21 18:00:24 gerrit2001 puppet-agent[9368]: Retrieving plugin
Apr 21 18:00:24 gerrit2001 puppet-agent[9368]: Loading facts
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: Could not retrieve catalog from remote server: Error 503 on SERVER: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: <html><head>
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: <title>503 Proxy Error</title>
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: </head><body>
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: <h1>Proxy Error</h1>
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: <p>The server is temporarily unable to service your
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: request due to maintenance downtime or capacity
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: problems. Please try again later.</p>
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: </body></html>
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: Not using cache on failed catalog
Apr 21 18:00:30 gerrit2001 puppet-agent[9368]: Could not retrieve catalog; skipping run
/var/log/apache2/error.log.2.gz:[Sun Apr 21 18:00:28.256441 2019] [proxy_http:error] [pid 20452] (70007)The timeout specified has expired: [client 2620:0:860:4:208:80:153:106:46686] AH01102: error reading status line from remote server rhodium.eqiad.wmnet:8141
/var/log/apache2/error.log.2.gz:[Sun Apr 21 18:00:28.256516 2019] [proxy:error] [pid 20452] [client 2620:0:860:4:208:80:153:106:46686] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/gerrit2001.wikimedia.org
/var/log/apache2/error.log.2.gz:[Sun Apr 21 18:00:29.271845 2019] [proxy_http:error] [pid 20452] (70007)The timeout specified has expired: [client 2620:0:860:4:208:80:153:106:46686] AH01102: error reading status line from remote server puppetmaster1002.eqiad.wmnet:8141
/var/log/apache2/error.log.2.gz:[Sun Apr 21 18:00:29.271888 2019] [proxy:error] [pid 20452] [client 2620:0:860:4:208:80:153:106:46686] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/gerrit2001.wikimedia.org
/var/log/apache2/error.log.2.gz:[Sun Apr 21 18:00:30.286368 2019] [proxy_http:error] [pid 20452] (70007)The timeout specified has expired: [client 2620:0:860:4:208:80:153:106:46686] AH01102: error reading status line from remote server puppetmaster1001.eqiad.wmnet:8141
/var/log/apache2/error.log.2.gz:[Sun Apr 21 18:00:30.286403 2019] [proxy:error] [pid 20452] [client 2620:0:860:4:208:80:153:106:46686] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/gerrit2001.wikimedia.org
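
A quick way to see which backends the frontend is timing out against is to tally the AH01102 lines; this is a sketch against the same log paths shown in the excerpts above, not a verified command:

zgrep -h 'AH01102' /var/log/apache2/error.log*.gz \
  | grep -oP 'remote server \K\S+' \
  | sort | uniq -c | sort -rn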

I forget where, but in digging into this it seems that Puppet will return 503 if it is too busy; there are numerous reports of this. (To be clear, I don't know whether it's Puppet itself or an intermediary that returns the 503, but from the client's perspective this is the result.)
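
If the 503s are load-related, the rate of catalog compilations around a failure window should show it. A rough sketch, assuming the frontend writes a standard Apache access log at the path below (both the path and the log format are assumptions):

# Count catalog POSTs per minute, e.g. "  37 20/Apr/2019:21:38"
awk '$0 ~ /POST \/puppet\/v3\/catalog/ {print substr($4, 2, 17)}' /var/log/apache2/access.log \
  | sort | uniq -c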

herron triaged this task as Medium priority. Apr 26 2019, 10:44 PM

I have not had time to look at this in depth yet, however I did just notice an issue while applying a refactor change [1].

While applying the change set I got the following error on a few hosts (mw1286 and cp2006, as examples).

Apr 29 15:10:31 mw1286 puppet-agent[17486]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Class[Standard] is already declared; cannot re declare at /etc/puppet/modules/profile/manifests/standard.pp:5 at /etc/puppet/modules/profile/manifests/standard.pp:5:5 on node mw1286.eqiad.wmnet

The error happened as puppet-merge was rolling out changes. I have not looked at how puppet-merge works, but this looks like it is caused by a non-atomic update, i.e. if puppet-merge is updating the code repository at the same time as a puppet compile is being performed, then one could have a situation where the catalog is compiled using some files from the production branch pre-merge and some post-merge.

[1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/506990

Has anyone checked if the 5xx errors happen to coincide with puppet-merge happening?

The error happened as puppet-merge was rolling out changes. I have not looked at how puppet-merge works, but this looks like it is caused by a non-atomic update, i.e. if puppet-merge is updating the code repository at the same time as a puppet compile is being performed, then one could have a situation where the catalog is compiled using some files from the production branch pre-merge and some post-merge.

The update is definitely not atomic. It's effectively a loop over all puppetmasters doing a git pull. The puppet protocol is essentially first an HTTP POST which returns the compiled catalog as a response, then a number of GETs to fetch the various file resources that have a source parameter (note that resources with a content parameter are already embedded in the compiled catalog). This definitely allows for the following scenario:

  1. The agent POSTs the facts to puppet and gets load-balanced to puppetmaster X. It awaits the response for the compiled catalog.
  2. puppetmaster X does the compilation and sends a compiled catalog back with a number of file resources with source parameters
  3. A puppet-merge starts updating puppetmasters
  4. The agent reaches out via GETs to obtain the various file resources with source parameters. These are numerous (at some point years ago we counted ~1k IIRC for some agents) and all get load balanced.
  5. Some GETs reach puppetmasters where puppet-merge has happened, some do not.
  6. If the merged change removed/added some files, fetching those will fail, as the compiled catalog served will refer to files that do not yet exist or no longer exist.
  7. icinga will scream.

This is to illustrate that the issue exists, but does not explain the error posted above, which references the exact same file.
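
For concreteness, the exchange described above reduces to roughly the following; the endpoints match the proxy logs and the Puppet v3 API, while the frontend address, certificate paths, module name and fact payload are illustrative assumptions (a sketch, not a working invocation):

master=https://puppet:8140
certname=$(hostname -f)
auth="--cert /var/lib/puppet/ssl/certs/$certname.pem --key /var/lib/puppet/ssl/private_keys/$certname.pem --cacert /var/lib/puppet/ssl/certs/ca.pem"

# 1. A single POST: facts go up, the compiled catalog comes back.
#    This request is load-balanced to one backend.
curl $auth -X POST "$master/puppet/v3/catalog/$certname?environment=production" --data-urlencode "facts=..." -o catalog.json

# 2. Many GETs: every file resource with a 'source' parameter is fetched
#    separately, and each GET is load-balanced independently, so some can hit a
#    backend that has already pulled a newer commit than the one that compiled
#    the catalog above.
curl $auth "$master/puppet/v3/file_content/modules/some_module/some.conf?environment=production"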

In T221529#5144319, @crusnov wrote:
Has anyone checked if the 5xx errors happen to coincide with puppet-merge happening?

I just checked and as far as I can see puppet-merge does not log, so it's hard to correlate.
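
Even without explicit logging, each pull leaves a reflog entry on the masters, so approximate merge times can be recovered and lined up against the agent failures. The repository path below is an assumption:

sudo git -C /var/lib/git/operations/puppet reflog --date=iso | grep -i pull | head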

Change 524331 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] puppet-merge: possible idea to add some atomic behaviour to puppet-merge

https://gerrit.wikimedia.org/r/524331

Change 524331 abandoned by Jbond:
puppet-merge: possible idea to add some atomic behavior to puppet-merge

Reason:
puppet-merge has changed significantly since this CR

https://gerrit.wikimedia.org/r/524331

jbond renamed this task from Frequent puppet failures to Frequent puppet failures. Nov 4 2021, 3:16 PM
jbond removed a project: puppet-compiler.

The error happened as puppet-merge was rolling out changes. I have not looked at how puppet-merge works, but this looks like it is caused by a non-atomic update, i.e. if puppet-merge is updating the code repository at the same time as a puppet compile is being performed, then one could have a situation where the catalog is compiled using some files from the production branch pre-merge and some post-merge.

[1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/506990

We had another occurrence of this while rolling out an nrpe change. This renamed a [[ URL | variable in a manifest ]] which was later used in a template. For nodes that fail we see diffs like the following:

--- /etc/nagios/nrpe.d/get_raid_status_megacli.cfg	2022-01-27 07:35:43.930960052 +0000
+++ /tmp/puppet-file20220525-4082561-xyt3ba	2022-05-25 11:52:25.089350068 +0000
@@ -1,2 +1,2 @@
 # File generated by puppet. DO NOT edit by hand
-command[get_raid_status_megacli]=/usr/bin/sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli -c
\ No newline at end of file
+command[get_raid_status_megacli]=

Notice that the command is missing. When we see this diff we notice that the git version is set to "David Caro - wmcs-k8s-node-upgrade: add some extra logs" and not the correct version "nrpe: manage sudo rules via nrpe::check". The change from dcaro is the version that existed before the nrpe change. As such we can see, from the diff and from the fact that puppet reports the wrong version, that the puppet compilation used files from two different git commits: in this case it used the manifest file from one commit and the template file from the other, resulting in the missing variable in the rendered content. In total 4 machines hit this issue. There is always a chance of tickling this race condition, however we are much more likely to hit it with a change like this, which:

  • changes two separate files that depend on each other
  • applies to all machines in the fleet
  • applies to multiple resources

The long-term fix is to make puppet-merge atomic (a rough sketch of one approach follows below).
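
As a sketch of what "atomic" could mean here (the paths are illustrative, not the real puppet-merge layout): build the new tree off to the side and swap a symlink with a rename. This narrows the window considerably, although a compilation that spans the swap and re-resolves the symlink per file could still see a mix:

set -e
base=/srv/puppet-code
new="$base/releases/$(date +%s)"
git clone --quiet "$base/repo.git" "$new"    # prepare the new checkout out of band
ln -sfn "$new" "$base/current.new"           # stage a symlink pointing at it
mv -T "$base/current.new" "$base/current"    # rename() swaps it in atomically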

To resolve the current issue I will disable puppet while re-deploying the new change.
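
After re-enabling puppet and re-deploying, a quick consistency check across the backends confirms they all serve the same commit. The hostnames are taken from the proxy logs above; the repository path is an assumption:

for h in puppetmaster1001.eqiad.wmnet puppetmaster1002.eqiad.wmnet rhodium.eqiad.wmnet; do
  printf '%s ' "$h"; ssh "$h" git -C /var/lib/git/operations/puppet rev-parse --short HEAD
done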

Change 799943 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] puppet-merge: Add logging so we know when changes where merged

https://gerrit.wikimedia.org/r/799943
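
The actual implementation is in the Gerrit change above. As a minimal sketch, recording merge times can be a single syslog line from the merge script, which makes correlating merges with agent failures straightforward (the tag and variable names are illustrative, not necessarily what the patch does):

logger -t puppet-merge "merged ${NEW_SHA:-unknown} (was ${OLD_SHA:-unknown}) by ${SUDO_USER:-$USER}"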

Change 799943 merged by Jbond:

[operations/puppet@production] puppet-merge: Add logging so we know when changes where merged

https://gerrit.wikimedia.org/r/799943