Page MenuHomePhabricator

Sporadic puppet failures
Closed, ResolvedPublic

Description

Every few days we get a page about puppet failing on a labvirt. Any time I actually try a puppet run, though, it recovers. What's happening?

[UPDATE Oct. 6th]
Today we got more failures that were widespread on multiple hosts, hence renaming the task, adding SRE and adding a summary on T201247#4647722

Event Timeline

Today's was labvirt1008, and the error was:

(/Stage[main]/Base::Environment/File[/etc/profile.d/mysql-ps1.sh]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/environment/mysql-ps1.sh: end of file reached
Andrew triaged this task as Medium priority.Aug 4 2018, 8:35 PM

labvirt1006/puppet last run is CRITICAL <-- just now

Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: Could not retrieve catalog from remote server: Error 503 on SERVER: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML
 2.0//EN">
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: <html><head>
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: <title>503 Proxy Error</title>
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: </head><body>
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: <h1>Proxy Error</h1>
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: <p>The server is temporarily unable to service your
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: request due to maintenance downtime or capacity
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: problems. Please try again later.</p>
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: </body></html>
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: Not using cache on failed catalog
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: Could not retrieve catalog; skipping run

So, this would imply that the error is a 503 on puppetmaster1001.eqiad.wmnet?

Aug 6 22:12:05 puppetmaster1001 puppet-master[16473]: Could not find resource 'File[/etc/ferm/conf.d]' in parameter 'require'
Aug 6 22:12:05 puppetmaster1001 puppet-master[16473]: (at /etc/puppet/modules/ferm/manifests/service.pp:23)
Aug 6 22:12:05 puppetmaster1001 puppet-master[16473]: Could not find resource 'Service[ferm]' in parameter 'notify'
Aug 6 22:12:05 puppetmaster1001 puppet-master[16473]: (at /etc/puppet/modules/ferm/manifests/service.pp:24)
Aug 6 22:12:05 puppetmaster1001 puppet-master[16473]: Compiled catalog for labvirt1006.eqiad.wmnet in environment production in 4.48 seconds

That doesn't quite look like a fail...

[Mon Aug 06 22:11:59.926263 2018] [proxy_http:error] [pid 25300] (70007)The timeout specified has expired: [client 2620:0:861:118:2a80:23ff:fe9a:d4cc:49224] AH01102: error reading status line from remote server rhodium.eqiad.wmnet:8141
[Mon Aug 06 22:11:59.926329 2018] [proxy:error] [pid 25300] [client 2620:0:861:118:2a80:23ff:fe9a:d4cc:49224] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/labvirt1006.eqiad.wmnet
[Mon Aug 06 22:12:00.942148 2018] [proxy_http:error] [pid 25300] (70007)The timeout specified has expired: [client 2620:0:861:118:2a80:23ff:fe9a:d4cc:49224] AH01102: error reading status line from remote server puppetmaster1002.eqiad.wmnet:8141
[Mon Aug 06 22:12:00.942206 2018] [proxy:error] [pid 25300] [client 2620:0:861:118:2a80:23ff:fe9a:d4cc:49224] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/labvirt1006.eqiad.wmnet
App 26337 stdout:
[Mon Aug 06 22:12:01.956792 2018] [proxy_http:error] [pid 25300] (70007)The timeout specified has expired: [client 2620:0:861:118:2a80:23ff:fe9a:d4cc:49224] AH01102: error reading status line from remote server puppetmaster1001.eqiad.wmnet:8141
[Mon Aug 06 22:12:01.956831 2018] [proxy:error] [pid 25300] [client 2620:0:861:118:2a80:23ff:fe9a:d4cc:49224] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/labvirt1006.eqiad.wmnet

This seems to be the relevant issue.

So this is, in fact, an apache proxy timeout on the puppet master. We just happen to get alerted for it.

Happened again today:

Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: Could not retrieve catalog from remote server: Error 503 on SERVER: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: <html><head>
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: <title>503 Proxy Error</title>
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: </head><body>
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: <h1>Proxy Error</h1>
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: <p>The server is temporarily unable to service your
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: request due to maintenance downtime or capacity
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: problems. Please try again later.</p>
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: </body></html>
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: Not using cache on failed catalog
Aug  6 22:12:01 labvirt1006 puppet-agent[55835]: Could not retrieve catalog; skipping run
Oct  6 00:58:00 labvirt1012 puppet-agent[48199]: Using configured environment 'production'
Oct  6 00:58:00 labvirt1012 puppet-agent[48199]: Retrieving pluginfacts
Oct  6 00:58:00 labvirt1012 puppet-agent[48199]: Retrieving plugin
Oct  6 00:58:01 labvirt1012 puppet-agent[48199]: Loading facts
Oct  6 00:58:13 labvirt1012 puppet-agent[48199]: Could not retrieve catalog from remote server: Error 503 on SERVER: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Oct  6 00:58:13 labvirt1012 puppet-agent[48199]: <html><head>
Oct  6 00:58:13 labvirt1012 puppet-agent[48199]: <title>503 Proxy Error</title>
Oct  6 00:58:13 labvirt1012 puppet-agent[48199]: </head><body>
Oct  6 00:58:13 labvirt1012 puppet-agent[48199]: <h1>Proxy Error</h1>
Oct  6 00:58:13 labvirt1012 puppet-agent[48199]: <p>The server is temporarily unable to service your
Oct  6 00:58:13 labvirt1012 puppet-agent[48199]: request due to maintenance downtime or capacity
Oct  6 00:58:13 labvirt1012 puppet-agent[48199]: problems. Please try again later.</p>
Oct  6 00:58:13 labvirt1012 puppet-agent[48199]: </body></html>
Oct  6 00:58:13 labvirt1012 puppet-agent[48199]: Not using cache on failed catalog
Volans renamed this task from Sporadic labvirt puppet failures to Sporadic puppet failures.Oct 6 2018, 9:13 PM
Volans raised the priority of this task from Medium to High.
Volans added a project: SRE.
Volans updated the task description. (Show Details)
Volans added a subscriber: herron.

Today we got more widespread failures, and @Joe and I had a look at it, we also tried to restart apache2 on puppetmaster1001 without much success, although the issue mostly recovered after but still we got at least one failure.

Looking at the logs and apache2 documentation I think we should try to remove the ping parameter in the proxy configuration in modules/puppetmaster/templates/web-frontend.conf.erb:

BalancerMember https://<%= workerhash['worker'] %>:8141 ping=1 connectiontimeout=1 retry=500 timeout=900 <%- if workerhash['loadfactor'] -%>loadfactor=<%= workerhash['loadfactor'] %><%- end %>

I'm suggesting this because Apache2 documentation reports (see https://httpd.apache.org/docs/2.4/mod/mod_proxy.html#proxypass ):

Ping: For HTTP, it causes mod_proxy_http to send a 100-Continue to the backend. [...SNIP...] This feature has been added to avoid problems with hung and busy backends. This will increase the network traffic during the normal operation which could be an issue, but it will lower the traffic in case some of the cluster nodes are down or busy.

And from the logs we get for each failure something like:

[Sat Oct 06 20:55:37.282152 2018] [proxy_http:error] [pid 22820] (70007)The timeout specified has expired: [client 2620:0:861:101:1602:ecff:fe07:31c4:41382] AH01102: error reading status line from remote server puppetmaster1001.eqiad.wmnet:8141
[Sat Oct 06 20:55:37.282180 2018] [proxy:error] [pid 22820] [client 2620:0:861:101:1602:ecff:fe07:31c4:41382] AH00898: Timeout on 100-Continue returned by /puppet/v3/catalog/elastic1033.eqiad.wmnet

I'm not sure that the timeout on the 100-Continue is the only error as it logs two lines but I think it's worth to try to remove the ping parameter and see how it goes. We already have connectiontimeout and timeout to protect us and from the doc it seems that ping is a totally optional additional option.

We don't have a nice graph in grafana for this, so for reference I'm leaving this:

root@puppetmaster1001:/var/log/apache2# zgrep -c 'Timeout on 100-Continue returned' error.log*
error.log:70
error.log.1:60
error.log.2.gz:0
error.log.3.gz:0
error.log.4.gz:6
error.log.5.gz:0
error.log.6.gz:3
error.log.7.gz:0

[unrelated] I think also that we should revisit the retry parameter, at least checking what is the behaviour in case during the the provided retry period (500s in our case) all backends have 1 failure and are marked 'failed'.

My 2 cents. Thoughts?

IMO removing ping is worth a shot, but I also wouldn't be surprised to find the timeout on 100-Continue was a side-effect of another issue.

In addition to the 100-Continue errors, I'm also seeing end of file found errors logged by the puppet masters to the effect of:

[Tue Oct 09 09:07:21.073836 2018] [proxy_http:error] [pid 24186] (70014)End of file found: [client 10.64.20.36:51590] AH01102: error reading status line from remote server rhodium.eqiad.wmnet:8141

These are happening more frequently on the dedicated backends, which should be explained by their higher loadfactor in the apache reverse proxy config.

puppetmaster1001:~# zgrep -i 'end of file' /var/log/apache2/error.log* | awk '{ print $23 }' | sort | uniq -c | sort -n
     13 puppetmaster1001.eqiad.wmnet:8141
     36 rhodium.eqiad.wmnet:8141
     42 puppetmaster1002.eqiad.wmnet:8141
puppetmaster2001:~# zgrep -i 'end of file' /var/log/apache2/error.log* | awk '{ print $23 }' | sort | uniq -c | sort -n
      8 puppetmaster2001.codfw.wmnet:8141
     30 puppetmaster2002.codfw.wmnet:8141

Since these errors are occurring in both sites with similar regularity I wonder if the cause is a common factor between all puppet masters?

Interestingly there are mysql timeout errors being logged into apache error.log regularly as well.

puppetmaster1001:~# zgrep -i 'mysql error' /var/log/apache2/error.log
App 24678 stdout: Mysql error: 1205, Lock wait timeout exceeded; try restarting transaction
App 3160 stdout: Mysql error: 1205, Lock wait timeout exceeded; try restarting transaction
App 3182 stdout: Mysql error: 1205, Lock wait timeout exceeded; try restarting transaction
App 5166 stdout: Mysql error: 1205, Lock wait timeout exceeded; try restarting transaction
App 3170 stdout: Mysql error: 1205, Lock wait timeout exceeded; try restarting transaction
App 3104 stdout: Mysql error: 1205, Lock wait timeout exceeded; try restarting transaction
App 14111 stdout: Mysql error: 1205, Lock wait timeout exceeded; try restarting transaction
App 32039 stdout: Mysql error: 1205, Lock wait timeout exceeded; try restarting transaction
puppetmaster2001:~# grep -i 'mysql error' /var/log/apache2/error.log
App 23574 stdout: Mysql error: 1205, Lock wait timeout exceeded; try restarting transaction
App 14790 stdout: Mysql error: 1205, Lock wait timeout exceeded; try restarting transaction
App 25398 stdout: Mysql error: 1205, Lock wait timeout exceeded; try restarting transaction
App 30917 stdout: Mysql error: 1205, Lock wait timeout exceeded; try restarting transaction

These mysql errors aren't logged with timestamps which makes it hard to say exactly when they happen and if they correlate directly. Still, I wonder if master backends become blocked waiting on a lock and this eventually/occasionally results in mod_proxy errors? Both eqiad and codfw masters do have the dbserver = m1-master.eqiad.wmnet puppet.conf setting in common.

Things seem better this week! Is that my imagination?

Spoke too soon, got another failure overnight.

Oct 23 06:25:20 labvirt1017 puppet-agent[161569]: (/Stage[main]/Openstack::Nova::Common::Base/File[/etc/nova/policy.json]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/openstack/mitaka/nova/common/policy.json: end of file reached

*bump* -- I'm interested on if anyone is working on fixing these issues. If not, that's fine but I'll put some more time into ensuring that we don't get pages for them :)

Krenair renamed this task from Sporadic puppet failures to Sporadic puppet failures on labvirt hosts.Oct 30 2018, 7:37 PM
Krenair renamed this task from Sporadic puppet failures on labvirt hosts to Sporadic puppet failures.

@Andrew did it reoccurred during last week? do you have a list of hostnames+time by any chance?

Spoke too soon, got another failure overnight.

Oct 23 06:25:20 labvirt1017 puppet-agent[161569]: (/Stage[main]/Openstack::Nova::Common::Base/File[/etc/nova/policy.json]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/openstack/mitaka/nova/common/policy.json: end of file reached

Note that 06:25 is cron.daily's time during which logrotate etc. run. The previous reports were for different times of the day though, so it may or may not be relevant.

@Volans, I don't have timestamps, but I do have this from our weekly meeting alert summary:

2018-10-27: labvirt1014 puppet transient page
2018-10-28: labvirt1017 puppet transient page
2018-10-30: labvirt1015 puppet transient page
2018-10-30: labvirt1010 puppet transient page

*bump* we're still getting these and my team is increasingly bleary and disoriented by all the middle-of-the-night pages. The most recent one was last night:

Nov  5 03:57:02 labvirt1013 puppet-agent-cronjob: Sleeping 13 for random splay
Nov  5 03:57:21 labvirt1013 puppet-agent-cronjob: INFO:debmonitor:Found 862 installed binary packages
Nov  5 03:57:21 labvirt1013 puppet-agent-cronjob: INFO:debmonitor:Found 103 upgradable binary packages (including new dependencies)
Nov  5 03:57:22 labvirt1013 puppet-agent-cronjob: INFO:debmonitor:Successfully sent the upgradable update to the DebMonitor server
Nov  5 03:57:26 labvirt1013 puppet-agent[3844]: Using configured environment 'production'
Nov  5 03:57:26 labvirt1013 puppet-agent[3844]: Retrieving pluginfacts
Nov  5 03:57:26 labvirt1013 puppet-agent[3844]: Retrieving plugin
Nov  5 03:57:27 labvirt1013 puppet-agent[3844]: Loading facts
Nov  5 03:57:41 labvirt1013 puppet-agent[3844]: Caching catalog for labvirt1013.eqiad.wmnet
Nov  5 03:57:42 labvirt1013 puppet-agent[3844]: (/Stage[main]/Base::Environment/Tidy[/var/tmp/core]) Tidying 0 files
Nov  5 03:57:43 labvirt1013 puppet-agent[3844]: Applying configuration version '1541390255'
Nov  5 03:57:44 labvirt1013 crontab[5479]: (root) LIST (root)
Nov  5 03:57:44 labvirt1013 crontab[5482]: (root) LIST (prometheus)
Nov  5 03:57:44 labvirt1013 crontab[5485]: (root) LIST (debmonitor)
Nov  5 03:57:45 labvirt1013 puppet-agent[3844]: (/Stage[main]/Base::Environment/File[/etc/profile.d/bash_autologout.sh]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/environment/bash_autologout.sh: end of file reached
Nov  5 03:57:55 labvirt1013 puppet-agent[3844]: Applied catalog in 12.98 seconds

The next run proceeded without incident:

Nov  5 04:28:04 labvirt1013 puppet-agent[29467]: Using configured environment 'production'
Nov  5 04:28:04 labvirt1013 puppet-agent[29467]: Retrieving pluginfacts
Nov  5 04:28:04 labvirt1013 puppet-agent[29467]: Retrieving plugin
Nov  5 04:28:05 labvirt1013 puppet-agent[29467]: Loading facts
Nov  5 04:28:18 labvirt1013 puppet-agent[29467]: Caching catalog for labvirt1013.eqiad.wmnet
Nov  5 04:28:20 labvirt1013 puppet-agent[29467]: (/Stage[main]/Base::Environment/Tidy[/var/tmp/core]) Tidying 0 files
Nov  5 04:28:20 labvirt1013 puppet-agent[29467]: Applying configuration version '1541392092'
Nov  5 04:28:32 labvirt1013 puppet-agent[29467]: Applied catalog in 12.85 seconds

After looking at the logs in puppetmaster1001/1002 and rhodium, I don't have much to add, unfortunately.

Just listing the history of where the bash_autologout.sh file was retrieved from in case it gives anyone an insight (notice it didn't reach puppetmaster1001 for a long time -- pure chance? And the 3:57 failure is not logged so perhaps the connection was closed before the puppetmaster even replied?):

2018-11-04T06:28:20 puppetmaster1002
2018-11-04T06:58:32 rhodium
2018-11-04T07:28:09 rhodium
2018-11-04T07:58:12 puppetmaster1002
2018-11-04T08:28:17 puppetmaster1002
2018-11-04T08:57:42 rhodium
2018-11-04T09:28:13 puppetmaster1002
2018-11-04T09:57:46 puppetmaster1001
2018-11-04T10:28:16 rhodium
2018-11-04T10:58:11 puppetmaster1002
2018-11-04T11:28:05 puppetmaster1001
2018-11-04T11:58:25 puppetmaster1002
2018-11-04T12:27:36 rhodium
2018-11-04T12:57:49 rhodium
2018-11-04T13:27:55 puppetmaster1002
2018-11-04T13:57:55 puppetmaster1001
2018-11-04T14:27:39 rhodium
2018-11-04T14:57:53 puppetmaster1001
2018-11-04T15:27:50 puppetmaster1002
2018-11-04T15:57:43 puppetmaster1002
2018-11-04T16:28:25 rhodium
2018-11-04T16:58:11 rhodium
2018-11-04T17:28:24 puppetmaster1002
2018-11-04T17:58:27 rhodium
2018-11-04T18:27:57 puppetmaster1001
2018-11-04T18:57:52 puppetmaster1002
2018-11-04T19:28:09 puppetmaster1002
2018-11-04T19:58:29 rhodium
2018-11-04T20:28:14 puppetmaster1001
2018-11-04T20:57:54 puppetmaster1002
2018-11-04T21:27:33 rhodium
2018-11-04T21:58:10 puppetmaster1002
2018-11-04T22:28:10 puppetmaster1001
2018-11-04T22:57:59 puppetmaster1002
2018-11-04T23:27:49 puppetmaster1002
2018-11-04T23:58:02 puppetmaster1001
2018-11-05T00:28:09 rhodium
2018-11-05T00:58:14 puppetmaster1002
2018-11-05T01:27:56 puppetmaster1002
2018-11-05T01:58:28 rhodium
2018-11-05T02:28:10 rhodium
2018-11-05T02:57:56 rhodium
2018-11-05T03:27:39 rhodium
2018-11-05T04:28:22 puppetmaster1002
2018-11-05T04:57:38 rhodium
2018-11-05T05:28:10 rhodium
2018-11-05T05:57:49 rhodium
`
jbond removed a subscriber: jbond.
Andrew claimed this task.

Haven't seen this in ages.