
/etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster
Closed, Resolved, Public

Description

#####################################################################
##### THIS FILE IS MANAGED BY PUPPET
#####  as template('base/puppet.conf.d/10-main.conf.erb')
######################################################################

[main]
logdir = /var/log/puppet
vardir = /var/lib/puppet
ssldir = /var/lib/puppet/ssl
rundir = /var/run/puppet
factpath = $vardir/lib/facter

[agent]
server = labs-puppetmaster-eqiad.wikimedia.org


configtimeout = 960
usecacheonfailure = false
splay = true
prerun_command = /etc/puppet/etckeeper-commit-pre
postrun_command = /etc/puppet/etckeeper-commit-post
pluginsync = true
report = true
reports = statsd
# This file is managed by Puppet!

[main]
logdir = /var/log/puppet
vardir = /var/lib/puppet
ssldir = /var/lib/puppet/client/ssl
rundir = /var/run/puppet
factpath = $vardir/lib/facter

[agent]
server = deployment-puppetmaster.deployment-prep.eqiad.wmflabs
configtimeout = 480
splay = true
prerun_command = /etc/puppet/etckeeper-commit-pre
postrun_command = /etc/puppet/etckeeper-commit-post
pluginsync = true
report = true
certname = deployment-cache-text04.deployment-prep.eqiad.wmflabs

When this happens we can fix the file by hand and puppet works again, but we shouldn't have to.
While the file is in this state, puppet is broken:

krenair@deployment-cache-text04:~$ sudo puppet agent -tv
Error: Could not request certificate: The certificate retrieved from the master does not match the agent's private key.
Certificate fingerprint: FC:8F:F5:15:2D:0E:D2:4E:AB:F8:7E:FA:10:B3:E6:25:E8:B0:30:82:73:E3:06:49:EC:9E:30:75:80:63:0F:2E
To fix this, remove the certificate from both the master and the agent and then start a puppet run, which will automatically regenerate a certficate.
On the master:
  puppet cert clean deployment-cache-text04.deployment-prep.eqiad.wmflabs
On the agent:
  1a. On most platforms: find /var/lib/puppet/ssl -name deployment-cache-text04.deployment-prep.eqiad.wmflabs.pem -delete
  1b. On Windows: del "/var/lib/puppet/ssl/deployment-cache-text04.deployment-prep.eqiad.wmflabs.pem" /f
  2. puppet agent -t
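A quick way to spot this state before running the agent is to look for repeated section headers and for more than one server value. The helpers below are a hypothetical Python sketch, not part of the puppet module; they assume puppet.conf uses plain [section] headers and key = value lines, as in the file above.

```python
from __future__ import annotations


def find_duplicate_sections(text: str) -> list[str]:
    """Return section names that appear more than once, in order of first repeat."""
    seen: set[str] = set()
    dupes: list[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()
            if name in seen and name not in dupes:
                dupes.append(name)
            seen.add(name)
    return dupes


def values_for(text: str, key: str) -> list[str]:
    """Return every value assigned to `key` (across all sections), in file order."""
    values: list[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines that are not settings
        k, v = line.split("=", 1)
        if k.strip() == key:
            values.append(v.strip())
    return values


# Trimmed-down copy of the broken file from this task.
sample = """[main]
ssldir = /var/lib/puppet/ssl
[agent]
server = labs-puppetmaster-eqiad.wikimedia.org
# This file is managed by Puppet!
[main]
ssldir = /var/lib/puppet/client/ssl
[agent]
server = deployment-puppetmaster.deployment-prep.eqiad.wmflabs
"""

print(find_duplicate_sections(sample))
print(values_for(sample, "server"))
```

On the trimmed sample this flags both [main] and [agent] as duplicated and lists the two different puppetmasters, which is the symptom described in this task.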

Event Timeline

Krenair created this task. Apr 14 2016, 2:03 PM
Restricted Application added a subscriber: Aklapper. Apr 14 2016, 2:03 PM
hashar added a subscriber: hashar. Apr 18 2016, 8:34 PM

The /etc/puppet/puppet.conf file is generated by concatenating the files under /etc/puppet/puppet.conf.d. The snippets come from these templates:

modules/base/templates/puppet.conf.d/10-main.conf.erb
modules/puppet/templates/self.conf.erb
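A rough model of that compile step, assuming the snippets are simply concatenated in lexical filename order (the actual command behind Exec[compile puppet.conf] is not shown in this task, so the function below is hypothetical). Under that assumption 10-main.conf sorts before 10-self.conf, which matches the broken file: labs-wide puppetmaster first, project puppetmaster second.

```python
from pathlib import Path


def compile_puppet_conf(snippet_dir: Path, target: Path) -> str:
    """Concatenate the regular files in snippet_dir, sorted by filename, into target.

    Assumption: plain concatenation in lexical order; the real
    Exec[compile puppet.conf] command may differ.
    """
    snippets = sorted((p for p in snippet_dir.iterdir() if p.is_file()), key=lambda p: p.name)
    content = "".join(p.read_text() for p in snippets)
    target.write_text(content)
    return content
```

For example, compile_puppet_conf(Path("/etc/puppet/puppet.conf.d"), Path("/etc/puppet/puppet.conf")) with both 10-main.conf and 10-self.conf present produces two [main]/[agent] pairs, as in the task description.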

No idea why 10-main.conf would be generated though :(

A trace from deployment-cache-text04:

Info: Applying configuration version '1460511639'
Info: Computing checksum on file /etc/update-motd.d/05-role-puppetclient
Info: FileBucket got a duplicate file {md5}05ab1166fb3beb3d54a73f86068725fd
Info: /Stage[main]/Motd/File[/etc/update-motd.d/05-role-puppetclient]: Filebucketed /etc/update-motd.d/05-role-puppetclient to puppet with sum 05ab1166fb3beb3d54a73f86068725fd
Notice: /Stage[main]/Motd/File[/etc/update-motd.d/05-role-puppetclient]/ensure: removed
Notice: /Stage[main]/Base::Puppet/Base::Puppet::Config[main]/File[/etc/puppet/puppet.conf.d/10-main.conf]/ensure: created
Info: /Stage[main]/Base::Puppet/Base::Puppet::Config[main]/File[/etc/puppet/puppet.conf.d/10-main.conf]: Scheduling refresh of Exec[delete master certs]
Info: /Stage[main]/Base::Puppet/Base::Puppet::Config[main]/File[/etc/puppet/puppet.conf.d/10-main.conf]: Scheduling refresh of Exec[compile puppet.conf]
Notice: /Stage[main]/Base::Puppet/Exec[compile puppet.conf]: Triggered 'refresh' from 1 events

this just happened twice more :/

On 21/04/2016 19:59, Krenair wrote:

this just happened twice more :/

On which instance ? Looking at /var/log/puppet.log might give some tip
as to what is going on.

cache-text04 again, I'll look into it

Change 284852 had a related patch set uploaded (by 20after4):
Fix race in puppet::self (puppet.conf compilation)

https://gerrit.wikimedia.org/r/284852

Krenair assigned this task to mmodell. Apr 22 2016, 4:46 AM

(I went and found the code in puppet earlier but realised it was well beyond me. Looks like Mukunda is working on this.)

I think I found the race condition:

Puppet does not guarantee the order of operations when puppet::self::config sets ensure => absent on 10-main.conf, so puppet.conf can be compiled before 10-main.conf is removed. The result is a puppet.conf containing both 10-main.conf and 10-self.conf.

With any luck this patch should fix it.
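To make the race concrete: removing 10-main.conf and recompiling puppet.conf are two operations with no enforced order between them. The toy simulation below (hypothetical names and data, not Puppet code; it assumes the compile step concatenates snippets in filename order) runs both possible orders and counts the [agent] sections in the result.

```python
from __future__ import annotations

from itertools import permutations

REMOVE = "remove 10-main.conf"
COMPILE = "compile puppet.conf"


def simulate(order: tuple[str, ...], snippets: dict[str, str]) -> str:
    """Apply the operations in `order` to a copy of `snippets`; return the compiled output."""
    files = dict(snippets)
    compiled = ""
    for op in order:
        if op == REMOVE:
            files.pop("10-main.conf", None)
        elif op == COMPILE:
            compiled = "".join(files[name] for name in sorted(files))
    return compiled


snippets = {
    "10-main.conf": "[agent]\nserver = labs-puppetmaster-eqiad.wikimedia.org\n",
    "10-self.conf": "[agent]\nserver = deployment-puppetmaster.deployment-prep.eqiad.wmflabs\n",
}

for order in permutations([REMOVE, COMPILE]):
    sections = simulate(order, snippets).count("[agent]")
    print(" then ".join(order), "->", sections, "[agent] section(s)")
```

Only the order that removes 10-main.conf first yields a single [agent] section; the other order reproduces the doubled file.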

I'm going to cherry-pick the patch on beta. We'll see if it does the trick.

Actually, come to think of it, I'm not sure if it's enough to have this on deployment-puppetmaster, as puppet::self might be communicating with the labs puppetmaster. Anyway I guess we'll find out if it happens again.

It seems the hostname fqdn in 10-self.conf varies between runs :(

Notice: /Stage[main]/Puppet::Self::Config/Base::Puppet::Config[self]/File[/etc/puppet/puppet.conf.d/10-self.conf]/content: 
--- /etc/puppet/puppet.conf.d/10-self.conf	2016-04-19 14:52:19.795862882 +0000
+++ /tmp/puppet-file20160423-15039-1sej3kk	2016-04-23 09:22:04.639592608 +0000
@@ -8,12 +8,12 @@
 factpath = $vardir/lib/facter
 
 [agent]
-server = integration-puppetmaster.eqiad.wmflabs
+server = integration-puppetmaster.integration.eqiad.wmflabs
 configtimeout = 480
 splay = true
 prerun_command = /etc/puppet/etckeeper-commit-pre
 postrun_command = /etc/puppet/etckeeper-commit-post
 pluginsync = true
 report = true
-certname = integration-slave-trusty-1013.eqiad.wmflabs
+certname = integration-slave-trusty-1013.integration.eqiad.wmflabs

Both come from $::fqdn, so maybe name resolution is sometimes off.
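To see which settings drift between runs, a small hypothetical helper can diff two versions of a snippet. It assumes one value per key (section headers are ignored, which is fine for a single snippet like 10-self.conf). Fed the two versions from the diff above, it should report exactly server and certname, the two values derived from $::fqdn.

```python
from __future__ import annotations

from typing import Optional


def parse_settings(text: str) -> dict[str, str]:
    """Parse key = value lines into a dict, skipping blanks, comments and [section] headers."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "[")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def changed_settings(old: str, new: str) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    a, b = parse_settings(old), parse_settings(new)
    return {k: (a.get(k), b.get(k)) for k in sorted(a.keys() | b.keys()) if a.get(k) != b.get(k)}


old = """[agent]
server = integration-puppetmaster.eqiad.wmflabs
configtimeout = 480
certname = integration-slave-trusty-1013.eqiad.wmflabs
"""
# Same snippet with the project-qualified fqdn, as in the diff above.
new = old.replace(".eqiad.wmflabs", ".integration.eqiad.wmflabs")

for key, (before, after) in changed_settings(old, new).items():
    print(f"{key}: {before} -> {after}")
```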

deployment-cache-text04:/etc/puppet/puppet.conf.d/10-main.conf has reappeared

deployment-cache-text04 had the issue again. At first there were two puppet snippets:

# ls -l /etc/puppet/puppet.conf.d/
total 8
-r--r--r-- 1 root root 646 Apr 22 14:40 10-main.conf
-r--r--r-- 1 root root 488 Jul 29  2015 10-self.conf

Since 10-main.conf was the wrong one, I deleted it.

I then deleted /var/lib/puppet/ssl and /var/lib/puppet/client/ssl and ran puppet again.
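The manual recovery above could be scripted roughly as follows. This is a hypothetical sketch, not an existing tool: the paths are the ones from this task, it defaults to a dry run, and the final puppet agent run is still left to be done by hand.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def reset_puppet_agent(
    snippet_dir: Path, stale_snippet: str, ssl_dirs: list[Path], dry_run: bool = True
) -> list[Path]:
    """Delete the stale puppet.conf.d snippet and the agent SSL directories.

    Returns the paths that exist and were (or, with dry_run=True, would be) removed.
    """
    targets = [snippet_dir / stale_snippet, *ssl_dirs]
    existing = [p for p in targets if p.exists()]
    if not dry_run:
        for p in existing:
            if p.is_dir():
                shutil.rmtree(p)  # removes the whole directory tree
            else:
                # Deleting the read-only (r--r--r--) snippet only needs write access to its directory.
                p.unlink()
    return existing
```

Usage would look like reset_puppet_agent(Path("/etc/puppet/puppet.conf.d"), "10-main.conf", [Path("/var/lib/puppet/ssl"), Path("/var/lib/puppet/client/ssl")], dry_run=False), followed by sudo puppet agent -tv. Depending on the master, the old certificate may also need cleaning there, as the Puppet error message in the description suggests.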

hashar triaged this task as Normal priority. Apr 27 2016, 11:41 AM

https://gerrit.wikimedia.org/r/#/c/284852/ might solve the issue by ensuring 10-main.conf is gone before puppet.conf is compiled.
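The fix boils down to a dependency: compile puppet.conf only after 10-main.conf has been removed. In Puppet this kind of ordering is normally declared with relationship metaparameters such as before/require or notify/subscribe; the exact mechanism used in change 284852 is not reproduced here. The sketch below only models the idea with Python's graphlib, using hypothetical resource labels.

```python
from graphlib import TopologicalSorter

# Hypothetical resource labels; each key maps to the resources it must wait for.
graph = {
    "File[10-self.conf]": set(),
    "File[10-main.conf] ensure=absent": set(),
    # The explicit edge: compile only after both snippet changes are applied.
    "Exec[compile puppet.conf]": {"File[10-self.conf]", "File[10-main.conf] ensure=absent"},
}

order = list(TopologicalSorter(graph).static_order())
print(order)
```

With the edge in place the Exec always comes last; without it, nothing forces the removal to happen first.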

Mentioned in SAL [2016-04-27T11:43:17Z] <hashar> fixed puppet on deployment-cache-text04 T132689

mmodell removed mmodell as the assignee of this task. Jul 20 2016, 5:21 PM

I'm not sure if my patch fixed this, has anyone seen this happening anymore?

I haven't seen that occurring in a while on either the deployment-prep or integration labs projects.

What is left is to get the Puppet patch https://gerrit.wikimedia.org/r/#/c/284852/ reviewed and approved by ops. It is not trivial, but it has at least been proven to work for labs projects that have their own puppetmaster.

I saw it just a few days ago in deployment-prep...

Change 284852 merged by Dzahn:
Fix race in puppet::self (puppet.conf compilation)

https://gerrit.wikimedia.org/r/284852

hashar closed this task as Resolved. Aug 25 2016, 8:12 AM
hashar assigned this task to mmodell.
hashar added a subscriber: Dzahn.

Thanks @Dzahn

https://gerrit.wikimedia.org/r/284852 fixed most of the race condition, but as pointed out by @AlexMonk-WMF it still happened on July 21. Maybe that was a temporary glitch.

I am assuming it is properly fixed now. If we step on it again, we can always reopen and reinvestigate.

Thanks @mmodell for the not so trivial patch.