
integration labs project DNS resolver improperly switched to openstack-designate
Closed, Resolved, Public


At some point over the last few days, the integration labs project was switched to the new DNS resolver based on OpenStack Designate, as described at

It has a bunch of unwanted side effects and we should revert back to dnsmasq:

  • dnsmasq has aliases configured so that public DNS entries resolve to the instance's internal IP instead of the public IP. That makes it possible to reach * from labs.
  • The instances' fully qualified hostnames are being changed to include the labs project. We use puppetmaster::self, where the client certificate names are based on the hostname, so the certificates are no longer recognized by the puppetmaster.

Event Timeline

hashar raised the priority of this task from to Needs Triage.
hashar updated the task description.
hashar added a subscriber: hashar.

That is caused by a puppet change which renames the variable from use_dnsmasq to use_dnsmasq_server. Since it is not set in hiera for the integration project, all instances magically migrated to the new resolver :(

I have set in the hiera configuration:

use_dnsmasq: true
use_dnsmasq_server: true

I guess all puppet client certificates need to be regenerated again :((((

So the problem is VERY nasty. All instances have been switched to the new DNS resolver, which causes the puppet client to fail.

The puppetmaster is f*** up as well. Once it is repaired, we need to MANUALLY revert each machine by editing /etc/resolv.conf:

echo 'domain eqiad.wmflabs
search eqiad.wmflabs
nameserver' > /etc/resolv.conf
/etc/init.d/nscd restart
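
A slightly safer variant of that manual edit could keep a backup and replace the file in one step. This is only a sketch: write_resolv_conf is a hypothetical helper, not something that exists on the instances, and the nameserver address has to be supplied by the caller (the 192.0.2.1 in the usage comment is a documentation placeholder, not the labs resolver).

```shell
#!/bin/sh
# Sketch: write a dnsmasq-style resolv.conf, keeping a backup of the old file.
# Usage (placeholder address, assumption): 
#   write_resolv_conf /etc/resolv.conf eqiad.wmflabs 192.0.2.1
write_resolv_conf() {
    target=$1
    domain=$2
    nameserver=$3
    if [ -z "$target" ] || [ -z "$domain" ] || [ -z "$nameserver" ]; then
        echo "usage: write_resolv_conf TARGET DOMAIN NAMESERVER" >&2
        return 2
    fi
    # Keep a copy of the current file so the change can be undone.
    if [ -f "$target" ]; then
        cp -p "$target" "$target.bak" || return 1
    fi
    # Write to a temporary file next to the target, then rename over it,
    # so readers never see a half-written resolv.conf.
    tmp="$target.tmp.$$"
    printf 'domain %s\nsearch %s\nnameserver %s\n' \
        "$domain" "$domain" "$nameserver" > "$tmp" || return 1
    mv "$tmp" "$target"
}
```

nscd still has to be restarted afterwards, as in the commands above.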

Since some instances have been migrated to a puppetmaster whose name has the labs project inserted (integration-puppetmaster.integration.eqiad.wmflabs), the certificate is no longer valid. It needs to be revoked on both client and server, and a new one generated.
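
To see which name a certificate would have carried before the switch, the project label can be stripped from the new-style hostname. The helper below is a sketch under the assumption that the project name is always the second DNS label; legacy_fqdn is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# Sketch: map a new-style name (host.PROJECT.eqiad.wmflabs) back to the
# old-style name (host.eqiad.wmflabs). Names without the project label
# are printed unchanged.
legacy_fqdn() {
    fqdn=$1
    project=$2
    host=${fqdn%%.*}     # first label
    rest=${fqdn#*.}      # everything after the first dot
    case $rest in
        "$project".*)
            stripped=${rest#"$project".}
            printf '%s.%s\n' "$host" "$stripped"
            ;;
        *)
            printf '%s\n' "$fqdn"
            ;;
    esac
}
```

For example, `legacy_fqdn integration-puppetmaster.integration.eqiad.wmflabs integration` prints `integration-puppetmaster.eqiad.wmflabs`, which can then be compared against the certname in puppet.conf.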

I think certificate signing is handled automatically nowadays, so you only need the patience for two Puppet runs? Also, /etc/resolv.conf should automatically revert when use_dnsmasq_server is true (again).

hashar claimed this task.

So I eventually fixed it, and that was nasty.

On integration-puppetmaster I have:

  • manually rewritten /etc/puppet/puppet.conf
  • deleted all ssl directories under /var/lib/puppet
  • manually written /etc/puppet/fileserver.conf and /etc/puppet/auth.conf, which were missing for some reason

More madness until I got the puppet agent to run properly on it.

Then on every single instance of the integration project I ran /home/hashar/ :

set -ex
echo 'domain eqiad.wmflabs
search eqiad.wmflabs
nameserver' > /etc/resolv.conf
/etc/init.d/nscd restart
rm -fR /var/lib/puppet/client/ssl/
puppet agent -tv || :
sleep 61
puppet agent -tv

The sleep 61 is there because a cronjob on the puppetmaster auto-signs the keys.
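
Instead of a fixed sleep, the second run could be retried until the certificate has been signed. The function below is a generic sketch; retry is a hypothetical helper, and the attempt count and delay in the usage note are arbitrary assumptions.

```shell
#!/bin/sh
# Sketch: run a command until it succeeds or the attempts run out.
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}
```

It could be used as `retry 5 30 puppet agent -tv`. One caveat: as far as I know, `puppet agent -t` enables detailed exit codes, so a run that applied changes exits with 2 and would be treated as a failure here; a small wrapper that accepts 0 and 2 may be needed.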

I then manually ran puppet on each instance to confirm they are all working properly.

All good for now.

hashar triaged this task as Unbreak Now! priority. Apr 7 2015, 12:20 PM
hashar set Security to None.

I don't think this was actually caused by that patch; I suspect that it's another example of which the referenced patch may or may not have fixed adequately.

@hashar we had the same problem yesterday in the staging project. The hiera hack should have been a no-op on every environment unless you had use_dnsmasq set to anything other than True in hiera. If use_dnsmasq was not set in hiera, it should have fallen back to the top-level variable set in ldap.

The variable use_dnsmasq is set in ldap. For instance, on integration-slave-trusty-1005:

root@staging-palladium:~# ldapsearch -LLL -x -D 'cn=proxyagent,ou=profile,dc=wikimedia,dc=org' -w $(grep -Po "(?<=bindpw).*" /etc/ldap.conf) -b 'ou=hosts,dc=wikimedia,dc=org' "associatedDomain=integration-slave-trusty-1005*"
puppetVar: use_dnsmasq=true
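
To pull a single puppetVar out of that output in a script, a small filter is enough. This is a sketch: puppet_var is a hypothetical helper, and it assumes plain (not base64-encoded) LDIF lines of the form shown above. Whatever it prints is text, so `true` here is the string, not a boolean.

```shell
#!/bin/sh
# Sketch: read ldapsearch LDIF output on stdin and print the value of one
# puppetVar. Only the first match is printed.
# puppet_var NAME
puppet_var() {
    sed -n "s/^puppetVar: $1=//p" | head -n 1
}
```

For example, piping the ldapsearch command above into `puppet_var use_dnsmasq` would print `true` for this host.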

The puppet change, $use_dnsmasq_server = hiera('use_dnsmasq', $::use_dnsmasq), should use the hiera variable use_dnsmasq unless it's not set, in which case it uses the top-level variable $::use_dnsmasq, which is set in ldap.

@Andrew determined that in staging (and possibly here) $::use_dnsmasq == 'true', i.e. the string. That was causing our problem, because $::use_dnsmasq needed to be equal to the boolean True, which is why we needed the hiera hack.

As to why the variable name changed, the puppet DSL explicitly disallows reassignment:

The puppet failures were due to the hostname of the puppetmaster changing. That causes puppetmaster self to no longer recognize the master as being the master, and to alter puppet.conf to remove the [master] section. The puppetmaster process is still around, though, and ends up with the SSL cert of the client, which is in the [main] section.

Notice: /Stage[main]/Puppet::Self::Config/File[/etc/puppet/puppet.conf.d/10-self.conf]/content:
--- /etc/puppet/puppet.conf.d/10-self.conf 2015-03-13 19:55:22.551467834 +0000
+++ /tmp/puppet-file20150409-10851-13ijz3i-0 2015-04-09 12:51:24.041629301 +0000
@@ -3,7 +3,7 @@
 [main]
 logdir = /var/log/puppet
 vardir = /var/lib/puppet
-ssldir = /var/lib/puppet/server/ssl
+ssldir = /var/lib/puppet/client/ssl
 rundir = /var/run/puppet
 factpath = $vardir/lib/facter
@@ -15,23 +15,5 @@
 postrun_command = /etc/puppet/etckeeper-commit-post
 pluginsync = true
 report = true
-certname = i-0000015c.eqiad.wmflabs
+certname = i-0000015c.deployment-prep.eqiad.wmflabs
-bindaddress =
-ca_md = sha1
-certname = i-0000015c.eqiad.wmflabs
-thin_storeconfigs = true
-templatedir = /etc/puppet/templates
-modulepath = /etc/puppet/private/modules:/etc/puppet/modules
-# SSL
-ssldir = /var/lib/puppet/server/ssl/
-ssl_client_header = SSL_CLIENT_S_DN
-ssl_client_verify_header = SSL_CLIENT_VERIFY
-hostcert = /var/lib/puppet/server/ssl/certs/deployment-salt.eqiad.wmflabs.pem
-hostprivkey = /var/lib/puppet/server/ssl/private_keys/deployment-salt.eqiad.wmflabs.pem
-dbadapter = sqlite3
-external_nodes = /usr/local/bin/
-node_terminus = exec
accurately describe the diff that happened.
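
A quick way to spot an affected puppetmaster might be to check whether its configuration still has a [master] section. This is a sketch only; has_master_section is a hypothetical helper, and which file to check (puppet.conf or the 10-self.conf fragment) is an assumption that depends on how the config is assembled on the host.

```shell
#!/bin/sh
# Sketch: succeed if the given puppet config file contains a [master]
# section header, fail otherwise.
# has_master_section FILE
has_master_section() {
    grep -q '^[[:space:]]*\[master\]' "$1"
}
```

For example, `has_master_section /etc/puppet/puppet.conf || echo "master section missing"` would flag a host whose config was rewritten as a plain client.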

Beta ends up with the same issue T95586

Restricted Application added subscribers: Jay8g, TerraCodes.