
New jessie instance can't attach to puppet due to wrong certname
Closed, ResolvedPublic

Description

I have created a new Jessie instance integration-lightslave-jessie-1002 which points to a self puppet master. The puppet conf has:

[agent]
server = integration-puppetmaster.eqiad.wmflabs
certname = i-00000cdb.eqiad.wmflabs

Running puppet yields:

Warning: Server hostname 'integration-puppetmaster.eqiad.wmflabs' did not match server certificate;
expected one of
i-00000a4c.integration.eqiad.wmflabs,
DNS:i-00000a4c.integration.eqiad.wmflabs,
DNS:integration-puppetmaster.integration.eqiad.wmflabs,
DNS:puppet,
DNS:puppet.integration.eqiad.wmflabs
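
For illustration, here is a minimal Python sketch of the comparison behind that warning. It is not Puppet's actual verification code: the function name `matches_certificate` is hypothetical, and the check is simplified to an exact, case-insensitive match against the names listed in the warning (no wildcard handling).

```python
# Names the master's certificate is valid for, copied from the warning above.
CERT_NAMES = [
    "i-00000a4c.integration.eqiad.wmflabs",
    "integration-puppetmaster.integration.eqiad.wmflabs",
    "puppet",
    "puppet.integration.eqiad.wmflabs",
]


def matches_certificate(server, cert_names=CERT_NAMES):
    """Return True if `server` is one of the names on the certificate.

    Simplification (assumption): exact, case-insensitive comparison only.
    """
    wanted = server.rstrip(".").lower()
    return any(wanted == name.lower() for name in cert_names)


if __name__ == "__main__":
    for server in (
        "integration-puppetmaster.eqiad.wmflabs",              # name in puppet.conf
        "integration-puppetmaster.integration.eqiad.wmflabs",  # name on the certificate
    ):
        print(server, "->", matches_certificate(server))
```

Under these assumptions the name configured in puppet.conf does not match, while the name that includes the project subdomain does.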

/etc/resolv.conf has:

domain integration.eqiad.wmflabs
search integration.eqiad.wmflabs eqiad.wmflabs
nameserver 208.80.154.20
options timeout:5 ndots:2

Hiera:integration has:

classes:
   - role::puppet::self
"role::puppet::self::master": integration-puppetmaster
"role::puppet::self::enc": yaml+ldap

Event Timeline

hashar raised the priority of this task from to Needs Triage.
hashar updated the task description.
hashar added projects: Cloud-Services, Cloud-VPS.
hashar added a subscriber: hashar.

Is puppetmaster running latest puppet?

hashar set Security to None.
hashar removed a subscriber: yuvipanda.

In Hiera:integration I have changed the puppet master from 'integration-puppetmaster' to 'integration-puppetmaster.integration.eqiad.wmflabs'. https://phabricator.wikimedia.org/T102108

Then manually tweaked /etc/puppet/puppet.conf:

[agent]
- server = integration-puppetmaster.eqiad.wmflabs
+ server = integration-puppetmaster.integration.eqiad.wmflabs

The connection works, but the server-side catalog compilation fails:

Error: Could not retrieve catalog from remote server:
Error 400 on SERVER: must be a simple hostname.  The project-specific domain will be automatically appended. at /etc/puppet/manifests/role/puppet.pp:67 on node i-00000cdb.eqiad.wmflabs

Which is role::puppet::self:

if $master != undef {
    if $master =~ /\./ {
        fail("$::puppetmaster must be a simple hostname.  The project-specific domain will be automatically appended.")
    }

https://gerrit.wikimedia.org/r/#/c/215333/ asserts that role::puppet::self::master is set to a simple hostname (not an FQDN). @yuvipanda reverted the Hiera:Integration change ( https://wikitech.wikimedia.org/w/index.php?title=Hiera:Integration&action=history ).
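
As a rough sketch of that rule, the Python below mirrors the `$master =~ /\./` test quoted from role::puppet::self. The function name `master_fqdn` and the `project_domain` parameter are hypothetical, and how the Puppet code actually appends the project-specific domain is not shown in this task, so the concatenation is an assumption.

```python
import re


def master_fqdn(master, project_domain):
    """Reject dotted names, then append the project domain.

    Mirrors the quoted Puppet check: any dot in `master` is an error.
    The way the domain is appended here is an assumption.
    """
    if re.search(r"\.", master):
        raise ValueError(
            "master must be a simple hostname. "
            "The project-specific domain will be automatically appended."
        )
    return f"{master}.{project_domain}"


if __name__ == "__main__":
    # Simple hostname: accepted and expanded.
    print(master_fqdn("integration-puppetmaster", "integration.eqiad.wmflabs"))
    # FQDN: rejected, as in the Error 400 above.
    try:
        master_fqdn("integration-puppetmaster.integration.eqiad.wmflabs",
                    "integration.eqiad.wmflabs")
    except ValueError as err:
        print("rejected:", err)
```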

yuvipanda claimed this task.

Seems to work now.

Puppet did:

 [agent]
-server = integration-puppetmaster.eqiad.wmflabs
+server = integration-puppetmaster.integration.eqiad.wmflabs
 configtimeout = 480
 splay = true
 prerun_command = /etc/puppet/etckeeper-commit-pre
 postrun_command = /etc/puppet/etckeeper-commit-post
 pluginsync = true
 report = true
-certname = i-00000cdb.eqiad.wmflabs
+certname = i-00000cdb.integration.eqiad.wmflabs

Seems like this must have been due to an un-updated puppet repo on the master. Is that right?

hashar reopened this task as Open. (Edited Jun 11 2015, 2:00 PM)

I have created a new instance, integration-t102108.eqiad.wmflabs. Let's see what happens.

integration-t102108 is a Puppet client of integration-puppetmaster.eqiad.wmflabs (puppetclient)

Same error:

Info: Caching certificate for i-00000cdc.eqiad.wmflabs
Error: Could not request certificate:
Server hostname 'integration-puppetmaster.eqiad.wmflabs' did not match server certificate; expected one of
  i-00000a4c.integration.eqiad.wmflabs,
  DNS:i-00000a4c.integration.eqiad.wmflabs,
  DNS:integration-puppetmaster.integration.eqiad.wmflabs,
  DNS:puppet, DNS:puppet.integration.eqiad.wmflabs

Exiting; failed to retrieve certificate and waitforcert is disabled

It was provisioned with puppet.conf agent.server = integration-puppetmaster.eqiad.wmflabs, which is not in the list of expected entries.

There is one entry based on the ec2id: i-00000a4c.integration.eqiad.wmflabs

Another one uses the new DNS scheme: integration-puppetmaster.integration.eqiad.wmflabs


I guess you need to adjust modules/labs_bootstrapvz/files/firstboot.sh and generate a new Jessie image. It has:

domain=`hostname -d`   # now yields integration.eqiad.wmflabs
if [ "${domain}" == "eqiad.wmflabs" ]
then
    master="labs-puppetmaster.wikimedia.org"
    master_secondary="labcontrol2001.wikimedia.org"
fi

So I guess at this point $master and $master_secondary are unset. But:

sed -i "s/_FQDN_/${idfqdn}/g" /etc/puppet/puppet.conf
sed -i "s/_MASTER_/${master}/g" /etc/puppet/puppet.conf

idfqdn being set to i-00000cdc.integration.eqiad.wmflabs
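
To make that guess concrete, here is a small Python simulation of the quoted fragment only. The rest of firstboot.sh is not shown in this task, so whether `master` gets a value elsewhere is unknown; `TEMPLATE` and the function names are hypothetical stand-ins for the real puppet.conf template and script logic.

```python
def pick_masters(domain):
    """Mimic the quoted `if` block: masters are only set for the old domain."""
    master = ""            # assumption: unset shell variables expand to ""
    master_secondary = ""
    if domain == "eqiad.wmflabs":
        master = "labs-puppetmaster.wikimedia.org"
        master_secondary = "labcontrol2001.wikimedia.org"
    return master, master_secondary


def render_puppet_conf(template, idfqdn, master):
    """Mimic the two sed substitutions on puppet.conf."""
    return template.replace("_FQDN_", idfqdn).replace("_MASTER_", master)


# Hypothetical template; the real file contents are not shown in this task.
TEMPLATE = "[agent]\nserver = _MASTER_\ncertname = _FQDN_\n"

if __name__ == "__main__":
    for domain in ("eqiad.wmflabs", "integration.eqiad.wmflabs"):
        master, _ = pick_masters(domain)
        print(f"# hostname -d = {domain}")
        print(render_puppet_conf(TEMPLATE, f"i-00000cdc.{domain}", master))
```

With the new per-project domain the comparison fails, so in this simulation `_MASTER_` is replaced by an empty string.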

I think the issue is that new instances have the wrong domain, which means they can't contact the puppet server, which means they can't fix their domain.

I'll be building a new base image today -- we'll check and make sure it helps.

I'm still fighting with Jessie, but can you validate that the Trusty (testing) image works properly for you?

OK, I'm confident this is resolved with the new images. Reopen if you still can't get it to work.

On the integration project, which uses custom salt and puppet masters, I have created one instance for Jessie and one for Trusty. Neither works :-D

integration-t102108-jessie-new

Horizon console

integration-t102108-trusty-new

Horizon console

Might be some oddity with our puppet/salt master configurations though.

From a discussion with Andrew:

Seems to be caused by the puppet signer on labcontrol1001. Instances hit the generic puppet master to be able to apply ::self. The signer does not always work, which is filed as T102193.

I deleted the previous instances and created two fresh ones with puppet.git up-to-date.

integration-t102108-trusty-new2 https://horizon.wikimedia.org/project/instances/ee1fa9f9-0aac-4d0b-97e0-dbff0619ff0f/console

integration-t102108-jessie-new2 https://horizon.wikimedia.org/project/instances/e4866b5a-2407-40fe-b333-43dbdae740de/console

They both suffer from a 3-minute boot delay because there is no NFS server available. The shared NFS has been disabled on integration (T90610#1344487).

Filing another task.

hashar reassigned this task from yuvipanda to Andrew.

The puppet signer has been fixed by Andrew. The two instances I created (suffixed -new2) are both working fine as far as I can tell.

I have deleted the two test instances I created on the integration labs project.