
New instances are stuck in "The certificate retrieved from the master does not match the agent's private key."
Closed, Declined (Public)

Description

I created a (vanilla, no configuration, smallest image) instance yesterday ("icinga-scfc-test") and the initial Puppet run didn't finish. So I waited several hours, but still no luck. Manual Puppet runs ("sudo puppetd -tv") showed:

err: Could not request certificate: The certificate retrieved from the master does not match the agent's private key.
Certificate fingerprint: 05:91:9E:EE:6C:28:8B:24:FE:19:39:66:03:93:6C:44
To fix this, remove the certificate from both the master and the agent and then start a puppet run, which will automatically regenerate a certficate.
On the master:
puppet cert clean i-00000906.pmtpa.wmflabs
On the agent:
rm -f /var/lib/puppet/ssl/certs/i-00000906.pmtpa.wmflabs.pem
puppet agent -t

I deleted the instance, created another one ("icinga-scfc-test2"), and ran into the same situation again. I deleted that instance too, created a bigger one ("icinga-scfc-test3"), and the error occurred there as well (after waiting several hours in each case).

I reported this error in December (cf. http://permalink.gmane.org/gmane.org.wikimedia.labs/1976), but then it apparently resolved itself after waiting (cf. http://permalink.gmane.org/gmane.org.wikimedia.labs/1977), while now there doesn't seem to be any light at the end of the tunnel. (i-00000906 is not (and was not then) the name of any of the created instances, but refers to labs-vmbuilder-precise.)


Version: unspecified
Severity: blocker

Details

Reference
bz61413

Event Timeline

scfc created this task. Feb 15 2014, 8:30 AM

I just now tried creating a new instance, and it came up fine, and the puppet cert worked. I do see the error on icinga-scfc-test3, though.

I want to think that this is some kind of occasional error that happens when there's an ID collision or when an old ID is used. But the fact that it's complaining about an ID different from the instance is very strange. I'll investigate further... in the meantime, though, if you create yet another instance, most likely it'll work :/

OK, on a working instance:

ls -ltra /var/lib/puppet/ssl/certs

total 16
-rw-r--r-- 1 puppet puppet 847 Feb 15 08:42 ca.pem
-rw-r--r-- 1 puppet puppet 883 Feb 15 08:43 i-00000a65.pmtpa.wmflabs.pem

On icinga-scfc-test3:

ls -ltra /var/lib/puppet/ssl/certs

total 20
-rw-r--r-- 1 puppet puppet 847 Feb 14 21:31 ca.pem
-rw-r----- 1 puppet puppet 883 Feb 14 21:32 i-00000a64.pmtpa.wmflabs.pem
-rw-r--r-- 1 puppet puppet 883 Feb 14 21:35 i-00000906.pmtpa.wmflabs.pem

Now my theory is that early in its life an instance thinks that its ID is i-00000906 (inherited by mistake from the original image build), and that if a user forces a puppet run during that early stage it tries to create a cert for the wrong ID and is forever after doomed. Is that possibly what happened here? Changing the certname in /etc/puppet/puppet.conf to the actual instance ID seems to resolve the problem.
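A minimal sketch of that check and workaround, assuming puppet.conf contains a single "certname = ..." line as quoted in this ticket. The function names get_certname and set_certname are hypothetical helpers, not part of Puppet, and the caller has to supply the actual instance ID:

```shell
# Hypothetical helpers; the names are illustrative and not from Puppet itself.

# Print the certname value configured in a puppet.conf-style file.
get_certname() {
    # $1: path to puppet.conf
    sed -n 's/^[[:space:]]*certname[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1
}

# Rewrite the certname line in place to the given name.
# Assumes the name contains no "/" or "&" (true for IDs like
# i-00000a66.pmtpa.wmflabs), since it is substituted into a sed expression.
set_certname() {
    # $1: path to puppet.conf, $2: desired certname
    tmp=$(mktemp) || return 1
    if sed "s/^\([[:space:]]*certname[[:space:]]*=[[:space:]]*\).*/\1$2/" "$1" > "$tmp"; then
        # Write back through the original file so its permissions are kept.
        cat "$tmp" > "$1"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

On an affected instance this would be run as root, for example get_certname /etc/puppet/puppet.conf, followed by set_certname with the instance's real ID if the two differ.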

(Another possibility, testing a weaker theory -- were specific puppet classes selected via the wikitech GUI before this instance was able to complete a puppet run?)

scfc added a comment. Feb 15 2014, 11:53 AM

Let's start with the last bit: No, I didn't even open the configuration field :-).

I usually run "sudo puppetd -tv" on the first login just because the initial motd is still the Ubuntu one.

But just now I created another instance ("ici"; the web form is really quick to react to Enter keys :-)), logged in, looked for a Puppet agent running ("ps auxfwww | fgrep puppet"), found none, looked in /var/lib/puppet/ssl/certs:

scfc@ici:~$ sudo ls -l /var/lib/puppet/ssl/certs
total 8
-rw-r--r-- 1 puppet puppet 847 Feb 15 11:37 ca.pem
-rw-r----- 1 puppet puppet 883 Feb 15 11:38 i-00000a66.pmtpa.wmflabs.pem
scfc@ici:~$

and ran Puppet - boom:

scfc@ici:~$ sudo puppetd -tv
info: Creating a new SSL key for i-00000906.pmtpa.wmflabs
info: Caching certificate for i-00000906.pmtpa.wmflabs
err: Could not request certificate: The certificate retrieved from the master does not match the agent's private key.
Certificate fingerprint: 05:91:9E:EE:6C:28:8B:24:FE:19:39:66:03:93:6C:44
To fix this, remove the certificate from both the master and the agent and then start a puppet run, which will automatically regenerate a certficate.
On the master:
puppet cert clean i-00000906.pmtpa.wmflabs
On the agent:
rm -f /var/lib/puppet/ssl/certs/i-00000906.pmtpa.wmflabs.pem
puppet agent -t
Exiting; failed to retrieve certificate and waitforcert is disabled
scfc@ici:~$

Afterwards:

scfc@ici:~$ sudo ls -l /var/lib/puppet/ssl/certs
total 12
-rw-r--r-- 1 puppet puppet 847 Feb 15 11:37 ca.pem
-rw-r----- 1 puppet puppet 883 Feb 15 11:46 i-00000906.pmtpa.wmflabs.pem
-rw-r----- 1 puppet puppet 883 Feb 15 11:38 i-00000a66.pmtpa.wmflabs.pem
scfc@ici:~$
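The listing above can be checked mechanically. The following is a rough sketch, assuming the layout shown in this ticket (ca.pem plus one <certname>.pem per agent certificate); unexpected_certs is a hypothetical helper name:

```shell
# Hypothetical helper: list certificates in a directory that belong neither
# to the CA (ca.pem) nor to the expected certname.
unexpected_certs() {
    # $1: certificate directory, $2: expected certname (without .pem)
    for f in "$1"/*.pem; do
        # Skip the unexpanded pattern when the directory has no .pem files.
        [ -e "$f" ] || continue
        name=$(basename "$f" .pem)
        case "$name" in
            ca|"$2") ;;                      # expected files, ignore
            *) printf '%s\n' "$name" ;;      # anything else is suspicious
        esac
    done
}
```

Called as root with /var/lib/puppet/ssl/certs and the instance's own ID, any output (such as the image's i-00000906.pmtpa.wmflabs) would indicate that a run happened under the wrong certname.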

So I created another instance ("ici2"), didn't do anything but looked at /etc/puppet/puppet.conf:

certname = i-00000906.pmtpa.wmflabs

Should that really be there?

I've looked at this quite a bit now, but still have no good solution. The problem seems limited to this particular project. I suspect that the very first step of instance startup ('firstboot.sh') is not running, since that should set up puppet.conf properly. That or there's some kind of early LDAP failure. I'll look at this more as soon as I have a chance.

So... here's what I think is happening:

  1. Puppet can't run on instances in the 'nagios' project. I think this is because of a name conflict... there seems to be a new /etc/sudoers.d/nagios file defined in production puppet which collides with the standard /etc/sudoers.d/<projectname> sudoers file. (All of this is speculation, I haven't looked for the offending class yet.)
  2. Being unable to complete a puppet run, puppet.conf was never updated by puppet.
  3. #2 shouldn't have mattered because in theory our image automatically sets up puppet.conf. But the image was broken due to the issue fixed in https://gerrit.wikimedia.org/r/#/c/113788/

With that fix, new instances now throw the following error, which is what led me to speculate about step 1:

err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: File[/etc/sudoers.d/nagios] is already defined in file /etc/puppet/manifests/sudo.pp at line 11; cannot redefine at /etc/puppet/manifests/sudo.pp:23 on node i-00000a70.pmtpa.wmflabs
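To illustrate the suspected collision, here is a small sketch that extracts the duplicated file from a message of the form quoted above and compares it with the per-project sudoers path /etc/sudoers.d/<projectname>. Both function names are hypothetical, and the sketch only parses the error text; it does not inspect the manifests:

```shell
# Hypothetical helper: pull the file path out of a "Duplicate definition"
# message of the form quoted in this ticket.
duplicate_file_path() {
    # $1: error message text
    printf '%s\n' "$1" |
        sed -n 's/.*Duplicate definition: File\[\([^]]*\)].*/\1/p'
}

# Succeeds if the duplicated path is the sudoers file named after the project.
collides_with_project() {
    # $1: error message text, $2: project name
    [ "$(duplicate_file_path "$1")" = "/etc/sudoers.d/$2" ]
}
```

For the message above and the project name nagios, collides_with_project succeeds, which is consistent with the speculation in step 1 but does not by itself identify the offending class.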

Assuming this is still a problem, does somebody plan to work on this or is everybody busy with the Eqiad migration?

I'm not actively working on it. The easy fix is to not have a project called 'nagios' :)

So far no one has claimed ownership of the 'nagios' project, which means it will probably be shut down in the migration, at which point this will be largely moot I think.

(In reply to Andrew Bogott from comment #7)

> So far no one has claimed ownership of the 'nagios' project, which means it
> will probably be shut down in the migration

Do you know if this happened, and does this make this ticket obsolete?

abogott: Do you know if this happened, and does this make this ticket obsolete?

Andrew added a comment. Jul 4 2014, 4:18 PM

Yes, I just now deleted the 'nagios' project as it had no instances.