osmit-tre - Puppet error ("no certificate found and waitforcert is disabled")
Closed, Resolved (Public)

Description

Puppet error:

root@osmit-tre:~# puppet agent -t
Exiting; no certificate found and waitforcert is disabled

Event Timeline

labpuppetmaster1001:

labpuppetmaster1001# puppet cert list --all | grep osmit-tre
  "osmit-tre.eqiad.wmflabs"                                                        (SHA256) 76:02:0D:03:65:F2:07:DB:AD:43:1A:74:D3:70:C3:4F:12:EE:64:9D:6F:32:B6:01:16:DE:E8:D9:E2:09:76:83
+ "osmit-tre.osmit.eqiad.wmflabs"                                                  (SHA256) AB:A2:1A:39:30:26:81:95:43:72:B8:11:A8:E3:D6:45:D4:C2:EC:3D:64:02:E4:F7:72:18:07:73:16:9F:48:A9

Tried to create a new certificate:

root@osmit-tre:~# rm -rf /var/lib/puppet/ssl
root@osmit-tre:~# puppet agent -t
Info: Creating a new SSL key for osmit-tre.eqiad.wmflabs
Info: Caching certificate for ca
Info: csr_attributes file loading from /etc/puppet/csr_attributes.yaml
Info: Creating a new SSL certificate request for osmit-tre.eqiad.wmflabs
Info: Certificate Request fingerprint (SHA256): 00:85:87:7A:38:F2:CD:22:25:98:F8:87:D6:1D:26:52:BC:07:AB:6E:AA:B7:31:8C:04:8C:96:D1:6C:76:16:7A
Info: Caching certificate for ca
Exiting; no certificate found and waitforcert is disabled

Note that the certificate request is created for host osmit-tre.eqiad.wmflabs instead of osmit-tre.osmit.eqiad.wmflabs.

Cleaned up all certificates again and noticed the VM was missing an entry in /etc/hosts for itself.

root@osmit-tre:~# cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 ubuntu.openstack.eqiad.wmflabs ubuntu

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

172.16.3.167 osmit-tre.osmit.eqiad.wmflabs osmit-tre # Added this manually
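
A quick way to spot the missing self-entry before touching any certificates could look like the sketch below. The function name hosts_has_name is made up for illustration and is not part of any existing tooling; it only checks whether a hosts file lists a given name as a hostname or alias, ignoring comments.

```shell
#!/bin/sh
# Hypothetical helper (not an existing tool): succeed if the hosts file
# contains a non-comment entry listing NAME as a hostname or alias.
# Usage: hosts_has_name NAME [HOSTS_FILE]
hosts_has_name() {
    awk -v n="$1" '
        { sub(/#.*/, "") }    # strip trailing comments, fields are re-split
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }   # exit 0 when found, 1 otherwise
    ' "${2:-/etc/hosts}"
}
```

On the VM, something like hosts_has_name osmit-tre.osmit.eqiad.wmflabs || echo "no /etc/hosts entry for this host" would have flagged the state shown above before the manual line was added.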

Rebooted VM and tried to generate a new cert:

$ ssh osmit-tre.osmit.eqiad.wmflabs -l root
Linux osmit-tre 3.13.0-162-generic #212-Ubuntu SMP Mon Oct 29 12:08:50 UTC 2018 x86_64
Ubuntu 14.04.5 LTS
The last Puppet run was at Tue Nov 27 14:06:27 UTC 2018 (5610 minutes ago). 
Last login: Sat Dec  1 11:14:51 2018 from bastion-restricted-01.bastion.eqiad.wmflabs

root@osmit-tre:~# rm -rf /var/lib/puppet/ssl/

root@osmit-tre:~# puppet agent -t
Info: Creating a new SSL key for osmit-tre.osmit.eqiad.wmflabs
Info: Caching certificate for ca
Info: csr_attributes file loading from /etc/puppet/csr_attributes.yaml
Info: Creating a new SSL certificate request for osmit-tre.osmit.eqiad.wmflabs
Info: Certificate Request fingerprint (SHA256): 7B:EB:0D:5B:15:30:42:BB:D5:AA:75:29:92:B6:DC:07:8B:EF:5C:C8:60:F7:42:70:6E:66:E4:A9:B2:ED:4F:64
Info: Caching certificate for osmit-tre.osmit.eqiad.wmflabs
Info: Caching certificate_revocation_list for ca
Info: Caching certificate for ca
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for osmit-tre.osmit.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1543664329'
Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/root ]: Not removing directory; use 'force' to override
Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/root ]/ensure: removed
Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/root /etc]: Not removing directory; use 'force' to override
Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/root /etc]/ensure: removed
Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/root /etc/ssh]: Not removing directory; use 'force' to override
Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/root /etc/ssh]/ensure: removed
Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/root /etc/ssh/userkeys]: Not removing directory; use 'force' to override
Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/root /etc/ssh/userkeys]/ensure: removed
Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/root /etc/ssh/userkeys/root.d]: Not removing directory; use 'force' to override
Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/root /etc/ssh/userkeys/root.d]/ensure: removed
Notice: Applied catalog in 5.26 seconds

It was auto-signed successfully.

root@labpuppetmaster1001:~# puppet cert list --all | grep osmit-tre
+ "osmit-tre.osmit.eqiad.wmflabs"                                                  (SHA256) B1:18:15:7F:A6:F4:ED:71:B3:51:3B:07:CE:4A:29:F7:D7:78:83:10:5A:F1:3E:41:46:70:7A:47:6A:31:EB:10

@Andrew could this be a side effect of the project moving to eqiad1?

Same issue with osmit-tre and osm-serv instances. Fixed manually.

bd808 subscribed.

@Andrew could this be a side effect of the project moving to eqiad1?

I can't find the tickets, but I feel like I saw similar problems in a small number of instances that were moved several weeks ago.

It's hard to tell exactly what happened with this one now that it's fixed, but here's what I've been seeing:

  1. A few instances (probably built from a bad base image) run their 'firstboot' script on every boot and fail to disable it after a successful run
  2. This firstboot script is out of date and fails to pick up the correct project name, setting it to '{' in resolv.conf
  3. That means hostname -f returns <name>.eqiad.wmflabs rather than <name>.<project>.eqiad.wmflabs
  4. Puppet certname is based on the fqdn. If hostname -f returns the wrong fqdn then we wind up with an unsigned and unsignable puppet certname

The fix is to correct the project name in resolv.conf and re-run puppet, /and/ to remove /etc/rc.local to prevent the problem from recurring.
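
Since the certname follows the fqdn, comparing hostname -f against the expected <name>.<project>.eqiad.wmflabs form before deleting /var/lib/puppet/ssl would show whether a new request can be signed at all. A minimal sketch, assuming the instance and project names are known to the operator; check_fqdn is a hypothetical name, not an existing script:

```shell
#!/bin/sh
# Hypothetical helper: compare an actual fqdn with the expected
# <name>.<project>.eqiad.wmflabs form described above.
# Usage: check_fqdn ACTUAL_FQDN NAME PROJECT
check_fqdn() {
    actual="$1"
    expected="$2.$3.eqiad.wmflabs"
    if [ "$actual" = "$expected" ]; then
        echo "ok: $actual"
    else
        echo "mismatch: got $actual, expected $expected"
        return 1
    fi
}
```

For this task the call would be check_fqdn "$(hostname -f)" osmit-tre osmit, which reports a mismatch while hostname -f still returns osmit-tre.eqiad.wmflabs.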

...and now that I've said all that, I see that osmit-tre is a Trusty instance, and the above only ever happens on Jessie. So this remains a mystery :(

Could it be a timing issue where the VM is being moved but the reverse DNS isn't correct yet, so Puppet takes that and runs with it (causing a new cert request to be made, etc)?

GTirloni triaged this task as Medium priority.