Page MenuHomePhabricator

Puppet failure on instance creation
Closed, ResolvedPublic

Description

I was attempting to add two new instances (utrs-live, and utrs-database) to our project (for T143349) and puppet seems to be failing to assign SSL keys to the instance, failing due to a connection refusal, and stopping there. The two outputs are relatively the same and therefore I'm only including the one.

* Stopping Send an event to indicate plymouth is up[74G[ OK ]
 * Starting Mount filesystems on boot[74G[ OK ]
 * Starting Populate /dev filesystem[74G[ OK ]
 * Starting Populate and link to /run filesystem[74G[ OK ]
 * Stopping Populate /dev filesystem[74G[ OK ]
 * Stopping Populate and link to /run filesystem[74G[ OK ]
 * Stopping Track if upstart is running in a container[74G[ OK ]
 * Starting Initialize or finalize resolvconf[74G[ OK ]
 * Starting Signal sysvinit that the rootfs is mounted[74G[ OK ]
 * Starting Clean /tmp directory[74G[ OK ]
 * Stopping Clean /tmp directory[74G[ OK ]
Cloud-init v. 0.7.5 running 'init-local' at Wed, 25 Jan 2017 19:17:30 +0000. Up 2.37 seconds.
cloud-init-nonet[2.64]: waiting 10 seconds for network device
 * Starting set console keymap[74G[ OK ]
 * Starting Signal sysvinit that virtual filesystems are mounted[74G[ OK ]
 * Starting Signal sysvinit that virtual filesystems are mounted[74G[ OK ]
 * Starting Bridge udev events into upstart[74G[ OK ]
 * Starting Signal sysvinit that remote filesystems are mounted[74G[ OK ]
 * Starting device node and kernel event manager[74G[ OK ]
 * Stopping set console keymap[74G[ OK ]
 * Starting load modules from /etc/modules[74G[ OK ]
 * Starting cold plug devices[74G[ OK ]
 * Starting log initial device creation[74G[ OK ]
 * Stopping load modules from /etc/modules[74G[ OK ]
 * Starting Uncomplicated firewall[74G[ OK ]
 * Starting configure network device security[74G[ OK ]
 * Starting configure network device security[74G[ OK ]
 * Starting Mount network filesystems[74G[ OK ]
 * Starting Upstart job to start rpcbind on boot only[74G[ OK ]
 * Stopping Upstart job to start rpcbind on boot only[74G[ OK ]
 * Stopping Mount network filesystems[74G[ OK ]
 * Starting RPC portmapper replacement[74G[ OK ]
 * Starting LLDP daemon[74G[ OK ]
 * Starting NSM status monitor[74G[ OK ]
 * Starting configure network device[74G[ OK ]
 * Starting Bridge socket events into upstart[74G[ OK ]
 * Starting Mount network filesystems[74G[ OK ]
cloud-init-nonet[4.36]: static networking is now up
 * Starting configure network device[74G[ OK ]
 * Stopping cold plug devices[74G[ OK ]
 * Stopping log initial device creation[74G[ OK ]
 * Starting set console font[74G[ OK ]
 * Stopping Mount network filesystems[74G[ OK ]
 * Stopping set console font[74G[ OK ]
 * Starting userspace bootsplash[74G[ OK ]
Cloud-init v. 0.7.5 running 'init' at Wed, 25 Jan 2017 19:17:32 +0000. Up 4.70 seconds.
 * Starting Send an event to indicate plymouth is up[74G[ OK ]
ci-info: ++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++
ci-info: +--------+------+-------------+---------------+-------------------+
ci-info: | Device |  Up  |   Address   |      Mask     |     Hw-Address    |
ci-info: +--------+------+-------------+---------------+-------------------+
ci-info: |   lo   | True |  127.0.0.1  |   255.0.0.0   |         .         |
ci-info: |  eth0  | True | 10.68.18.83 | 255.255.248.0 | fa:16:3e:be:40:7c |
ci-info: +--------+------+-------------+---------------+-------------------+
ci-info: +++++++++++++++++++++++++++++++Route info+++++++++++++++++++++++++++++++
ci-info: +-------+-------------+------------+---------------+-----------+-------+
ci-info: | Route | Destination |  Gateway   |    Genmask    | Interface | Flags |
ci-info: +-------+-------------+------------+---------------+-----------+-------+
ci-info: |   0   |   0.0.0.0   | 10.68.16.1 |    0.0.0.0    |    eth0   |   UG  |
ci-info: |   1   |  10.68.16.0 |  0.0.0.0   | 255.255.248.0 |    eth0   |   U   |
ci-info: +-------+-------------+------------+---------------+-----------+-------+
 * Stopping Send an event to indicate plymouth is up[74G[ OK ]
 * Stopping userspace bootsplash[74G[ OK ]
2017-01-25 19:17:36,998 - __init__.py[WARNING]: Format for 'users' key must be a comma separated string or a dictionary or a list and not NoneType
2017-01-25 19:17:37,393 - __init__.py[WARNING]: Format for 'users' key must be a comma separated string or a dictionary or a list and not NoneType
Generating public/private rsa key pair.
Your identification has been saved in /etc/ssh/ssh_host_rsa_key.
Your public key has been saved in /etc/ssh/ssh_host_rsa_key.pub.
The key fingerprint is:
6f:e2:09:75:cf:ad:d2:9d:5c:32:b1:ca:f5:9d:46:5a root@utrs-live
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|                 |
|              .  |
|        S .    o |
|       . o o .=E.|
|      . . ooo==*o|
|       o +. +o+oo|
|        o  .. .  |
+-----------------+
Generating public/private dsa key pair.
Your identification has been saved in /etc/ssh/ssh_host_dsa_key.
Your public key has been saved in /etc/ssh/ssh_host_dsa_key.pub.
The key fingerprint is:
c1:05:65:0a:17:f3:1a:31:9d:9e:13:d5:f5:c2:a8:51 root@utrs-live
The key's randomart image is:
+--[ DSA 1024]----+
|      . B=+oE. ..|
|       + O+. o. .|
|        =.oo. o .|
|         ++o   . |
|        S ..     |
|                 |
|                 |
|                 |
|                 |
+-----------------+
Generating public/private ecdsa key pair.
Your identification has been saved in /etc/ssh/ssh_host_ecdsa_key.
Your public key has been saved in /etc/ssh/ssh_host_ecdsa_key.pub.
The key fingerprint is:
01:65:c3:7a:83:29:f1:f2:14:7e:ff:1f:47:0c:f7:e5 root@utrs-live
The key's randomart image is:
+--[ECDSA  256]---+
|      .o+        |
|    . .o..       |
|     + =.     . o|
|    o B +.     =o|
|     = oSo      E|
|      .   .    . |
|           .  . .|
|            .  o |
|             ..  |
+-----------------+
Generating public/private ed25519 key pair.
Your identification has been saved in /etc/ssh/ssh_host_ed25519_key.
Your public key has been saved in /etc/ssh/ssh_host_ed25519_key.pub.
The key fingerprint is:
5d:f2:52:f4:6c:98:60:e9:a3:b0:fa:0e:7b:08:2b:62 root@utrs-live
The key's randomart image is:
+--[ED25519  256--+
|          o..    |
|         ..o =   |
|         .. = +  |
|      .  .o= .   |
|       oS.o..    |
|  .   . .  .     |
|   o.o           |
|oE. oo.          |
|o.  .+o          |
+-----------------+
 * Starting Signal sysvinit that local filesystems are mounted[74G[ OK ]
 * Starting configure network device security[74G[ OK ]
 * Stopping Mount filesystems on boot[74G[ OK ]
 * Starting flush early job output to logs[74G[ OK ]
 * Stopping Failsafe Boot Delay[74G[ OK ]
 * Starting NFSv4 id <-> name mapper[74G[ OK ]
 * Starting System V initialisation compatibility[74G[ OK ]
 * Stopping flush early job output to logs[74G[ OK ]
 * Starting configure virtual network devices[74G[ OK ]
 * Starting D-Bus system message bus[74G[ OK ]
 * Stopping rpcsec_gss daemon[74G[ OK ]
 * Starting SystemD login management service[74G[ OK ]
 * Starting system logging daemon[74G[ OK ]
 * Starting Handle applying cloud-config[74G[ OK ]
 * Starting Bridge file events into upstart[74G[ OK ]
Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
 * Starting AppArmor profiles       [80G [74G[ OK ]
Cloud-init v. 0.7.5 running 'modules:config' at Wed, 25 Jan 2017 19:17:38 +0000. Up 10.18 seconds.
 * Setting up X socket directories...       [80G [74G[ OK ]
 * Stopping System V initialisation compatibility[74G[ OK ]
 * Starting System V runlevel compatibility[74G[ OK ]
 * Starting Salt Minion[74G[ OK ]
 * Starting save kernel messages[74G[ OK ]
 * Starting regular background program processing daemon[74G[ OK ]
 * Starting CPU interrupts balancing daemon[74G[ OK ]
 * Stopping save kernel messages[74G[ OK ]
 * Starting OpenSSH server[74G[ OK ]
Turning on process accounting, file set to '/var/log/account/pacct'.
 * Done.
 * Starting MTA       [80G [74G[ OK ]
Generating locales...
 * Starting Name Service Cache Daemon nscd       [80G [74G[ OK ]
  en_US.UTF-8... hostname: Name or service not known
up-to-date
Generation complete.
 * Starting LDAP connection daemon nslcd       [80G [74G[ OK ]
 * Stopping Handle applying cloud-config[74G[ OK ]
 * Starting puppet agent       [80G [74G[ OK ]
 * Restoring resolver state...       [80G [74G[ OK ]
 * Stopping System V runlevel compatibility[74G[ OK ]
Cloud-init v. 0.7.5 running 'modules:final' at Wed, 25 Jan 2017 19:17:44 +0000. Up 16.70 seconds.
+ echo 'Enabling console logging for puppet while it does the initial run'
Enabling console logging for puppet while it does the initial run
+ echo 'daemon.* |/dev/console'
+ restart rsyslog
rsyslog start/running, process 1284
+ /sbin/vgdisplay -c vd
File descriptor 3 (socket:[11271]) leaked on vgdisplay invocation. Parent PID 1270: /bin/bash
  Volume group "vd" not found
+ echo 'Creating the volume group'
Creating the volume group
+ /sbin/parted -s /dev/vda print
Model: Virtio Block Device (virtblk)
Disk /dev/vda: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system     Flags
 1      32.3kB  19.5GB  19.5GB  primary  ext4
 2      19.5GB  20.0GB  512MB   primary  linux-swap(v1)

+ /sbin/parted -ms /dev/vda print
BYT;
/dev/vda:42.9GB:virtblk:512:512:msdos:Virtio Block Device;
1:32.3kB:19.5GB:19.5GB:ext4::;
2:19.5GB:20.0GB:512MB:linux-swap(v1)::;
+ /sbin/parted -s /dev/vda print free
Model: Virtio Block Device (virtblk)
Disk /dev/vda: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system     Flags
 1      32.3kB  19.5GB  19.5GB  primary  ext4
        19.5GB  19.5GB  279kB            Free Space
 2      19.5GB  20.0GB  512MB   primary  linux-swap(v1)
        20.0GB  42.9GB  23.0GB           Free Space

+ /sbin/parted -ms /dev/vda print free
BYT;
/dev/vda:42.9GB:virtblk:512:512:msdos:Virtio Block Device;
1:32.3kB:19.5GB:19.5GB:ext4::;
1:19.5GB:19.5GB:279kB:free;
2:19.5GB:20.0GB:512MB:linux-swap(v1)::;
1:20.0GB:42.9GB:23.0GB:free;
++ /sbin/parted -ms /dev/vda print free
++ /usr/bin/cut -d : -f 2,3 '--output-delimiter= '
++ /usr/bin/tail -n 1
+ /sbin/parted -s /dev/vda mkpart primary 20.0GB 42.9GB
++ /sbin/parted -ms /dev/vda print
++ /usr/bin/cut -d : -f 1
++ /usr/bin/tail -n 1
+ part=3
+ '[' 3 '!=' '' ']'
+ '[' 3 -gt 1 ']'
+ /sbin/parted -s /dev/vda set 3 lvm on
+ /sbin/pvcreate /dev/vda3
File descriptor 3 (socket:[11271]) leaked on pvcreate invocation. Parent PID 1270: /bin/bash
  Physical volume "/dev/vda3" successfully created
+ /sbin/vgcreate vd /dev/vda3
File descriptor 3 (socket:[11271]) leaked on vgcreate invocation. Parent PID 1270: /bin/bash
  Volume group "vd" successfully created
+ /sbin/partprobe
++ curl http://169.254.169.254/openstack/latest/meta_data.json/
++ sed -r 's/^.*project_id\": \"//'
++ sed -r 's/\".*$//g'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100   881  100   881    0     0   113k      0 --:--:-- --:--:-- --:--:--  122k
+ project=utrs
++ curl http://169.254.169.254/1.0/meta-data/local-ipv4
+ ip=10.68.18.83
++ hostname
+ hostname=utrs-live
++ hostname -d
++ sed -r 's/.*\.([^.]+\.[^.]+)$/\1/'
hostname: Name or service not known
+ domain=
+ fqdn=utrs-live.utrs.
+ saltfinger=c5:b1:35:45:3e:0a:19:70:aa:5f:3a:cf:bf:a0:61:dd
+ '[' '' == eqiad.wmflabs ']'
+ '[' '' == codfw.wmflabs ']'
+ sed -i s/_PROJECT_/utrs/g /etc/security/access.conf
+ sed -i s/_PROJECT_/utrs/g /etc/ldap/ldap.conf
+ sed -i s/_PROJECT_/utrs/g /etc/sudo-ldap.conf
+ sed -i s/_PROJECT_/utrs/g /etc/nslcd.conf
+ sed -i s/_FQDN_/utrs-live.utrs./g /etc/puppet/puppet.conf
+ sed -i s/_MASTER_//g /etc/puppet/puppet.conf
+ echo ''
+ mkdir /etc/dhcp/dhclient-enter-hooks.d
mkdir: cannot create directory '/etc/dhcp/dhclient-enter-hooks.d': File exists
+ cat
++ /usr/bin/dig +short labs-recursor0.wikimedia.org
+ nameserver=208.80.155.118
+ cat
+ echo '10.68.18.83	utrs-live.utrs.'
+ /etc/init.d/nslcd restart
hostname: Name or service not known
 * Restarting LDAP connection daemon nslcd
   ...done.
+ /etc/init.d/nscd restart
 * Restarting Name Service Cache Daemon nscd
   ...done.
+ dpkg-reconfigure -fnoninteractive -pcritical openssh-server
ssh stop/waiting
ssh start/running, process 1461
+ /etc/init.d/ssh stop
ssh stop/waiting
+ /etc/init.d/ssh start
ssh start/running, process 1489
+ nscd -i hosts
+ echo utrs-live.utrs.
+ echo -e 'master: \n'
+ echo 'id: utrs-live.utrs.'
+ echo 'master_finger: c5:b1:35:45:3e:0a:19:70:aa:5f:3a:cf:bf:a0:61:dd'
+ echo utrs-live.utrs.
+ /etc/init.d/salt-minion restart
salt-minion stop/waiting
salt-minion start/running, process 1500
+ puppet agent --enable
+ puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff --waitforcert=10 --certname=utrs-live.utrs. --server=
[0;32mInfo: Creating a new SSL key for utrs-live.utrs.[0m
[1;31mError: Could not request certificate: Connection refused - connect(2)[0m

and repeats that last line over and over and over again

Event Timeline

The key bit here seems to be Could not request certificate: Connection refused when trying to talk to the Puppetmaster. The + sed -i s/_MASTER_//g /etc/puppet/puppet.conf line looks suspicious too, but I don't know enough about the first boot configuration to understand if this is normal or not.

I tried to reproduce by creating a new instance and see a big difference in that initial setup:

+ project=mediawiki-vagrant
++ curl http://169.254.169.254/1.0/meta-data/local-ipv4
+ ip=10.68.19.67
++ hostname
+ hostname=t156297-repro-test
++ sed -r 's/.*\.([^.]+\.[^.]+)$/\1/'
++ hostname -d
+ domain=eqiad.wmflabs
+ fqdn=t156297-repro-test.mediawiki-vagrant.eqiad.wmflabs
+ saltfinger=c5:b1:35:45:3e:0a:19:70:aa:5f:3a:cf:bf:a0:61:dd
+ '[' eqiad.wmflabs == eqiad.wmflabs ']'
+ master=labs-puppetmaster-eqiad.wikimedia.org
+ '[' eqiad.wmflabs == codfw.wmflabs ']'
+ sed -i s/_PROJECT_/mediawiki-vagrant/g /etc/security/access.conf
+ sed -i s/_PROJECT_/mediawiki-vagrant/g /etc/ldap/ldap.conf
+ sed -i s/_PROJECT_/mediawiki-vagrant/g /etc/sudo-ldap.conf
+ sed -i s/_PROJECT_/mediawiki-vagrant/g /etc/nslcd.conf
+ sed -i s/_FQDN_/t156297-repro-test.mediawiki-vagrant.eqiad.wmflabs/g /etc/puppet/puppet.conf
+ sed -i s/_MASTER_/labs-puppetmaster-eqiad.wikimedia.org/g

I'm not sure what exactly failed in the bootstrapping, but the equivalent log lines in the failure case are:

+ project=utrs
++ curl http://169.254.169.254/1.0/meta-data/local-ipv4
+ ip=10.68.18.83
++ hostname
+ hostname=utrs-live
++ hostname -d
++ sed -r 's/.*\.([^.]+\.[^.]+)$/\1/'
hostname: Name or service not known
+ domain=
+ fqdn=utrs-live.utrs.
+ saltfinger=c5:b1:35:45:3e:0a:19:70:aa:5f:3a:cf:bf:a0:61:dd
+ '[' '' == eqiad.wmflabs ']'
+ '[' '' == codfw.wmflabs ']'
+ sed -i s/_PROJECT_/utrs/g /etc/security/access.conf
+ sed -i s/_PROJECT_/utrs/g /etc/ldap/ldap.conf
+ sed -i s/_PROJECT_/utrs/g /etc/sudo-ldap.conf
+ sed -i s/_PROJECT_/utrs/g /etc/nslcd.conf
+ sed -i s/_FQDN_/utrs-live.utrs./g /etc/puppet/puppet.conf
+ sed -i s/_MASTER_//g /etc/puppet/puppet.conf

Does this project use its own puppetmaster?

As mentioned on IRC, I have no clue. The option was not presented in the setup options as far as I know.

@DeltaQuad ok thank you, @Andrew can you look at this when you get a second? I'm not sure what's going on at the moment in this project.

Change 334224 had a related patch set uploaded (by Andrew Bogott):
Keystone: Assign more threads to the admin API uwsgi handler

https://gerrit.wikimedia.org/r/334224

The interesting part in that log is this:

hostname: Name or service not known

That means that the instance didn't get a DNS record. That seems to be because designate is asking keystone to resolve some things and keystone is refusing the connection. This is all very fragile, but the attached patch may get us around the 'keystone too busy' issue at least.

In any case, this is most likely an intermittent error, so retrying is always a good approach.

Change 334224 merged by Andrew Bogott:
Keystone: Assign more threads to the admin API uwsgi handler

https://gerrit.wikimedia.org/r/334224

Change 334228 had a related patch set uploaded (by Andrew Bogott):
Designate sink: Don't use keystone to resolve project_id

https://gerrit.wikimedia.org/r/334228

https://gerrit.wikimedia.org/r/#/c/334224/ (add more threads) doesn't seem to help much... something must be leaking connections. I'll have to dig deeper into the uwsgi setup to see why this is so touchy.

Change 334228 merged by Andrew Bogott:
Designate sink: Don't use keystone to resolve project_id

https://gerrit.wikimedia.org/r/334228

Andrew claimed this task.

334228 should resolve this issue (although not the general keystone performance problem.) Delete and recreate your instance and all should be well.

Mentioned in SAL (#wikimedia-labs) [2017-01-26T05:19:07Z] <AmandaNP> delete utrs-live and utrs-database per T156297

@Andrew It appears recreation was successful, although the number of notices on the console output is crazy. Is that to be expected?

I'm not sure I know what you mean by the 'console output'. If you're talking about the system log then, yes, it always tells you a lot.