Re-create ci slaves (April 2015)
Closed, Resolved (Public)

Description

  • integration-slave-precise-10xx and integration-slave-trusty-10xx will be created.
  • integration-slave100x and integration-slave140x will be destroyed.

Event Timeline

Krinkle created this task. Apr 2 2015, 9:57 PM
Krinkle claimed this task.
Krinkle raised the priority of this task to Normal.
Krinkle updated the task description.
Krinkle added a subscriber: Krinkle.
Restricted Application added a subscriber: Aklapper. Apr 2 2015, 9:57 PM
hashar added a subscriber: hashar. Apr 3 2015, 12:07 PM

I have recreated the integration-puppetmaster instance on Ubuntu Precise; see T94927: Downgrade intergration-puppetmaster back to Ubuntu Precise (re-create instance).

I have applied the puppet patches for:

Puppet is finishing up the installation of role::ci::slave::labs on the new precise instances integration-slave-precise-101[1-4].

Krinkle updated the task description. Apr 5 2015, 10:20 AM
Krinkle set Security to None.
Krinkle updated the task description.

All the new instances are failing to provision with the assigned role from the new puppet master.

SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost]

Sample from integration-slave-trusty-1001:

Apr  5 11:04:11 integration-slave-trusty-1001 puppet-agent[7171]: Unable to fetch my node definition, but the agent run will continue:
Apr  5 11:04:11 integration-slave-trusty-1001 puppet-agent[7171]: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost]
Apr  5 11:04:11 integration-slave-trusty-1001 puppet-agent[7171]: Retrieving plugin
Apr  5 11:04:11 integration-slave-trusty-1001 puppet-agent[7171]: (/File[/var/lib/puppet/lib]) Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost]
Apr  5 11:04:11 integration-slave-trusty-1001 puppet-agent[7171]: (/File[/var/lib/puppet/lib]) Could not evaluate: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost] Could not retrieve file metadata for puppet://integration-puppetmaster.eqiad.wmflabs/plugins: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost]
Apr  5 11:04:14 integration-slave-trusty-1001 puppet-agent[7171]: Loading facts in /var/lib/puppet/lib/facter/root_home.rb
..
Apr  5 11:04:14 integration-slave-trusty-1001 puppet-agent[7171]: Could not retrieve catalog from remote server: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost]
Apr  5 11:04:15 integration-slave-trusty-1001 puppet-agent[7171]: Using cached catalog
Apr  5 11:04:16 integration-slave-trusty-1001 puppet-agent[7171]: Applying configuration version '1428229359'
Apr  5 11:04:16 integration-slave-trusty-1001 puppet-agent[7171]: hostname: integration-slave-trusty-1001
Apr  5 11:04:16 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Role::Labs::Instance/Notify[hostname: integration-slave-trusty-1001]/message) defined 'message' as 'hostname: integration-slave-trusty-1001'
Apr  5 11:04:32 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Nrpe/Package[nagios-plugins-basic]/ensure) ensure changed 'purged' to 'present'
Apr  5 11:04:32 integration-slave-trusty-1001 puppet-agent[7171]: instanceproject: integration
Apr  5 11:04:32 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Role::Labs::Instance/Notify[instanceproject: integration]/message) defined 'message' as 'instanceproject: integration'
Apr  5 11:04:40 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Ldap::Client::Nss/Package[nscd]/ensure) ensure changed '2.19-0ubuntu6.5' to '2.19-0ubuntu6.6'
Apr  5 11:04:46 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Ldap::Client::Utils/Package[ldapvi]/ensure) ensure changed 'purged' to 'latest'
Apr  5 11:04:52 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Base::Standard-packages/Package[tree]/ensure) ensure changed 'purged' to 'latest'
Apr  5 11:04:59 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Base::Standard-packages/Package[ack-grep]/ensure) ensure changed 'purged' to 'latest'
Apr  5 11:05:01 integration-slave-trusty-1001 CRON[9668]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr  5 11:05:14 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Base::Standard-packages/Package[linux-tools-3.13.0-45-generic]/ensure) ensure changed 'purged' to 'present'
Apr  5 11:05:22 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Role::Labs::Instance/Package[puppet-lint]/ensure) ensure changed 'purged' to 'present'
Apr  5 11:05:28 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Base::Puppet/Package[virt-what]/ensure) ensure changed 'purged' to 'present'
..
Apr  5 11:05:43 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Role::Ntp/Ntp::Daemon[client]/Package[ntp]/ensure) ensure changed 'purged' to 'latest'
Apr  5 11:05:50 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Base::Standard-packages/Package[linux-tools-generic]/ensure) ensure changed 'purged' to 'present'
Apr  5 11:05:57 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Base::Monitoring::Host/Package[arcconf]/ensure) ensure changed 'purged' to 'latest'
Apr  5 11:05:57 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Base/File[/etc/default/acct]) Could not evaluate: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost] Could not retrieve file metadata for puppet:///modules/base/labs-acct.default: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost]
Apr  5 11:05:57 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Ldap::Client::Utils/File[/usr/local/bin/ldaplist]) Could not evaluate: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost] Could not retrieve file metadata for puppet:///modules/ldap/scripts/ldaplist: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost]
Apr  5 11:05:57 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Ldap::Client::Pam/File[/etc/pam.d/common-auth]) Could not evaluate: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost] Could not retrieve file metadata for puppet:///modules/ldap/common-auth: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost]
Apr  5 11:05:57 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Base::Monitoring::Host/File[/usr/local/bin/check-raid.py]) Could not evaluate: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost] Could not retrieve file metadata for puppet:///modules/base/monitoring/check-raid.py: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost]
..
Apr  5 11:07:31 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Ldap::Client::Nss/Service[nscd]) Dependency File[/etc/nsswitch.conf] has failures: true
Apr  5 11:07:31 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Ldap::Client::Nss/Service[nscd]) Skipping because of failed dependencies
Apr  5 11:07:31 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/ubuntu]) Not removing directory; use 'force' to override
Apr  5 11:07:31 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/ubuntu]/ensure) removed
Apr  5 11:07:31 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/ubuntu/.ssh]) Not removing directory; use 'force' to override
Apr  5 11:07:31 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/ubuntu/.ssh]/ensure) removed
Apr  5 11:07:31 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/ubuntu/.ssh/authorized_keys ]) Not removing directory; use 'force' to override
Apr  5 11:07:31 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/ubuntu/.ssh/authorized_keys ]/ensure) removed
..
Apr  5 11:07:31 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/ubuntu/.ssh/authorized_keys /public/keys/ubuntu/.ssh]) Not removing directory; use 'force' to override
Apr  5 11:07:31 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/ubuntu/.ssh/authorized_keys /public/keys/ubuntu/.ssh]/ensure) removed
Apr  5 11:07:31 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Passwords::Root/Ssh::Userkey[root]/File[/etc/ssh/userkeys/root]) Could not evaluate: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost] Could not retrieve file metadata for puppet:///private/ssh/root-authorized-keys: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost]
Apr  5 11:07:32 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Base::Environment/Sysctl::Parameters[core_dumps]/Sysctl::Conffile[core_dumps]/File[/etc/sysctl.d/70-core_dumps.conf]) Dependency File[/etc/sysctl.d] has failures: true
Apr  5 11:07:32 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Base::Environment/Sysctl::Parameters[core_dumps]/Sysctl::Conffile[core_dumps]/File[/etc/sysctl.d/70-core_dumps.conf]) Skipping because of failed dependencies
Apr  5 11:07:32 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Sysctl/Exec[update_sysctl]) Dependency File[/etc/sysctl.d] has failures: true
Apr  5 11:07:32 integration-slave-trusty-1001 puppet-agent[7171]: (/Stage[main]/Sysctl/Exec[update_sysctl]) Skipping because of failed dependencies
Apr  5 11:07:32 integration-slave-trusty-1001 puppet-agent[7171]: Finished catalog run in 196.60 seconds
Apr  5 11:07:32 integration-slave-trusty-1001 puppet-agent[7171]: Could not send report: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: localhost]
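
To see how much of a run like this is affected, the failures can be tallied per Puppet resource. The following is only a rough sketch: the function name `summarize_ssl_failures` and the regular expression are made up for illustration and assume the syslog line layout shown in the sample above. The two failing sample lines are copied from the log, with the repeated error tail dropped from the second one.

```python
import re
from collections import Counter

# Assumed layout: "... puppet-agent[PID]: (/Resource/Path) message".
# Lines without a parenthesised resource are attributed to the agent itself.
RESOURCE_RE = re.compile(r"puppet-agent\[\d+\]: \((/[^)]*)\)")
SSL_MARKER = "certificate verify failed"


def summarize_ssl_failures(lines):
    """Count log lines reporting an SSL verification failure, per resource."""
    counts = Counter()
    for line in lines:
        if SSL_MARKER not in line:
            continue
        match = RESOURCE_RE.search(line)
        counts[match.group(1) if match else "(agent)"] += 1
    return counts


sample = [
    "Apr  5 11:04:14 integration-slave-trusty-1001 puppet-agent[7171]: "
    "Could not retrieve catalog from remote server: SSL_connect returned=1 "
    "errno=0 state=SSLv3 read server certificate B: certificate verify failed: "
    "[self signed certificate in certificate chain for /CN=Puppet CA: localhost]",
    "Apr  5 11:05:57 integration-slave-trusty-1001 puppet-agent[7171]: "
    "(/Stage[main]/Base/File[/etc/default/acct]) Could not evaluate: "
    "SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: "
    "certificate verify failed: "
    "[self signed certificate in certificate chain for /CN=Puppet CA: localhost]",
    "Apr  5 11:04:16 integration-slave-trusty-1001 puppet-agent[7171]: "
    "Applying configuration version '1428229359'",
]

for resource, count in summarize_ssl_failures(sample).most_common():
    print(count, resource)
```

Counting is per log line rather than per occurrence of the error string, since a single line often repeats the same SSL message twice.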

The certificate errors reported above are due to a faulty puppet configuration change that switched the integration labs instances to a new DNS resolver. The new resolver adds the project name to the fully qualified hostname (i.e. host.integration.eqiad.wmflabs), which breaks the certificates.

I have reverted the change by setting the following in the hiera configuration:

use_dnsmasq: true
use_dnsmasq_server: true

I then reinstalled the puppetmaster and refreshed all certificates. See T95273: integration labs project DNS resolver improperly switched to openstack-designate for more details.
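
As a rough illustration of the naming change only (it does not model Puppet's SSL verification), the two hostname forms can be compared side by side. The helper `instance_fqdn` is made up for this sketch. The assumption, not confirmed for this setup, is that each agent's certificate name follows its FQDN, which is Puppet's default for `certname`.

```python
def instance_fqdn(host, project=None, site="eqiad", tld="wmflabs"):
    """Build an instance FQDN; passing `project` yields the project-qualified form."""
    labels = [host] + ([project] if project else []) + [site, tld]
    return ".".join(labels)


# Name form seen with the dnsmasq resolver (e.g. integration-puppetmaster.eqiad.wmflabs).
old_name = instance_fqdn("integration-slave-trusty-1001")
# Name form produced by the new resolver (host.integration.eqiad.wmflabs).
new_name = instance_fqdn("integration-slave-trusty-1001", project="integration")

print(old_name)
print(new_name)
# Under the stated assumption, a certificate issued for one name would not be
# the one an agent looks for under the other name.
print("names match:", old_name == new_name)
```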

hashar added a comment. Apr 7 2015, 2:40 PM

I have deleted and recreated integration-slave-trusty-1005 because it had been provisioned with role::ci::website instead of role::ci::slave::labs.

In creating integration-slave-trusty-1010 I ran into the following issues:

After the first two puppet runs, the base class (before applying role::ci::slave::labs) was still failing to install two packages:

..
Apr  8 18:25:46 integration-slave-trusty-1010 puppet-agent[1502]: E: Unable to locate package ldapvi
..
Apr  8 18:25:47 integration-slave-trusty-1010 puppet-agent[1502]: E: Unable to locate package ack-grep
..

After applying the role::ci::slave::labs role these eventually came in.

With the role applied and two more puppet runs, hhvm was still not installed for some reason. The log explains nothing about it:

Apr  8 19:01:11 integration-slave-trusty-1010 puppet-agent[15258]: hostname: integration-slave-trusty-1010
Apr  8 19:01:11 integration-slave-trusty-1010 puppet-agent[15258]: (/Stage[main]/Role::Labs::Instance/Notify[hostname: integration-slave-trusty-1010]/message) defined 'message' as 'hostname: integration-slave-trusty-1010'
..
Apr  8 19:01:18 integration-slave-trusty-1010 php5-curl: php5_invoke: Enable module curl for cli SAPI
Apr  8 19:01:56 integration-slave-trusty-1010 puppet-agent[15258]: (/Stage[main]/Contint::Packages::Labs/Package[hhvm-dev]/ensure) ensure changed 'purged' to 'present'
...

I eventually rebooted the instance and then suddenly the next puppet run did install it:

--
Apr  8 19:34:48 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Contint::Slave-scripts/Git::Clone[jenkins CI phpunit]/File[/srv/deployment/integration/phpunit]/ensure) created
..
Apr  8 19:34:48 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Contint::Slave-scripts/Git::Clone[jenkins CI Composer]/Exec[git_clone_jenkins CI Composer]/returns) executed successfully
Apr  8 19:34:55 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Hhvm/File[/etc/hhvm]/ensure) created
Apr  8 19:34:55 integration-slave-trusty-1010 crontab[2974]: (root) LIST (root)
Apr  8 19:34:55 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Hhvm/Cron[tidy_perf_maps]/ensure) created
Apr  8 19:34:56 integration-slave-trusty-1010 crontab[2975]: (root) REPLACE (root)
Apr  8 19:35:01 integration-slave-trusty-1010 CRON[3400]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr  8 19:35:02 integration-slave-trusty-1010 kernel: [  149.184636] init: hhvm main process (3438) terminated with status 1
Apr  8 19:35:02 integration-slave-trusty-1010 kernel: [  149.184649] init: hhvm main process ended, respawning
Apr  8 19:35:03 integration-slave-trusty-1010 kernel: [  149.265894] init: hhvm main process (3456) terminated with status 1
..

Apr  8 19:35:03 integration-slave-trusty-1010 kernel: [  149.980606] init: hhvm main process ended, respawning
Apr  8 19:35:03 integration-slave-trusty-1010 kernel: [  150.060870] init: hhvm main process (3531) terminated with status 1
Apr  8 19:35:03 integration-slave-trusty-1010 kernel: [  150.060885] init: hhvm respawning too fast, stopped
Apr  8 19:35:05 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Hhvm/Package[hhvm]/ensure) ensure changed 'purged' to 'present'
Apr  8 19:35:05 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Hhvm/Package[hhvm]) Scheduling refresh of Service[hhvm]
Apr  8 19:35:05 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Hhvm/File[/var/log/hhvm]/owner) owner changed 'www-data' to 'syslog'
Apr  8 19:35:05 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Hhvm/File[/var/log/hhvm]/mode) mode changed '0755' to '0775'
Apr  8 19:35:05 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Hhvm/File[/etc/hhvm/fcgi.ini]/ensure) defined content as '{md5}0838f8686ca59af6cbe599c6ef53a60f'
Apr  8 19:35:05 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Hhvm/File[/etc/hhvm/fcgi.ini]) Scheduling refresh of Service[hhvm]
Apr  8 19:35:05 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Hhvm/File[/etc/default/hhvm]/content) 
...

And now that the instance is rebooted, I noticed that Zuul still hasn't been installed.

Apr  8 19:03:13 integration-slave-trusty-1010 puppet-agent[15258]: (/Stage[main]/Zuul::User/User[zuul]/ensure) created
...
Apr  8 19:18:24 integration-slave-trusty-1010 puppet-agent[15258]: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install zuul' returned 100: Reading package lists...
Apr  8 19:18:24 integration-slave-trusty-1010 puppet-agent[15258]: Building dependency tree...
Apr  8 19:18:24 integration-slave-trusty-1010 puppet-agent[15258]: Reading state information...
Apr  8 19:18:24 integration-slave-trusty-1010 puppet-agent[15258]: E: Unable to locate package zuul
Apr  8 19:18:24 integration-slave-trusty-1010 puppet-agent[15258]: (/Stage[main]/Zuul/Package[zuul]/ensure) change from purged to present failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install zuul' returned 100: Reading package lists...
Apr  8 19:18:24 integration-slave-trusty-1010 puppet-agent[15258]: (/Stage[main]/Zuul/Package[zuul]/ensure) Building dependency tree...
Apr  8 19:18:24 integration-slave-trusty-1010 puppet-agent[15258]: (/Stage[main]/Zuul/Package[zuul]/ensure) Reading state information...
Apr  8 19:18:24 integration-slave-trusty-1010 puppet-agent[15258]: (/Stage[main]/Zuul/Package[zuul]/ensure) E: Unable to locate package zuul
...
Apr  8 19:34:25 integration-slave-trusty-1010 nslcd[1194]: [52255a] <passwd="jenkins-deploy"> (re)loading /etc/nsswitch.conf
Apr  8 19:34:26 integration-slave-trusty-1010 puppet-agent[1558]: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install zuul' returned 100: Reading package lists...
Apr  8 19:34:26 integration-slave-trusty-1010 puppet-agent[1558]: Building dependency tree...
Apr  8 19:34:26 integration-slave-trusty-1010 puppet-agent[1558]: Reading state information...
Apr  8 19:34:26 integration-slave-trusty-1010 puppet-agent[1558]: E: Unable to locate package zuul
Apr  8 19:34:26 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Zuul/Package[zuul]/ensure) change from purged to present failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install zuul' returned 100: Reading package lists...
Apr  8 19:34:26 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Zuul/Package[zuul]/ensure) Building dependency tree...
Apr  8 19:34:26 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Zuul/Package[zuul]/ensure) Reading state information...
Apr  8 19:34:26 integration-slave-trusty-1010 puppet-agent[1558]: (/Stage[main]/Zuul/Package[zuul]/ensure) E: Unable to locate package zuul
Apr  8 19:34:32 integration-slave-trusty-1010 gdnsd[2152]: No config file at '/etc/gdnsd/config', using defaults
...
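
For checking several instances, the packages apt could not find can be pulled out of the logs rather than read off by eye. This is a minimal sketch, assuming the `E: Unable to locate package NAME` wording seen in the excerpts above; the function name `missing_packages` is made up for illustration.

```python
import re

# apt's error wording as it appears in the puppet-agent log lines above.
MISSING_PKG_RE = re.compile(r"E: Unable to locate package (\S+)")


def missing_packages(lines):
    """Return package names apt could not locate, in first-seen order, without repeats."""
    seen = []
    for line in lines:
        match = MISSING_PKG_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


excerpt = [
    "Apr  8 18:25:46 integration-slave-trusty-1010 puppet-agent[1502]: E: Unable to locate package ldapvi",
    "Apr  8 18:25:47 integration-slave-trusty-1010 puppet-agent[1502]: E: Unable to locate package ack-grep",
    "Apr  8 19:18:24 integration-slave-trusty-1010 puppet-agent[15258]: E: Unable to locate package zuul",
    "Apr  8 19:34:26 integration-slave-trusty-1010 puppet-agent[1558]: E: Unable to locate package zuul",
]

print(missing_packages(excerpt))
```
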
hashar added a comment. Apr 8 2015, 9:10 PM

(/Stage[main]/Zuul/Package[zuul]/ensure) E: Unable to locate package zuul

I have switched the instances to the Zuul Debian package via the cherry-picked patch https://gerrit.wikimedia.org/r/#/c/202714/ . In short, it replaces the git::clone and python setup.py install steps with installation of the package.

The package hasn't been uploaded to apt.wikimedia.org. There is a Precise and a Trusty version of it in my home directory on integration.

Will refine tomorrow.

Thanks. The new trusty instances are now up and running. Precise next :)

Krinkle updated the task description. Apr 10 2015, 1:52 PM

The new integration-slave-precise-10xx instances have been successfully provisioned and are now pooled.

Krinkle closed this task as Resolved. Apr 10 2015, 1:53 PM
Krinkle moved this task from In-progress to Done on the Continuous-Integration-Infrastructure board.