Page MenuHomePhabricator

Puppet times out on newly created instance in the puppet-diffs project
Closed, DeclinedPublic

Description

I have created a new instance compiler1003.puppet-diffs.eqiad.wmflabs and eventually deleted it after 35 minutes of puppet running on it.

I created it a second time and it is still slow https://horizon.wikimedia.org/project/instances/7ee33824-55f0-4c80-ba7f-c317bea301a3/ . The console shows errors such as :

/Stage[main]/Base::Environment/File[/etc/profile.d/field.sh]:
Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/environment/field.sh: execution expired

Sounds like an issue on the WMCS puppetmaster?

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 15 2019, 1:43 PM

Eventually the instance console has shown the login prompt after 2300 seconds (38 minutes). Running puppet again:

Notice: Applied catalog in 7.76 seconds
Andrew renamed this task from Puppet times out on newly created instance to Puppet times out on newly created instance in the puppet-diffs project.Jul 15 2019, 3:03 PM
Andrew added a subscriber: Andrew.Jul 15 2019, 3:11 PM

Note that puppet installs a bunch of files happily before getting to this point. So it's not a total failure of file-serving, it's something specific.

That said, it's also not /just/ that file that fails, but seemingly anything after that. So maybe it's something about big catalogs causing a timeout someplace...

Mentioned in SAL (#wikimedia-cloud) [2019-07-15T20:14:37Z] <andrewbogott> moved project-wide hiera settings into a 'compiler' prefix so I can test the project without implicit hiera settings for T228056

Thanks to an extremely tedious binary search, I've determined that this is not a puppetmaster issue. It correlates with this hiera setting:

base::resolving::labs_additional_domains:
- eqiad.wmflabs
- wikimedia.org
- eqiad.wmnet
- codfw.wmnet
- esams.wmnet
- ulsfo.wmnet eqsin.wmnet

disappointingly, fixing that mistaken newline issue doesn't resolve anything

So... @hashar is the work around for this just to wait a while after building a new compiler node? Or is that not adequate?

Ah indeed on compiler1002.puppet-diffs.eqiad.wmflabs:

/etc/resolv.conf
## THIS FILE IS MANAGED BY PUPPET
##
## source: modules/base/resolv.conf.labs.erb
## from:   base::resolving

domain puppet-diffs.eqiad.wmflabs
search puppet-diffs.eqiad.wmflabs eqiad.wmflabs eqiad.wmflabs wikimedia.org eqiad.wmnet codfw.wmnet esams.wmnet ulsfo.wmnet eqsin.wmnet
nameserver 208.80.154.143
nameserver 208.80.154.24
options timeout:2 ndots:1

If the huge search setting is the cause of slowdown, I am tempted to just remove it. Having to wait 40 minutes for the initial puppet run is just crazy :] The setting comes from https://horizon.wikimedia.org/project/puppet/

base::resolving::labs_additional_domains:
- wikimedia.org
- eqiad.wmnet
- codfw.wmnet
- esams.wmnet
- ulsfo.wmnet
- eqsin.wmnet

Maybe there is a reason for it such as maybe looking up a hostname using their base hostname and relying on DNS to lookup the FQDN used as the puppet node name.

Looking at compiler1004, it seems the metadata retrieval done by puppet has a 1 minute timeout of some sort:

Jul 15 20:40:30 host-172-16-3-150 puppet-agent[8052]: (/Stage[main]/Base::Environment/File[/usr/local/bin/gen_fingerprints]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/enviro
nment/gen_fingerprints: execution expired

Jul 15 20:40:30 host-172-16-3-150 puppet-agent[8052]: (/Stage[main]/Base::Environment/File[/var/tmp/core]/ensure) created

<1 minute sleep>

Jul 15 20:41:30 host-172-16-3-150 puppet-agent[8052]: (/Stage[main]/Base::Environment/File[/etc/vim/vimrc.local]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/environment/vimrc
.local: execution expired

Jul 15 20:41:30 host-172-16-3-150 puppet-agent[8052]: (/Stage[main]/Base::Phaste/File[/etc/phaste.conf]/ensure) defined content as '{md5}8ae7edf0bb31e20441ff3b8e053e1644'

<1 minute sleep>

Jul 15 20:42:30 host-172-16-3-150 rc.local[400]: #033[1;31mError: /Stage[main]/Base::Phaste/File[/usr/local/bin/phaste]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/phaste.py:
execution expired#033[0m

etc..

The first occurrence is at 20:39:30 which happens just after resolv.conf got updated (at 20:38:05). Eventually later other files timesout after 40 seconds or 13 seconds. So that is hmm a mystery.


In puppet 4.8.2, the error message comes from line 194 below:

lib/puppet/type/file/source.rb
 10 module Puppet
 11   # Copy files from a local or remote source. ...
...
 16   Puppet::Type.type(:file).newparam(:source) do
...
168     # Provide, and retrieve if necessary, the metadata for this file.  Fail
169     # if we can't find data about this host, and fail if there are any
170     # problems in our query.
171     def metadata
172       return @metadata if @metadata
173 
174       if @metadata = resource.catalog.metadata[resource.title]
175         return @metadata
176       end
177 
178       return nil unless value
179       value.each do |source|
180         begin
181           options = {
182             :environment          => resource.catalog.environment_instance,
183             :links                => resource[:links],
184             :checksum_type        => resource[:checksum],
185             :source_permissions   => resource[:source_permissions]
186           }
187 
188           if data = Puppet::FileServing::Metadata.indirection.find(source, options)
189             @metadata = data
190             @metadata.source = source
191             break
192           end
193         rescue => detail
194           self.fail Puppet::Error, "Could not retrieve file metadata for #{source}: #{detail}", detail
195         end
196       end
197       self.fail "Could not retrieve information from environment #{resource.catalog.environment} source(s) #{value.join(", ")}" unless @metadata
198       @metadata
199     end

So I guess lets just unset base::resolving::labs_additional_domains ?

So I guess lets just unset base::resolving::labs_additional_domains ?

That got added by @Joe for the compiler:

59d807a240f45f388d1391e2b1fd394294ea897f
ad05f15d154293c5fd96969e654bf439eafb2d70

So I guess it is required :-\

hashar closed this task as Declined.Sep 17 2019, 1:11 PM

I could not figure it out, so I guess we just have to wait for a while on the initial puppet run. After that the instance seems to behave properly.