@Papaul noticed that the time of the first Puppet run during recent reimages in codfw has increased a lot.
Looking at one reimage from today (es2028), I found a possible culprit, but I can't investigate further right now.
From Puppetboard (search for Connection timed out) we have:
2020-09-10T18:26:04.309+00:00 /Stage[main]/Base::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs] info, ferm, file, ferm::conf, conf, defs, class, base::firewall, base, firewall, profile::base::firewall, profile, role::spare::system, role, spare, system Scheduling refresh of Service[ferm] /etc/puppet/modules/ferm/manifests/conf.pp:14

2020-09-10T18:41:47.467+00:00 /Stage[main]/Base::Firewall/Ferm::Conf[main]/File[/etc/ferm/conf.d/02_main] err, ferm, file, ferm::conf, conf, main, class, base::firewall, base, firewall, profile::base::firewall, profile, role::spare::system, role, spare, system Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/firewall/main-input-default-drop.conf: Connection timed out /etc/puppet/modules/ferm/manifests/conf.pp:14
The second step failed roughly 15 minutes after the previous one had completed.
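For reference, the gap between the two Puppetboard timestamps above can be computed directly (a minimal sketch; Python 3.7+ is needed for %z to accept the +00:00 offset):

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%f%z"
refresh = datetime.strptime("2020-09-10T18:26:04.309+00:00", FMT)
failure = datetime.strptime("2020-09-10T18:41:47.467+00:00", FMT)
print(failure - refresh)  # 0:15:43.158000, i.e. roughly 15 minutes
```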
The related Apache log line on puppetmaster2001 (frontend) is:
2020-09-10T18:26:04 33054 2620:0:860:101:be97:e1ff:fe57:da84 proxy-server/200 259 GET http://puppet:8140/puppet/v3/file_metadata/modules/base/firewall/main-input-default-drop.conf?environment=production&links=manage&checksum_type=md5&source_permissions=ignore - application/json - - Puppet/5.5.10 Ruby/2.5.5-p157 (x86_64-linux-gnu) - - - - 2620:0:860:101:be97:e1ff:fe57:da84 - -
Note the client's automatically configured (SLAAC) IPv6 address, 2620:0:860:101:be97:e1ff:fe57:da84.
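For context, an interface identifier with ff:fe in the middle, like be97:e1ff:fe57:da84 above, is the modified EUI-64 form that SLAAC derives from the NIC's MAC address. A minimal sketch of that derivation, using a hypothetical MAC (bc:97:e1:57:da:84) inferred by reversing the identifier, not read from the host:

```python
def eui64_interface_id(mac: str) -> str:
    """Derive the modified EUI-64 interface identifier from a MAC address."""
    b = [int(x, 16) for x in mac.split(":")]
    b[0] ^= 0x02                      # flip the universal/local bit
    b = b[:3] + [0xFF, 0xFE] + b[3:]  # insert ff:fe in the middle
    return ":".join(f"{b[i] << 8 | b[i + 1]:x}" for i in range(0, 8, 2))

# Hypothetical MAC, inferred from the log line above.
print(eui64_interface_id("bc:97:e1:57:da:84"))  # -> be97:e1ff:fe57:da84
```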
And the same on puppetmaster2003 (backend) should be:
2020-09-10T18:26:04 32184 2620:0:860:101:10:192:0:27 -/200 259 GET http://puppetmaster2003.codfw.wmnet:8141/puppet/v3/file_metadata/modules/base/firewall/main-input-default-drop.conf?environment=production&links=manage&checksum_type=md5&source_permissions=ignore - application/json - 2620:0:860:101:be97:e1ff:fe57:da84 Puppet/5.5.10 Ruby/2.5.5-p157 (x86_64-linux-gnu) - - - - 2620:0:860:101:10:192:0:27 - -
That's the last request with that IP; the ones after it are made from the mapped IPv6 address 2620:0:860:101:10:192:0:140.
On es2028 we have:
Sep 10 18:26:04 es2028 puppet-agent[1196]: (/Stage[main]/Profile::Standard/Interface::Add_ip6_mapped[main]/Augeas[eno3_v6_token]/returns) executed successfully
Sep 10 18:26:04 es2028 puppet-agent[1196]: (/Stage[main]/Profile::Standard/Interface::Add_ip6_mapped[main]/Exec[eno3_v6_token]/returns) executed successfully
That matches perfectly and explains the IP change. My understanding is that applying the mapped IP prevents the next request to the puppetmaster from succeeding, and a 15-minute timeout then stalls the whole run.
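For illustration, interface::add_ip6_mapped appears to build the "mapped" address by reusing the host's IPv4 octets verbatim as the last four IPv6 groups (that is what the addresses above suggest; es2028's IPv4 address of 10.192.0.140 is inferred from the mapped address, not checked on the host). A minimal sketch of that assumed convention, which also shows why the puppetmaster suddenly sees a different client IP than the SLAAC one:

```python
def mapped_v6(prefix: str, ipv4: str) -> str:
    """Assumed convention: append the IPv4 octets, written as-is,
    as the last four groups of the host's IPv6 prefix."""
    return prefix + ":" + ipv4.replace(".", ":")

# es2028's IPv4 address here is inferred from the mapped address seen in the logs.
print(mapped_v6("2620:0:860:101", "10.192.0.140"))  # -> 2620:0:860:101:10:192:0:140
```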
@jbond do you maybe have more context, or has this already been investigated in similar cases?