
First puppet run after reimage slow (connection timeout)
Open, Medium, Public

Description

@Papaul noticed that in recent reimages in codfw the time for the first puppet run has increased a lot.

Having a look at one reimage from today (es2028) I found a possible culprit, but I can't investigate further right now.

From Puppetboard (search for Connection timed out) we have:

2020-09-10T18:26:04.309+00:00	/Stage[main]/Base::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]	info, ferm, file, ferm::conf, conf, defs, class, base::firewall, base, firewall, profile::base::firewall, profile, role::spare::system, role, spare, system	Scheduling refresh of Service[ferm]	/etc/puppet/modules/ferm/manifests/conf.pp:14
2020-09-10T18:41:47.467+00:00	/Stage[main]/Base::Firewall/Ferm::Conf[main]/File[/etc/ferm/conf.d/02_main]	err, ferm, file, ferm::conf, conf, main, class, base::firewall, base, firewall, profile::base::firewall, profile, role::spare::system, role, spare, system	Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/firewall/main-input-default-drop.conf: Connection timed out	/etc/puppet/modules/ferm/manifests/conf.pp:14

The failed step (the second line) failed about 15 minutes after the previous one.
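As a sanity check, the size of the stall can be computed directly from the two Puppetboard timestamps above; this is just a worked example in Python using the timestamps already shown, nothing else is assumed:

# Gap between the refresh scheduling and the failed file_metadata fetch.
from datetime import datetime

fmt = "%Y-%m-%dT%H:%M:%S.%f%z"
start = datetime.strptime("2020-09-10T18:26:04.309+00:00", fmt)
fail = datetime.strptime("2020-09-10T18:41:47.467+00:00", fmt)
print(fail - start)  # 0:15:43.158000 -- roughly a 15-minute connect timeout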

The related apache log line on puppetmaster2001 (frontend) is:

2020-09-10T18:26:04     33054   2620:0:860:101:be97:e1ff:fe57:da84      proxy-server/200        259     GET     http://puppet:8140/puppet/v3/file_metadata/modules/base/firewall/main-input-default-drop.conf?environment=production&links=manage&checksum_type=md5&source_permissions=ignore   -       application/json        -       -       Puppet/5.5.10 Ruby/2.5.5-p157 (x86_64-linux-gnu)        -       -       -       -       2620:0:860:101:be97:e1ff:fe57:da84      -       -

Note that the client's source is the autoconfigured (SLAAC) IPv6 address.

And the same on puppetmaster2003 (backend) should be:

2020-09-10T18:26:04     32184   2620:0:860:101:10:192:0:27      -/200   259     GET     http://puppetmaster2003.codfw.wmnet:8141/puppet/v3/file_metadata/modules/base/firewall/main-input-default-drop.conf?environment=production&links=manage&checksum_type=md5&source_permissions=ignore     -       application/json        -       2620:0:860:101:be97:e1ff:fe57:da84      Puppet/5.5.10 Ruby/2.5.5-p157 (x86_64-linux-gnu)        -       -       -       -       2620:0:860:101:10:192:0:27      -       -

That's the last request from that IP; the ones after it are made from the mapped IPv6 address 2620:0:860:101:10:192:0:140.
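To make the address change easier to follow, here is a small Python sketch of how the mapped IPv6 address appears to be built, inferred purely from the addresses in the logs above; the real interface::add_ip6_mapped implementation may differ, and es2028's IPv4 address of 10.192.0.140 is an assumption based on the mapped address:

# Illustrative only: reuse the IPv4 octets verbatim as the last four groups of
# the host's IPv6 address within its /64 prefix.
def mapped_ipv6(prefix: str, ipv4: str) -> str:
    groups = ipv4.split(".")               # e.g. ["10", "192", "0", "140"]
    return prefix + ":" + ":".join(groups)

print(mapped_ipv6("2620:0:860:101", "10.192.0.140"))
# -> 2620:0:860:101:10:192:0:140, the source address seen once the token is applied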

On es2028 we have:

Sep 10 18:26:04 es2028 puppet-agent[1196]: (/Stage[main]/Profile::Standard/Interface::Add_ip6_mapped[main]/Augeas[eno3_v6_token]/returns) executed successfully
Sep 10 18:26:04 es2028 puppet-agent[1196]: (/Stage[main]/Profile::Standard/Interface::Add_ip6_mapped[main]/Exec[eno3_v6_token]/returns) executed successfully

That matches perfectly and explains the IP change. My understanding is that applying the mapped IP prevents the subsequent requests to the puppetmaster from succeeding, and a 15-minute timeout stalls the whole run.
@jbond maybe you have more context and/or this has already been investigated in similar cases?

Event Timeline

Volans triaged this task as Medium priority. Sep 10 2020, 8:32 PM
Volans created this task.

From what I can tell that's a known issue / longstanding behaviour with enabling the mapped address: the puppet run stalls and times out on the "ifup" call, which does the actual address change. When enabling it manually on servers I usually Ctrl-C'ed the ifup command and re-ran puppet (which fixes it), but non-interactively it simply runs into the timeout. I doubt there's much to be done as long as we use the classic ifupdown networking, though. That's one of the many perks we'll get when we eventually ditch it in favour of systemd-networkd.

@MoritzMuehlenhoff would forcing the first puppet run to use the v4 address, which does not change, be an acceptable workaround?
I did a quick test: using --sourceaddress $IPv4, the Apache logs on the puppetmaster report the v4 address.

That's a good idea, AFAICT that should reliably bypass the issue!
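For illustration, a minimal sketch of the proposed workaround in Python; the actual wmf-auto-reimage change is the patch referenced below, and everything here except --sourceaddress (the address lookup and the other agent flags) is an assumption:

# Pin the first puppet run to the host's IPv4 address so the source IP does not
# change mid-run when the mapped IPv6 address is brought up.
import socket
import subprocess

ipv4 = socket.gethostbyname(socket.getfqdn())  # host's own v4 address (sketch)
subprocess.run(
    ["puppet", "agent", "--onetime", "--no-daemonize", "--sourceaddress", ipv4],
    check=True,
)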

Change 626553 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] wmf-auto-reimage: first Puppet run, force IPv4

https://gerrit.wikimedia.org/r/626553

Change 626553 merged by Volans:
[operations/puppet@production] wmf-auto-reimage: first Puppet run, force IPv4

https://gerrit.wikimedia.org/r/626553

From what I can tell that's a known issue / longstanding behaviour with enabling the mapped address: the puppet run stalls and times out on the "ifup" call, which does the actual address change. When enabling it manually on servers I usually Ctrl-C'ed the ifup command and re-ran puppet (which fixes it), but non-interactively it simply runs into the timeout. I doubt there's much to be done as long as we use the classic ifupdown networking, though. That's one of the many perks we'll get when we eventually ditch it in favour of systemd-networkd.

Yes, this is exactly my experience, and the workaround sounds sane to me.

@Volans do you know if this is still an issue?

@jbond, no idea if this is still happening; I guess we could look at a bunch of puppet run logs from the reimages and see if any stand out.
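A rough sketch of that check, scanning saved puppet agent logs for long stalls between consecutive lines; the log locations and the syslog-style timestamp format are assumptions for illustration:

# Flag puppet logs from reimages whose largest gap between consecutive
# timestamped lines exceeds 10 minutes (the symptom described in this task).
import re
import sys
from datetime import datetime, timedelta

TS = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2})")

def max_gap(path):
    prev, worst = None, timedelta(0)
    for line in open(path, errors="replace"):
        m = TS.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%b %d %H:%M:%S")
        if prev is not None and ts > prev:
            worst = max(worst, ts - prev)
        prev = ts
    return worst

for logfile in sys.argv[1:]:
    gap = max_gap(logfile)
    if gap > timedelta(minutes=10):
        print(f"{logfile}: largest stall {gap}")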