
First puppet run after reimage slow (connection timeout)
Open, Medium, Public

Description

@Papaul noticed that in recent reimages in codfw the time for the first puppet run has increased a lot.

Having a look at one reimage from today (es2028) I found a possible culprit, but I can't investigate further right now.

From Puppetboard (search for Connection timed out) we have:

2020-09-10T18:26:04.309+00:00	/Stage[main]/Base::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]	info, ferm, file, ferm::conf, conf, defs, class, base::firewall, base, firewall, profile::base::firewall, profile, role::spare::system, role, spare, system	Scheduling refresh of Service[ferm]	/etc/puppet/modules/ferm/manifests/conf.pp:14
2020-09-10T18:41:47.467+00:00	/Stage[main]/Base::Firewall/Ferm::Conf[main]/File[/etc/ferm/conf.d/02_main]	err, ferm, file, ferm::conf, conf, main, class, base::firewall, base, firewall, profile::base::firewall, profile, role::spare::system, role, spare, system	Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/firewall/main-input-default-drop.conf: Connection timed out	/etc/puppet/modules/ferm/manifests/conf.pp:14

The failed step (the second line) failed about 15 minutes after the previous one.
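As a sanity check, the size of the stall can be computed directly from the two Puppetboard timestamps above; this is just a worked example in Python using the timestamps already shown, nothing else is assumed:

# Gap between the refresh scheduling and the failed file_metadata fetch.
from datetime import datetime

fmt = "%Y-%m-%dT%H:%M:%S.%f%z"
start = datetime.strptime("2020-09-10T18:26:04.309+00:00", fmt)
fail = datetime.strptime("2020-09-10T18:41:47.467+00:00", fmt)
print(fail - start)  # 0:15:43.158000 -- roughly a 15-minute connect timeout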

The related apache log line on puppetmaster2001 (frontend) is:

2020-09-10T18:26:04     33054   2620:0:860:101:be97:e1ff:fe57:da84      proxy-server/200        259     GET     http://puppet:8140/puppet/v3/file_metadata/modules/base/firewall/main-input-default-drop.conf?environment=production&links=manage&checksum_type=md5&source_permissions=ignore   -       application/json        -       -       Puppet/5.5.10 Ruby/2.5.5-p157 (x86_64-linux-gnu)        -       -       -       -       2620:0:860:101:be97:e1ff:fe57:da84      -       -

Note that the client's source is the autoconfigured (SLAAC) IPv6 address.

And the same on puppetmaster2003 (backend) should be:

2020-09-10T18:26:04     32184   2620:0:860:101:10:192:0:27      -/200   259     GET     http://puppetmaster2003.codfw.wmnet:8141/puppet/v3/file_metadata/modules/base/firewall/main-input-default-drop.conf?environment=production&links=manage&checksum_type=md5&source_permissions=ignore     -       application/json        -       2620:0:860:101:be97:e1ff:fe57:da84      Puppet/5.5.10 Ruby/2.5.5-p157 (x86_64-linux-gnu)        -       -       -       -       2620:0:860:101:10:192:0:27      -       -

That's the last request from that IP; the ones after it are made from the mapped IPv6 address 2620:0:860:101:10:192:0:140.
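To make the address change easier to follow, here is a small Python sketch of how the mapped IPv6 address appears to be built, inferred purely from the addresses in the logs above; the real interface::add_ip6_mapped implementation may differ, and es2028's IPv4 address of 10.192.0.140 is an assumption based on the mapped address:

# Illustrative only: reuse the IPv4 octets verbatim as the last four groups of
# the host's IPv6 address within its /64 prefix.
def mapped_ipv6(prefix: str, ipv4: str) -> str:
    groups = ipv4.split(".")               # e.g. ["10", "192", "0", "140"]
    return prefix + ":" + ":".join(groups)

print(mapped_ipv6("2620:0:860:101", "10.192.0.140"))
# -> 2620:0:860:101:10:192:0:140, the source address seen once the token is applied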

On es2028 we have:

Sep 10 18:26:04 es2028 puppet-agent[1196]: (/Stage[main]/Profile::Standard/Interface::Add_ip6_mapped[main]/Augeas[eno3_v6_token]/returns) executed successfully
Sep 10 18:26:04 es2028 puppet-agent[1196]: (/Stage[main]/Profile::Standard/Interface::Add_ip6_mapped[main]/Exec[eno3_v6_token]/returns) executed successfully

That matches perfectly and explains the IP change. My understanding is that applying the mapped IP prevents the subsequent requests to the puppetmaster from succeeding, and a 15-minute timeout stalls the whole run.
@jbond maybe you have more context and/or this has already been investigated in similar cases?

Event Timeline

Volans triaged this task as Medium priority. Sep 10 2020, 8:32 PM
Volans created this task.

From what I can tell that's a known issue / longstanding behaviour with enabling the mapped address: the puppet run stalls and times out on the "ifup" call, which does the actual address change. When enabling it manually on servers I usually Ctrl-C'ed the ifup command and re-ran puppet (which fixes it), but non-interactively it simply runs into the timeout. I doubt there's much to be done as long as we use the classic ifupdown networking, though. That's one of the many perks we'll get when we eventually ditch it in favour of systemd-networkd.

@MoritzMuehlenhoff would forcing the first puppet run to use the v4 address, which does not change, be an acceptable workaround?
I did a quick test: using --sourceaddress $IPv4, the Apache logs on the puppetmaster report the v4 address.

That's a good idea, AFAICT that should reliably bypass the issue!
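For illustration, a minimal sketch of the proposed workaround in Python; the actual wmf-auto-reimage change is the patch referenced below, and everything here except --sourceaddress (the address lookup and the other agent flags) is an assumption:

# Pin the first puppet run to the host's IPv4 address so the source IP does not
# change mid-run when the mapped IPv6 address is brought up.
import socket
import subprocess

ipv4 = socket.gethostbyname(socket.getfqdn())  # host's own v4 address (sketch)
subprocess.run(
    ["puppet", "agent", "--onetime", "--no-daemonize", "--sourceaddress", ipv4],
    check=True,
)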

Change 626553 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] wmf-auto-reimage: first Puppet run, force IPv4

https://gerrit.wikimedia.org/r/626553

Change 626553 merged by Volans:
[operations/puppet@production] wmf-auto-reimage: first Puppet run, force IPv4

https://gerrit.wikimedia.org/r/626553

From what I can tell that's a known issue / longstanding behaviour with enabling the mapped address: the puppet run stalls and times out on the "ifup" call, which does the actual address change. When enabling it manually on servers I usually Ctrl-C'ed the ifup command and re-ran puppet (which fixes it), but non-interactively it simply runs into the timeout. I doubt there's much to be done as long as we use the classic ifupdown networking, though. That's one of the many perks we'll get when we eventually ditch it in favour of systemd-networkd.

Yes, this is exactly my experience, and the workaround sounds sane to me.

@Volans do you know if this is still an issue?

@jbond, no idea if this is still happening; I guess we could look at a bunch of puppet run logs from the reimages and see if any stand out.
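A rough sketch of that check, scanning saved puppet agent logs for long stalls between consecutive lines; the log locations and the syslog-style timestamp format are assumptions for illustration:

# Flag puppet logs from reimages whose largest gap between consecutive
# timestamped lines exceeds 10 minutes (the symptom described in this task).
import re
import sys
from datetime import datetime, timedelta

TS = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2})")

def max_gap(path):
    prev, worst = None, timedelta(0)
    for line in open(path, errors="replace"):
        m = TS.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%b %d %H:%M:%S")
        if prev is not None and ts > prev:
            worst = max(worst, ts - prev)
        prev = ts
    return worst

for logfile in sys.argv[1:]:
    gap = max_gap(logfile)
    if gap > timedelta(minutes=10):
        print(f"{logfile}: largest stall {gap}")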