Puppet failures on integration-agent instances - unable to connect to integration-puppetserver-01
Closed, Resolved (Public)

Description

Noted a bunch of Cloud VPS alerts when checking mail this morning:

Puppet is having issues on the "integration-agent-docker-1053.integration.eqiad1.wikimedia.cloud (172.16.3.222)" instance in project integration in Wikimedia Cloud VPS.

Puppet is running with failures.
...

ERR: Connection to https://integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140/puppet/v3 failed, trying next route: Request to https://integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140/puppet/v3 failed after 3.083 seconds: Failed to open TCP connection to integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140 (No route to host - connect(2) for "integration-puppetserver-01.integration.eqiad1.wikimedia.cloud" port 8140)

Seems to be ongoing:

brennen@integration-agent-docker-1053:~$ sudo run-puppet-agent
Info: Using environment 'production'
Error: Connection to https://integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140/puppet/v3 failed, trying next route: Request to https://integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140/puppet/v3 failed after 3.076 seconds: Failed to open TCP connection to integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140 (No route to host - connect(2) for "integration-puppetserver-01.integration.eqiad1.wikimedia.cloud" port 8140)
Wrapped exception:
Failed to open TCP connection to integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140 (No route to host - connect(2) for "integration-puppetserver-01.integration.eqiad1.wikimedia.cloud" port 8140)
Error: No more routes to fileserver
Info: Loading facts
Error: Connection to https://integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140/puppet/v3 failed, trying next route: Request to https://integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140/puppet/v3 failed after 0.358 seconds: Failed to open TCP connection to integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140 (No route to host - connect(2) for "integration-puppetserver-01.integration.eqiad1.wikimedia.cloud" port 8140)
Wrapped exception:
Failed to open TCP connection to integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140 (No route to host - connect(2) for "integration-puppetserver-01.integration.eqiad1.wikimedia.cloud" port 8140)
Error: Could not retrieve catalog from remote server: No more routes to puppet
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Error: Connection to https://integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140/puppet/v3 failed, trying next route: Request to https://integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140/puppet/v3 failed after 3.057 seconds: Failed to open TCP connection to integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140 (No route to host - connect(2) for "integration-puppetserver-01.integration.eqiad1.wikimedia.cloud" port 8140)
Wrapped exception:
Failed to open TCP connection to integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140 (No route to host - connect(2) for "integration-puppetserver-01.integration.eqiad1.wikimedia.cloud" port 8140)
Error: Could not send report: No more routes to report
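
Every failure above reduces to the same thing: the agent cannot open a TCP connection to port 8140 on the puppetserver. A minimal sketch of checking that in isolation, without running the full agent, might look like the following. The helper name `can_reach` and its defaults are illustrative and not part of any Wikimedia tooling; only the port comes from the errors above.

```python
import socket
from typing import Tuple


def can_reach(host: str, port: int = 8140, timeout: float = 3.0) -> Tuple[bool, str]:
    """Attempt a plain TCP connect and return (reachable, detail).

    Hypothetical helper for illustration; 8140 is the port shown in the
    agent errors above.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # OSError covers "No route to host", connection refused,
        # timeouts, and DNS resolution failures.
        return False, f"{type(exc).__name__}: {exc}"
```

Calling `can_reach("integration-puppetserver-01.integration.eqiad1.wikimedia.cloud")` from an affected agent would be expected to return `False` with the underlying OS error while the server is down. This only tests the TCP layer, so a `True` result does not prove the Puppet service itself is healthy.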

Event Timeline

Mentioned in SAL (#wikimedia-releng) [2025-08-04T14:40:11Z] <brennen> attempting soft reboot in horizon interface on integration-puppetserver-01 (T401123)

Unable to SSH to the puppetserver. The Horizon log tab for integration-puppetserver-01 shows:

[5131058.073822] Out of memory: Killed process 2281119 (java) total-vm:6296688kB, anon-rss:3430724kB, file-rss:0kB, shmem-rss:0kB, UID:104 pgtables:7344kB oom_score_adj:0
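
To put the numbers in that kernel message into more familiar units, here is a rough sketch that parses the line and converts the sizes to GiB. The regex, the function name `parse_oom_kill`, and the returned field names are made up for this example, and it assumes the kernel's "kB" means units of 1024 bytes.

```python
import re

# Illustrative pattern for the "Out of memory: Killed process" line quoted above.
OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)

SAMPLE = (
    "[5131058.073822] Out of memory: Killed process 2281119 (java) "
    "total-vm:6296688kB, anon-rss:3430724kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:104 pgtables:7344kB oom_score_adj:0"
)


def parse_oom_kill(line):
    """Return pid, process name, and sizes in GiB, or None if no match."""
    m = OOM_RE.search(line)
    if m is None:
        return None
    kib_per_gib = 1024 * 1024  # assumes kernel "kB" is 1024 bytes
    return {
        "pid": int(m["pid"]),
        "name": m["name"],
        "total_vm_gib": int(m["total_vm_kb"]) / kib_per_gib,
        "anon_rss_gib": int(m["anon_rss_kb"]) / kib_per_gib,
    }


info = parse_oom_kill(SAMPLE)
print(f"{info['name']} (pid {info['pid']}): "
      f"anon-rss {info['anon_rss_gib']:.2f} GiB, "
      f"total-vm {info['total_vm_gib']:.2f} GiB")
# prints: java (pid 2281119): anon-rss 3.27 GiB, total-vm 6.00 GiB
```

So the killed java process was holding roughly 3.3 GiB of resident anonymous memory when the kernel stepped in.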

Trying a reboot.

After reboot:

brennen@integration-agent-docker-1053:~$ sudo run-puppet-agent
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for integration-agent-docker-1053.integration.eqiad1.wikimedia.cloud
Info: Applying configuration version '(d60b770ed9) Effie Mouzeli - dsh: remove testservers from scap destinations 1'
Notice: Applied catalog in 10.29 seconds

Seems resolved for the moment, but I'm curious if this'll recur. More memory needed?

Puppet 7 servers never fail to surprise me with how much RAM they want. I'd definitely start by doubling the RAM on that VM before investigating anything else.

Makes sense - I can bump to a g4.cores4.ram8.disk20. Trying to remember if there are any footguns with attached volumes or anything like that when resizing...

Mentioned in SAL (#wikimedia-releng) [2025-08-04T16:08:34Z] <brennen> integration: resize integration-puppetserver-01 to g4.cores4.ram8.disk20; confirmed restart and successful puppet run from an integration agent (T401123)

brennen claimed this task.
brennen moved this task from Radar to Done or Declined on the User-brennen board.