
Connection errors from puppetmaster1002 to puppetdb
Closed, Resolved · Public

Description

On 2024-02-21, starting at around 22:30 UTC, Puppet was failing on servers using Puppet 5 in eqiad with an error like this:

jmm@es1031:~$ sudo puppet agent -tv
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: Error 500 on SERVER: Server Error: Could not retrieve facts for es1031.eqiad.wmnet: Failed to find facts from PuppetDB at puppet:8140: undefined method `content' for nil:NilClass
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: undefined method `content' for nil:NilClass
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

A restart of Apache and a reboot of puppetmaster1002 did not help. Eventually puppetmaster1002 was taken out of rotation via https://gerrit.wikimedia.org/r/c/operations/puppet/+/1005708

Event Timeline


A restart of Apache and a reboot of puppetmaster1002 did not help.

These restarts probably had different effects. The Apache restart (8:17 UTC) seems to have lowered the Puppet failures a bit, and the reboot afterwards (8:42 UTC) made the failures worse again.

https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&from=1708549978316&to=1708593178316&viewPanel=6

That is probably a red herring, because Puppet was disabled and we did not force an eqiad-wide Puppet run. But it may be worth investigating (Apache restart vs. VM restart, issues with Apache starting properly).
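To line up the restart times with the graph, it can help to decode the time window encoded in the Grafana link above. The following is a minimal sketch, assuming the `from` and `to` query parameters are Unix timestamps in milliseconds; the function name `grafana_window` is made up for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def grafana_window(url):
    """Return (start, end) as UTC datetimes from a Grafana dashboard URL.

    Assumes 'from' and 'to' are absolute epoch timestamps in milliseconds
    (relative values such as 'now-6h' are not handled).
    """
    query = parse_qs(urlparse(url).query)
    # Integer-divide by 1000 to drop the millisecond part and avoid
    # floating-point rounding in the resulting datetime.
    return tuple(
        datetime.fromtimestamp(int(query[key][0]) // 1000, tz=timezone.utc)
        for key in ("from", "to")
    )
```

Applied to the link above, this yields a 12-hour window from 2024-02-21 21:12:58 UTC to 2024-02-22 09:12:58 UTC, which covers both the start of the failures and the two restarts.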

Definitely kind of strange. IP connectivity between these hosts is ok:

cmooney@es1031:~$ ping puppetmaster1002 
PING puppetmaster1002(puppetmaster1002.eqiad.wmnet (2620:0:861:107:10:64:48:45)) 56 data bytes
64 bytes from puppetmaster1002.eqiad.wmnet (2620:0:861:107:10:64:48:45): icmp_seq=1 ttl=63 time=0.248 ms

cmooney@es1031:~$ ping -4 puppetmaster1002 
PING  (10.64.48.45) 56(84) bytes of data.
64 bytes from puppetmaster1002.eqiad.wmnet (10.64.48.45): icmp_seq=1 ttl=63 time=0.176 ms
64 bytes from puppetmaster1002.eqiad.wmnet (10.64.48.45): icmp_seq=2 ttl=63 time=0.290 ms

cmooney@puppetmaster1002:~$ ping puppetdb1003
PING puppetdb1003(puppetdb1003.eqiad.wmnet (2620:0:861:102:10:64:16:87)) 56 data bytes
64 bytes from puppetdb1003.eqiad.wmnet (2620:0:861:102:10:64:16:87): icmp_seq=1 ttl=63 time=0.272 ms
64 bytes from puppetdb1003.eqiad.wmnet (2620:0:861:102:10:64:16:87): icmp_seq=2 ttl=63 time=0.311 ms

cmooney@puppetmaster1002:~$ ping -4 puppetdb1003
PING puppetdb1003.eqiad.wmnet (10.64.16.87) 56(84) bytes of data.
64 bytes from puppetdb1003.eqiad.wmnet (10.64.16.87): icmp_seq=1 ttl=63 time=0.298 ms
64 bytes from puppetdb1003.eqiad.wmnet (10.64.16.87): icmp_seq=2 ttl=63 time=0.300 ms

cmooney@puppetmaster1002:~$ ping -4 puppetdb2003
PING puppetdb2003.codfw.wmnet (10.192.48.75) 56(84) bytes of data.
64 bytes from puppetdb2003.codfw.wmnet (10.192.48.75): icmp_seq=1 ttl=61 time=30.3 ms

cmooney@puppetmaster1002:~$ ping  puppetdb2003
PING puppetdb2003(puppetdb2003.codfw.wmnet (2620:0:860:104:10:192:48:75)) 56 data bytes
64 bytes from puppetdb2003.codfw.wmnet (2620:0:860:104:10:192:48:75): icmp_seq=1 ttl=61 time=30.4 ms

Looking at puppetmaster1002, it is not listening on TCP 8040; Apache on it is not configured for that. puppetmaster1001 is listening on that port (due to /etc/apache2/sites-available/50-puppetmaster1001-eqiad-wmnet.conf). But attempted connections to puppetmaster1002 on those ports do make it to the host (and are then dropped by iptables or because nothing is listening).
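Since ping only exercises ICMP, a TCP-level probe is a closer match to what the Puppet traffic does. The sketch below is one hedged way to tell "something is listening" apart from "connection refused" and "silently dropped"; the helper name `probe_tcp` and the default timeout are assumptions, not something taken from the hosts involved.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the outcome.

    Returns 'open' if the handshake completes, 'refused' if the peer
    answers with a reset (nothing listening, or a rejecting firewall),
    'timeout' if no answer arrives in time (typical for a firewall that
    drops packets), or 'error: ...' for anything else (e.g. DNS failure).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

For example, `probe_tcp("puppetmaster1002.eqiad.wmnet", 8140)` could be run from an affected client, using the port from the error message in the description. A "timeout" result would be consistent with packets being dropped by iptables, while "refused" would suggest nothing is listening on that port.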

Connection errors from puppetmaster1002 to puppetdb

I don't see any active connections from puppetmaster1001 to either of the puppetdb hosts when checking just now. What kind of connections are involved here (i.e. TCP/UDP, ports, etc.)?

If it definitely seems like a network thing we could try to capture a pcap, but I guess we'd need to re-enable puppetmaster1002 for that?

When I looked at this last night as the alerts were coming in, I noticed that some hosts were not reporting the connection failure but simply the content error; example https://puppetboard.wikimedia.org/report/parse1004.eqiad.wmnet/ef81872fca86746fbbd87800da1da74b64d3839b.

My suspicion that this is related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003112 came from the timing: as far as I saw (maybe I overlooked something), all such errors started after this patch was merged. Again, I want to be careful saying this because I am still not 100% sure, but the timing and the error ($settings being {}) seem to match up?

MoritzMuehlenhoff claimed this task.

We never got to the bottom of this error. It was likely a hardware issue (given that none of the other servers showed the same behaviour), and puppetmaster1002 has now been decommissioned, so resolving.