frack / passive icinga checks: Errors connecting to icinga2001.wikimedia.org
Closed, Resolved (Public)

Description

Just noticed in /var/log/messages a lot of:

Dec 10 23:03:24 frpm1001 nagios_nsca[308]: STDERR [208.80.153.74] Error: Could not read init packet from server
Dec 10 23:03:25 frpm1001 nagios_nsca[308]: STDERR [208.80.153.74] Error: Server closed connection before init packet was received

Seeing it in both eqiad and codfw hosts. Server responds to ping, but I don't have keys for ssh. @Dzahn - anything you are aware of?
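To see which server the errors point at and when they first appear in a given log file, the lines can be tallied with a short script. This is only a sketch: the regular expression is modeled on the two log lines quoted above, and the function name `summarize_nsca_errors` is made up for illustration. Other nsca message formats are not covered.

```python
import re
from collections import Counter

# Pattern modeled on the two syslog lines quoted in the description
# (an assumption; other nsca message formats will not match).
NSCA_ERROR = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+nagios_nsca\[\d+\]:\s+"
    r"STDERR\s+\[(?P<server>[\d.]+)\]\s+Error:\s+(?P<msg>.*\S)"
)


def summarize_nsca_errors(lines):
    """Count nsca errors per server IP and record the first timestamp seen.

    Returns (counts, first_seen): a Counter keyed by server IP and a dict
    mapping each server IP to the timestamp of its first matching line.
    """
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = NSCA_ERROR.match(line)
        if match is None:
            continue  # not an nsca error line
        server = match.group("server")
        counts[server] += 1
        # setdefault keeps the earliest timestamp because lines are read in order
        first_seen.setdefault(server, match.group("ts"))
    return counts, first_seen


# Example use against the log file mentioned above (needs read access):
#     with open("/var/log/messages", errors="replace") as fh:
#         counts, first_seen = summarize_nsca_errors(fh)
#     print(counts, first_seen)
```

Running it over rotated logs as well would show how far back the errors go, which is the question raised further down in the thread.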

Event Timeline

icinga2001 is considered a standby and not active, but that doesn't mean it should drop the packets, I would say.
I will take a look.

Dzahn renamed this task from Errors connecting to icinga2001.wikimedia.org to frack / passive icinga checks: Errors connecting to icinga2001.wikimedia.org. Dec 10 2018, 11:37 PM
Dzahn added a project: Icinga.
Dzahn added a project: SRE.
Dzahn triaged this task as Medium priority. Dec 10 2018, 11:39 PM

Normal priority: the production checks are working; this is about the standby host dropping packets from send_nsca for passive checks.

@cwdent is it new and started at a certain point, or has it been like that for a while? I don't see differences in iptables between icinga1001 and icinga2001 for this. Both allow the frack subnets for the nsca port. The password for decryption is also the same on both, and the status says nsca is running. Is it possible the difference is in pfw? For example, a firewall hole for the icinga server that exists only for eqiad and not for codfw?

Mentioned in SAL (#wikimedia-operations) [2018-12-11T00:15:43Z] <mutante> icinga2001 - killed all nagios processes, restarted nsca service, something is different from icinga1001, service failed when trying to restart (T211641)

@Dzahn it goes back as far as current syslog, I'll dig back and see when it started. FWIW I can telnet to 208.80.154.84 5667 and it works, but 208.80.153.74 says Connection Refused. iptables and pfw both look to be open.
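The telnet test can be repeated for both servers in one go with a small script. This is a minimal sketch: the helper name `check_tcp` and the 3-second timeout are assumptions, while the port and the two IPs are the ones mentioned in this comment. It only checks whether the TCP connection is accepted, not whether nsca then sends its init packet.

```python
import socket

NSCA_PORT = 5667  # port used in the telnet test above


def check_tcp(host, port=NSCA_PORT, timeout=3.0):
    """Try a plain TCP connect and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing is listening, or a firewall rejects with a TCP reset
        return "refused"
    except socket.timeout:
        # No answer at all, typical for a firewall that silently drops packets
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


# Example use with the two addresses from this comment:
#     for ip in ("208.80.154.84", "208.80.153.74"):
#         print(ip, check_tcp(ip))
```

A "refused" result while the firewall rules look open suggests the daemon itself is not listening, whereas a "timeout" would point more towards a firewall dropping the traffic.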

Did anything change now? I killed and restarted nsca on icinga2001.

From both icinga1001 and icinga2001 I can telnet to both IPs on port 5667, that is, from each host to itself and to the other one.

@Dzahn yep looks good now! Thanks :)