
Icinga is randomly losing connectivity to maps1002
Closed, Resolved · Public

Description

Icinga seems to be randomly losing connectivity to maps1002.eqiad.wmnet. maps1002 is not yet configured and only has the default puppet role applied. It is in scheduled downtime in Icinga, so those alerts are not sent. This server had some issues during initial installation (see T135018#2372770), which might or might not be related. I don't see anything strange in dmesg. /var/log/syslog indicates that diamond also has intermittent connection issues with Graphite.

This leads me to think that this is a more generic networking issue for that host.

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.

Some more investigation with @RobH:

  • ethtool reports the link is at 1 Gb/s, no issue
  • Icinga failures are spread across the whole day, so the issue is probably not related to anyone working in the rack
  • I experienced a few SSH disconnections (packet_write_wait: Connection to UNKNOWN: Broken pipe)
  • no iptables rules are configured

@Cmjohnson could you try changing the eth cable of that server? Thanks!

Swapped the cable today. Let me know if it gets better.

@Cmjohnson Thanks! I'll keep an eye on Icinga history for that host. We'll see if it gets better...

The connectivity issue continues after switching the cable. We'll need to find another cause. @Cmjohnson sorry for the bother!

Gehel added a subscriber: faidon.

Thanks to @faidon, it seems the issue is an IP address conflict with ores1002. ores1002 seems to have been decommissioned, but not wiped or shut down. @Cmjohnson, is that something that you could have a look into?

4:39 PM <paravoid> faidon@re1.cr1-eqiad> show arp no-resolve | match 10.64.16.42
4:39 PM <paravoid> 14:18:77:33:4a:d2 10.64.16.42 ae2.1018 none
4:39 PM <paravoid> faidon@re0.cr2-eqiad> show arp no-resolve | match 10.64.16.42
4:39 PM <paravoid> 1c:98:ec:21:88:b4 10.64.16.42 ae2.1018 none
4:39 PM <paravoid> faidon@asw-b-eqiad> show ethernet-switching table | match 14:18:77:33:4a:d2
4:39 PM <paravoid> private1-b-eqiad 14:18:77:33:4a:d2 Learn 0 ge-4/0/7.0
4:40 PM <paravoid> faidon@asw-b-eqiad> show ethernet-switching table | match 1c:98:ec:21:88:b4
4:40 PM <paravoid> private1-b-eqiad 1c:98:ec:21:88:b4 Learn 0 ge-4/0/2.0
4:40 PM <paravoid> ge-4/0/2 up up maps1002
4:40 PM <paravoid> ge-4/0/7 up up ores1002
4:40 PM <paravoid> commit 8327eed35090807d89ce62db853cc1e375ca1738
4:40 PM <paravoid> Removing dns entries for ores1001 and ores1002
4:40 PM <paravoid> -ores1001 1H IN A 10.64.0.12
4:40 PM <paravoid> -ores1002 1H IN A 10.64.16.42
4:41 PM <paravoid> so, the server was removed from DNS
4:41 PM <paravoid> but not actually wiped or even turned off?
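
For future reference, the check in the log above (the same IP answering with a different MAC on each router) could be scripted roughly as follows. This is only a sketch: the `ARP_OUTPUT` dictionary and the `find_ip_conflicts` helper are hypothetical names, and it assumes the ARP lines have already been collected from the routers in the "MAC IP interface flags" layout shown in the paste. It does not talk to any device itself.

```python
from collections import defaultdict

# ARP lines copied from the log above, keyed by the router they came from.
# Assumed layout per line: "<MAC> <IP> <interface> <flags>".
ARP_OUTPUT = {
    "cr1-eqiad": "14:18:77:33:4a:d2 10.64.16.42 ae2.1018 none",
    "cr2-eqiad": "1c:98:ec:21:88:b4 10.64.16.42 ae2.1018 none",
}


def find_ip_conflicts(arp_output):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC."""
    macs_by_ip = defaultdict(set)
    for text in arp_output.values():
        for line in text.splitlines():
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or truncated lines
            mac, ip = fields[0].lower(), fields[1]
            macs_by_ip[ip].add(mac)
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


conflicts = find_ip_conflicts(ARP_OUTPUT)
for ip, macs in conflicts.items():
    print(f"{ip} is claimed by {len(macs)} MACs: {', '.join(macs)}")
```

With the two lines from the paste this reports 10.64.16.42 as claimed by both 14:18:77:33:4a:d2 (ores1002, ge-4/0/7) and 1c:98:ec:21:88:b4 (maps1002, ge-4/0/2).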

Thanks @Cmjohnson! I'll check the alert history again to make sure no new issues are seen, and I'll close this task if all seems OK.

No new issues seen in the Icinga history, all looks good, closing. Thanks @faidon and @Cmjohnson for your help!