
ms-be2050 shows network errors
Closed, Resolved · Public

Description

I noticed from Icinga alerts that ms-be2050 had network reachability issues. A quick check of the logs shows that puppet, rsyslog, etc. all started reporting connectivity problems around Dec 30th 10:24 UTC.

The only interesting thing I found was the LibreNMS graph of errors registered on the asw-d-codfw side, which clearly shows a regression around that time (the other traffic graphs also show a change in behavior).

Event Timeline

elukey created this task. Sat, Jan 2, 10:21 AM
Restricted Application added a project: SRE. Sat, Jan 2, 10:21 AM
Restricted Application added a subscriber: Aklapper.
elukey added a comment. Sun, Jan 3, 8:51 AM

This is very strange: /var/log/swift shows the host logging requests, and pings to other ms-be hosts in codfw work, but TCP connections to the puppet master, for example, fail:

elukey@ms-be2050:~$ telnet -4 puppet.eqiad.wmnet 8140
Trying 10.64.16.73...
telnet: Unable to connect to remote host: Connection timed out
elukey@ms-be2050:~$ telnet -6 puppet.eqiad.wmnet 8140
Trying 2620:0:861:102:10:64:16:73...
telnet: Unable to connect to remote host: Network is unreachable
elukey@ms-be2050:~$ ping puppetmaster1001.eqiad.wmnet
PING puppetmaster1001.eqiad.wmnet (10.64.16.73) 56(84) bytes of data.
^C
--- puppetmaster1001.eqiad.wmnet ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 6146ms

elukey@ms-be2050:~$ ping ms-be2049.codfw.wmnet
PING ms-be2049.codfw.wmnet (10.192.32.13) 56(84) bytes of data.
64 bytes from ms-be2049.codfw.wmnet (10.192.32.13): icmp_seq=1 ttl=63 time=0.123 ms
64 bytes from ms-be2049.codfw.wmnet (10.192.32.13): icmp_seq=2 ttl=63 time=0.108 ms
64 bytes from ms-be2049.codfw.wmnet (10.192.32.13): icmp_seq=3 ttl=63 time=0.075 ms
^C
--- ms-be2049.codfw.wmnet ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2056ms
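The manual telnet checks above could be repeated per address family with a small script. This is only a sketch: `tcp_probe` is a hypothetical helper name, and the 3 second timeout is an arbitrary assumption, much shorter than the kernel default that telnet waits for.

```python
import socket


def tcp_probe(host, port, family, timeout=3.0):
    """Try one TCP connection per resolved address of the given family.

    Returns a list of (address, outcome) tuples, where outcome is "ok"
    or the text of the error that was raised.
    """
    results = []
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # No address of this family could be resolved for the host.
        return [(host, f"resolution failed: {exc}")]
    for fam, socktype, proto, _canon, sockaddr in infos:
        with socket.socket(fam, socktype, proto) as sock:
            sock.settimeout(timeout)
            try:
                sock.connect(sockaddr)
                outcome = "ok"
            except OSError as exc:  # timeout, unreachable, refused, ...
                outcome = str(exc) or type(exc).__name__
        results.append((sockaddr[0], outcome))
    return results
```

Calling `tcp_probe("puppet.eqiad.wmnet", 8140, socket.AF_INET)` and then the same with `socket.AF_INET6` would mirror the `telnet -4` / `telnet -6` pair, and makes it easy to see whether one family fails with a timeout and the other with "Network is unreachable".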
elukey added a comment. Sun, Jan 3, 8:58 AM

Something might be messed up in the network config: I see strange routing for v6 (no G flags, for example):

elukey@ms-be2050:~$ sudo route -n -6
Kernel IPv6 routing table
Destination                    Next Hop                   Flag Met Ref Use If
2620:0:860:104::/64            ::                         U    256 0     0 eno1
fe80::/64                      ::                         U    256 0     0 eno1
::/0                           ::                         !n   -1  1 34099 lo
::1/128                        ::                         Un   0   41 68839 lo
2620:0:860:104:10:192:48:117/128 ::                         Un   0   2 11613 lo
fe80::b226:28ff:fe1a:5508/128  ::                         Un   0   2 11266 lo
ff00::/8                       ::                         U    256 1 87613 eno1
::/0                           ::                         !n   -1  1 34099 lo

elukey@ms-be2050:~$ sudo route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.192.48.1     0.0.0.0         UG    0      0        0 eno1
10.192.48.0     0.0.0.0         255.255.252.0   U     0      0        0 eno1
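The "no G flags" observation can be checked mechanically across hosts. The sketch below parses `route -n -6` text and reports whether any `::/0` row is up, goes via a gateway, and is not a reject (`!`) route. The function names are hypothetical, and the parsing assumes the whitespace-separated column layout shown above.

```python
def default_routes(route_output):
    """Return (next_hop, flags, iface) for every ::/0 row in `route -n -6` output."""
    routes = []
    for line in route_output.splitlines():
        fields = line.split()
        # Data rows start with the destination; header lines never equal "::/0".
        if len(fields) >= 4 and fields[0] == "::/0":
            routes.append((fields[1], fields[2], fields[-1]))
    return routes


def has_usable_default(route_output):
    """True if some ::/0 row is up (U), via a gateway (G), and not a reject (!) route."""
    return any(
        "U" in flags and "G" in flags and "!" not in flags
        for _hop, flags, _iface in default_routes(route_output)
    )
```

Fed the table from ms-be2050, `has_usable_default` returns False because the only `::/0` rows are the `!n` entries on `lo`.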

Mentioned in SAL (#wikimedia-operations) [2021-01-03T09:07:12Z] <elukey> reboot ms-be2050 as attempt to recover/fix its broken networking state (started from Dec 30th) - T271041

elukey added a comment. Sun, Jan 3, 9:26 AM

Something changed:

  • puppet now runs on ipv4
  • swift container availability showed a recovery.

But the same issue persists with IPv6.

elukey added a comment. Edited Sun, Jan 3, 6:33 PM

Some notes after tests:

  1. I don't see Router Advertisements in tcpdump captures on ms-be2050, but I do see them on all other nodes. I don't recall whether the default gateway is set via RA or via static config (I don't see any in /etc/network/interfaces).
  2. All nodes ms-be2* have something like ::/0 fe80::1 UGDAe 1024 40143690 eno1 listed in route -n -6, except our dear ms-be2050.
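To make the Router Advertisement check in the notes above less dependent on eyeballing tcpdump output, here is a minimal sketch that decodes the fixed RA header (ICMPv6 type 134, RFC 4861 section 4.2) from raw ICMPv6 bytes. It assumes the caller has already extracted the ICMPv6 message from a captured packet; `parse_router_advertisement` is a hypothetical name. A router lifetime of zero means the sender should not be used as a default router, so a host only installs a default route from an RA with a non-zero lifetime.

```python
import struct

ICMPV6_ROUTER_ADVERTISEMENT = 134  # RFC 4861, section 4.2


def parse_router_advertisement(icmpv6):
    """Parse the fixed 16-byte RA header; return None if this is not an RA."""
    if len(icmpv6) < 16 or icmpv6[0] != ICMPV6_ROUTER_ADVERTISEMENT:
        return None
    # type, code, checksum, cur hop limit, flags, router lifetime,
    # reachable time, retrans timer (all in network byte order)
    (_type, _code, _cksum, hop_limit, flags,
     lifetime, _reachable, _retrans) = struct.unpack("!BBHBBHII", icmpv6[:16])
    return {
        "cur_hop_limit": hop_limit,
        "managed": bool(flags & 0x80),       # M flag
        "other_config": bool(flags & 0x40),  # O flag
        "router_lifetime_s": lifetime,
        "is_default_router": lifetime > 0,
    }
```

If no captured message on the host ever parses as an RA, that would be consistent with the missing `::/0 fe80::1` route, assuming the default route on these hosts is indeed learned from RAs rather than configured statically.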
elukey added a comment. Sun, Jan 3, 6:49 PM

I am out of ideas; the next thing I'd check is whether the fiber between the switch and the host needs to be replaced.

Mentioned in SAL (#wikimedia-operations) [2021-01-04T09:02:40Z] <XioNoX> bounce asw-d-codfw:xe-7/0/8 - T271041

ayounsi added a subscriber: ayounsi. Mon, Jan 4, 9:18 AM

Symptoms are a bit similar to T269313, but I don't think it's the same issue, as the switch port is showing dropped multicast traffic for no reason.

asw-d-codfw> show interfaces queue xe-7/0/8
Queue: 8, Forwarding classes: mcast
  Queued:
    Packets              :                     0                     0 pps
    Bytes                :                     0                     0 bps
  Transmitted:
    Packets              :                     0                     0 pps
    Bytes                :                     0                     0 bps
    Tail-dropped packets : Not Available  
    RL-dropped packets   :                     0                     0 pps
    RL-dropped bytes     :                     0                     0 bps
    Total-dropped packets:                   835                     5 pps      <------
    Total-dropped bytes  :                 66928                  3568 bps      <------

As the host is using a DAC, I'd say try replacing it first, then try a different switch port.
If the issue still happens: upgrade the host NIC.
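For comparing the queue counters before and after a cable or port change, the drop lines can be pulled out of the `show interfaces queue` text. This is a sketch only: it assumes the `name : count rate unit` layout pasted above, and `parse_drop_counters` is a hypothetical helper, not part of any Juniper tooling.

```python
import re

# Matches lines such as "Total-dropped packets:   835   5 pps".
COUNTER_RE = re.compile(
    r"^\s*(?P<name>[A-Za-z][A-Za-z\- ]*?)\s*:\s+"
    r"(?P<count>\d+)\s+(?P<rate>\d+)\s+(?P<unit>pps|bps)\b"
)


def parse_drop_counters(output):
    """Return {counter name: (total, rate, unit)} for every '*dropped*' line."""
    counters = {}
    for line in output.splitlines():
        match = COUNTER_RE.match(line)
        # Lines reading "Not Available" carry no numbers and are skipped.
        if match and "dropped" in match.group("name"):
            counters[match.group("name")] = (
                int(match.group("count")),
                int(match.group("rate")),
                match.group("unit"),
            )
    return counters
```

Running it on output captured before and after the change gives two dictionaries that can be compared directly; a non-zero `Total-dropped packets` rate on the mcast queue is the symptom flagged above.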

elukey added a subscriber: Papaul.

@Papaul Hi! happy new year :)

When you are in, can you ping me or Filippo to swap the DAC between ms-be2050 and asw-d-codfw?

elukey triaged this task as Medium priority. Mon, Jan 4, 9:25 AM
Papaul closed this task as Resolved. Mon, Jan 4, 3:29 PM
Papaul claimed this task.
Queue: 8, Forwarding classes: mcast
  Queued:
    Packets              :                     0                     0 pps
    Bytes                :                     0                     0 bps
  Transmitted:
    Packets              :                  2948                     5 pps
    Bytes                :               1627890                  3400 bps
    Tail-dropped packets : Not Available
    RL-dropped packets   :                     0                     0 pps
    RL-dropped bytes     :                     0                     0 bps
    Total-dropped packets:                     0                     0 pps
    Total-dropped bytes  :                     0                     0 bps
--- puppetmaster1001.eqiad.wmnet ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 31.757/31.804/31.830/0.227 ms

DAC cable replaced

elukey awarded a token. Mon, Jan 4, 3:55 PM