Cr1-eqiad comms problem when moving to 40G row D handoff
Closed, ResolvedPublic

Description

The planned move from the 4x10G links between cr1-eqiad and asw2-d-eqiad to a single 40G link has now been attempted twice, and we have had problems on both occasions.

The first attempt was documented in the incident report. Following that another attempt was made to perform the move today, Oct 11th, taking greater care to double-check interface status, ensure rollback was ready etc.

Once again the same issue was encountered. While the physical interfaces were up and looked healthy on both sides, with MAC addresses being learnt on the switch etc., problems were observed once the IP addressing was added to each vlan sub-interface on the router.

Specifically it seems that traffic destined for directly connected hosts on the row D vlans was not making it from CR1 to the hosts. Outbound traffic from hosts in row D still worked, as cr2-eqiad remained VRRP master / gateway for each subnet. But traffic destined for the hosts would not make it if that traffic routed via cr1-eqiad (due to how the routing works some traffic for the subnets would route via cr2-eqiad, and thus also work in that direction).

A few initial checks were done this evening to try to isolate where the problem was.

Test 1: cr1-eqiad to row D host using test subnet

We configured IP 198.18.0.1/30 (from RFC2544 test range) on cr1-eqiad, ae4.1004 (public1-d-eqiad). And then added 198.18.0.2/30 as secondary IP to virtual machine doh1002, which is on that vlan (not interfering with its primary IP of course).

With this in place we could ping fine from cr1-eqiad to doh1002 on the test IPs. This tells us that the new 40G link, optics and switch config etc. are all fine and traffic should be able to flow over the interface to hosts.

Test 2: cr1-eqiad to host check

As things appeared to work with a test / new subnet we wanted to use IP addressing that was part of our production range, to see if the specific ranges in question were part of the problem.

To do this we assigned unused IP address 10.64.48.4/31 to ae4.1020 of cr1-eqiad. This is a smaller network that overlaps with the 10.64.48.0/22 range on the private1-d-eqiad subnet. Neither of the IPs in the /31 were already in use, making it safe to apply.
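
As a quick sanity check of the addressing described above, the following minimal sketch uses Python's standard `ipaddress` module to confirm that the /31 sits inside the production /22 and to list the two addresses it contains. The variable names are illustrative only and are not part of any tooling used in this task.

```python
import ipaddress

# Production subnet on private1-d-eqiad and the smaller test network from the text.
production = ipaddress.ip_network("10.64.48.0/22")
test_link = ipaddress.ip_network("10.64.48.4/31")

# The /31 is fully contained in the /22, so it is a more specific overlapping prefix.
overlaps = test_link.subnet_of(production)
# A /31 holds exactly two addresses, one for each end of the test.
addresses = [str(a) for a in test_link]

print(overlaps)    # True
print(addresses)   # ['10.64.48.4', '10.64.48.5']
```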

With this in place 10.64.48.5/31 was configured on sretest1001. With both of these up comms between the devices worked fine, we could ping 10.64.48.5 from 10.64.48.4 on cr1.

After this was done the 10.64.48.5 IP was added as a secondary IP on cr2-eqiad ae4.1020. When this IP was on cr2-eqiad pings could be made between cr1-eqiad (10.64.48.4) and cr2-eqiad (10.64.48.5) without problem.

Test 3: add unicast IP from private1-d-eqiad to cr1-eqiad, but don't enable VRRP

To verify if the issue was related to participation in VRRP (cr1 would not have taken over as master regardless, but in case this was somehow happening), the normal unicast IP of cr1 on this subnet, 10.64.48.2/22 was added to ae4.1020.

As soon as this config was applied problems again became apparent, so a quick rollback was done. Anticipating potential problems several pre-prepared checks were done while the config was on cr1, however, which shed some light on the problem state.

  • Observation 1: CR1 appears to send ARP requests for hosts on the subnet, but does not report any responses from those hosts.

When the connected subnet is added to cr1 we expect it will send ARP requests for host IPs on the same subnet. Running a monitor traffic command while the problem was present showed that cr1 was apparently generating such requests:

cmooney@re0.cr1-eqiad> monitor traffic interface ae4.1020 no-resolve layer2-headers matching arp    
Oct 11 17:50:36
verbose output suppressed, use <detail> or <extensive> for full protocol decode
Address resolution is OFF.
Listening on ae4.1020, capture size 96 bytes

17:50:36.098478  In 84:18:88:0d:df:c9 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 60: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.41 tell 10.64.48.3
17:50:39.481942 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.2 tell 10.64.48.2
17:50:39.489991 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.105 tell 10.64.48.2
17:50:39.490115 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.230 tell 10.64.48.2
17:50:39.490198 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.229 tell 10.64.48.2
17:50:39.490269 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.64 tell 10.64.48.2
17:50:39.490394 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.21 tell 10.64.48.2
17:50:39.490532 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.52 tell 10.64.48.2
17:50:39.490710 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.106 tell 10.64.48.2
17:50:39.490801 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.50 tell 10.64.48.2
17:50:39.490868 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.113 tell 10.64.48.2
17:50:39.491016 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.95 tell 10.64.48.2
17:50:39.510238 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.131 tell 10.64.48.2
17:50:39.510351 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.225 tell 10.64.48.2
17:50:39.510374 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.210 tell 10.64.48.2
17:50:39.510393 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.83 tell 10.64.48.2
17:50:39.510411 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.96 tell 10.64.48.2
17:50:39.510429 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.14 tell 10.64.48.2
17:50:39.510446 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.82 tell 10.64.48.2
17:50:39.510464 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.78 tell 10.64.48.2
17:50:39.510481 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.132 tell 10.64.48.2
17:50:39.510577 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.216 tell 10.64.48.2
17:50:39.510673 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.84 tell 10.64.48.2
17:50:39.510977 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.217 tell 10.64.48.2
17:50:39.511000 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.13 tell 10.64.48.2
17:50:39.511112 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.220 tell 10.64.48.2
17:50:39.511133 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.219 tell 10.64.48.2
17:50:39.511150 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.213 tell 10.64.48.2
17:50:39.511168 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.223 tell 10.64.48.2
17:50:39.511185 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.221 tell 10.64.48.2
17:50:39.511202 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.202 tell 10.64.48.2
17:50:39.511219 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.199 tell 10.64.48.2
17:50:39.511352 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.195 tell 10.64.48.2
17:50:39.511463 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.193 tell 10.64.48.2
17:50:39.511492 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.103 tell 10.64.48.2
17:50:39.511673 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.109 tell 10.64.48.2
17:50:39.511704 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.206 tell 10.64.48.2
17:50:39.511843 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.25 tell 10.64.48.2
17:50:39.511888 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.228 tell 10.64.48.2
17:50:39.511908 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.66 tell 10.64.48.2
17:50:39.512081 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.129 tell 10.64.48.2
17:50:39.512104 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.19 tell 10.64.48.2
17:50:39.512122 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.31 tell 10.64.48.2
17:50:39.512215 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.226 tell 10.64.48.2
17:50:39.512238 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.214 tell 10.64.48.2
17:50:39.512288 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.222 tell 10.64.48.2
17:50:39.512372 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.91 tell 10.64.48.2
17:50:39.512456 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.176 tell 10.64.48.2
17:50:39.512479 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.137 tell 10.64.48.2
17:50:39.512496 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.204 tell 10.64.48.2
17:50:39.512777 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.104 tell 10.64.48.2
17:50:39.513019 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.33 tell 10.64.48.2
17:50:39.514103 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.92 tell 10.64.48.2
17:50:39.514237 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.154 tell 10.64.48.2
17:50:39.514257 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.101 tell 10.64.48.2
17:50:39.514275 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.85 tell 10.64.48.2
17:50:39.514292 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.93 tell 10.64.48.2
17:50:39.514314 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.70 tell 10.64.48.2
17:50:39.514407 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.35 tell 10.64.48.2
17:50:39.514430 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.86 tell 10.64.48.2
17:50:39.514460 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.90 tell 10.64.48.2
17:50:39.514544 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.56 tell 10.64.48.2
17:50:39.514566 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.201 tell 10.64.48.2
17:50:39.514588 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.215 tell 10.64.48.2
<-- rest omitted -->

What is interesting to observe here is that there are no "In" packets with arp replies from any of these hosts. This results in the ARP table for the interface being mostly empty:

cmooney@re0.cr1-eqiad# run show arp no-resolve interface ae4.1020 
MAC Address       Address         Interface                Flags
84:18:88:0d:df:c9 10.64.48.3      ae4.1020                 none
bc:97:e1:c0:17:30 10.64.48.66     ae4.1020                 none

Ideally a packet capture / tcpdump would have been done on a connected host to see if the ARPs cr1 reports sending actually made it to end hosts or not. This should definitely be checked in any further tests.
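
The In/Out asymmetry in the capture above can also be tallied mechanically rather than by eye. The sketch below is only an illustration: it assumes the line layout shown in the `monitor traffic` output in this task, the regular expression and function name are made up for the example, and the sample is the first three lines of that capture.

```python
import re
from collections import Counter

# First three lines of the capture shown above.
SAMPLE = """\
17:50:36.098478  In 84:18:88:0d:df:c9 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 60: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.41 tell 10.64.48.3
17:50:39.481942 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.2 tell 10.64.48.2
17:50:39.489991 Out 5c:5e:ab:3d:87:c4 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1020, p 0, ethertype ARP, arp who-has 10.64.48.105 tell 10.64.48.2
"""

# Assumed layout: timestamp, direction, "src > dst,", then "arp who-has" or "arp reply".
LINE_RE = re.compile(r"^\S+\s+(In|Out)\s+\S+ > \S+,.*\barp (who-has|reply)\b")

def tally_arp(capture: str) -> Counter:
    """Count ARP packets per (direction, opcode) pair."""
    counts = Counter()
    for line in capture.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

counts = tally_arp(SAMPLE)
print(counts[("Out", "who-has")])  # 2
print(counts[("In", "reply")])     # 0
```

A zero count for `("In", "reply")` over a full capture would match what was seen here: requests leave cr1, but no replies come back.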

  • Observation 2: CR1 responded to ARP requests and pings to 10.64.48.2 when it was configured

For instance using arping from sretest1001 responses were received from cr1-eqiad when it had 10.64.48.2/22 configured on ae4.1020:

ARPING 10.64.48.2 from 10.64.48.138 eno1
Unicast reply from 10.64.48.2 [5c:5e:ab:3d:87:c4] 0.872ms
Unicast reply from 10.64.48.2 [5c:5e:ab:3d:87:c4] 0.845ms
Sent 2 probe(s) (0 broadcast(s))
Received 2 response(s) (0 request(s), 0 broadcast(s))

  • Observation 3: Routing from CR2 to hosts on these subnets was not affected

Testing from lvs1016, on the private1-d-eqiad vlan, it was observed that traffic was not affected to remote hosts which were on vlans using cr2-eqiad as gateway (i.e. cr2 was VRRP master). Traffic to remote subnets that used cr1-eqiad as gateway did not get any response, however.

cmooney@lvs1016:~$ sudo traceroute -I -w 1 netmon1002.wikimedia.org
traceroute to netmon1002.wikimedia.org (208.80.154.5), 30 hops max, 60 byte packets
 1  ae4-1020.cr2-eqiad.wikimedia.org (10.64.48.3)  0.275 ms  0.271 ms  0.302 ms
 2  * * *
 3  * * *
 4  * * *
cmooney@lvs1016:~$ 
cmooney@lvs1016:~$ sudo traceroute -I -w 1 208.80.154.30 
traceroute to alert1001.wikimedia.org (208.80.154.88), 30 hops max, 60 byte packets
 1  ae4-1020.cr2-eqiad.wikimedia.org (10.64.48.3)  0.512 ms  0.509 ms  0.554 ms
 2  alert1001.wikimedia.org (208.80.154.88)  0.189 ms  0.199 ms  0.197 ms

The explanation here is that when traffic routes back for the 10.64.48.0/22 subnet via cr2 it makes it, but when it routes back via cr1 it doesn't. It also confirms that the issue is with cr1 forwarding to hosts on its directly connected interface, and is not related to any routing changes that occur when CR1 announces the ranges via OSPF after they are applied.

Further validating that the issue is within CR1, pings from lvs1016 to CR1's loopback interface were observed to stop when the range was applied on cr1-eqiad ae4.1020, i.e. the issue is that cr1-eqiad cannot transmit traffic out to hosts directly on the subnet.

Further Tests

As observed cr1-eqiad does not seem to be able to ARP for hosts on the directly connected networks when the physical links in ae4 are changed from the 4x10G bundle to 1x40G. However it does respond to hosts which send ARP requests to it during this time.

As could be seen above there were certain entries in the ae4.1020 ARP table while the problem was occurring. One theory is that CR1 does properly add entries to the ARP table when it receives an arp request from a host on the subnet, but for whatever reason fails to do so when an end host sends it an arp response following requests it generated itself.

Some of the tests, like pinging with the smaller subnet or test networks, did not appear to work at first for us, but then did. We weren't paying attention to which side initiated pings etc. (and thus ARP request vs ARP response), but this may be a factor. As cr2 is VRRP master on this subnet most hosts will have no reason to send ARPs for cr1's IP (10.64.48.2), so for it to build the ARP table for these it needs to process responses from end hosts. The few devices it could ping were perhaps reachable because those devices had first ARP'd for it. To be tested further.

Event Timeline

cmooney changed the task status from Open to In Progress.
cmooney triaged this task as High priority.
cmooney updated the task description.
ayounsi added a subscriber: Jclark-ctr.

Change 843414 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Move all eqiad VRRP mastership to cr2

https://gerrit.wikimedia.org/r/843414

Change 843414 merged by Ayounsi:

[operations/homer/public@master] Move all eqiad VRRP mastership to cr2

https://gerrit.wikimedia.org/r/843414

Mentioned in SAL (#wikimedia-operations) [2022-10-17T08:55:14Z] <XioNoX> Move all eqiad VRRP mastership to cr2 - T320566

Mentioned in SAL (#wikimedia-operations) [2022-10-17T09:09:04Z] <XioNoX> de-pref cr1-eqiad wavelength transports (to codfw and drmrs) - T320566

Change 843416 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Drain eqiad-drmrs GTT link

https://gerrit.wikimedia.org/r/843416

Change 843416 merged by jenkins-bot:

[operations/homer/public@master] Drain eqiad-drmrs GTT link

https://gerrit.wikimedia.org/r/843416

Mentioned in SAL (#wikimedia-operations) [2022-10-17T09:24:31Z] <XioNoX> de-pref eqiad-drmrs GTT VPLS (latency between eqiad and drmrs will increase) - T320566

Mentioned in SAL (#wikimedia-operations) [2022-10-17T10:27:42Z] <XioNoX> disable cr1-eqiad:ae4 for recabling and troubleshooting - T320566

Mentioned in SAL (#wikimedia-operations) [2022-10-17T11:11:15Z] <XioNoX> cr1-eqiad> request chassis fpc slot 1 offline - T320566

Mentioned in SAL (#wikimedia-operations) [2022-10-17T12:36:54Z] <XioNoX> re-enable BGP between cr1 and lsw1-e1 - T320566

Myself and @ayounsi were able to narrow down the issue a bit more during testing yesterday.

It seems the issue is that asw2-d2-eqiad is not forwarding ARP broadcasts or NDP multicasts it receives on QSFP port et-2/0/49 to the other ports belonging to the same vlan.

Quoting the task description: "One theory is that CR1 does properly add entries to the ARP table when it receives an arp request from a host on the subnet, but for whatever reason fails to do so when an end host sends it an arp response following requests it generated itself."

This turned out to be correct. The switch is transmitting ARP requests from end-hosts out over this port to cr1. When that happens the CR can populate its own ARP/neighbour table with the MAC for the host's IP, and the unicast frames it sends over the port are forwarded correctly by the switch to the host.

Current Test Setup

cr1-eqiad

Right now we have et-1/1/3 on cr1-eqiad configured as follows, with test IPs configured from the RFC2544 Test range and RFC3849 IPv6 documentation range (neither publicly routable). It has been removed from any AE/LACP bundle, and no vlan-tagging is configured on the link:

cmooney@re0.cr1-eqiad> show configuration interfaces et-1/1/3                  
description "Test: 40G to asw2-d2-eqiad et-2/0/49 - T320566";
unit 0 {
    family inet {
        address 198.18.0.1/30;
    }
    family inet6 {
        address 2001:db8::1/64;
    }
}
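
The test addressing in the config above can be checked against the reserved ranges it is drawn from. This is a minimal sketch using Python's standard `ipaddress` module. The range constants are the well-known RFC 2544 benchmarking block (198.18.0.0/15) and RFC 3849 documentation prefix (2001:db8::/32), and the variable names are illustrative only.

```python
import ipaddress

BENCHMARK_V4 = ipaddress.ip_network("198.18.0.0/15")  # RFC 2544 benchmarking range
DOC_V6 = ipaddress.ip_network("2001:db8::/32")        # RFC 3849 documentation range

# Addresses configured on cr1-eqiad et-1/1/3 in the config above.
test_v4 = ipaddress.ip_interface("198.18.0.1/30")
test_v6 = ipaddress.ip_interface("2001:db8::1/64")

# Both test networks should fall inside their reserved parent ranges.
in_ranges = (
    test_v4.network.subnet_of(BENCHMARK_V4),
    test_v6.network.subnet_of(DOC_V6),
)
# The /30 leaves two usable addresses: one for the CR, one for the far end.
usable_v4 = [str(h) for h in test_v4.network.hosts()]

print(in_ranges)   # (True, True)
print(usable_v4)   # ['198.18.0.1', '198.18.0.2']
```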

asw2-d-eqiad

We have made port et-2/0/49, connecting to the cr, an access port in the private1-d-eqiad vlan as follows:

cmooney@asw2-d-eqiad> show configuration interfaces et-2/0/49 
description "Core: cr1-eqiad:et-1/1/3 {#G2204190495000072}";
unit 0 {
    family ethernet-switching {
        interface-mode access;
        vlan {
            members private1-d-eqiad;
        }
    }
}

sretest1001

Sretest1001 is on row D so we have added IPs from the test subnets configured on the CR:

cmooney@sretest1001:~$ ip -br addr show eno1
eno1             UP             10.64.48.138/22 198.18.0.2/30 2001:db8::10:64:48:138/64 2001:db8::2/64 2620:0:861:107:10:64:48:138/64 fe80::d294:66ff:fe5f:6720/64

Tests

Test 1. Ping sretest1001 (198.18.0.2) from cr1-eqiad through asw2-d2-eqiad vlan private1-d-eqiad

With the above config in place this should work. The CR will first send an ARP broadcast to get the MAC of the device using 198.18.0.2, which sretest1001 should receive and respond to. After this the CR should be able to send unicast ICMP echo packets to sretest1001, which should reply in turn.

First check things look as expected in terms of routing, and verify the device mac address to use in the capture filters:

cmooney@re0.cr1-eqiad> show route 198.18.0.2 table inet.0 

inet.0: 894548 destinations, 4040878 routes (893497 active, 0 holddown, 5379 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

198.18.0.0/30      *[Direct/0] 00:05:59
                    >  via et-1/1/3.0
cmooney@re0.cr1-eqiad> show arp no-resolve interface et-1/1/3.0 

{master}
cmooney@re0.cr1-eqiad>
cmooney@re0.cr1-eqiad> show interfaces et-1/1/3 | match address 
  Current address: 5c:5e:ab:3d:81:a8, Hardware address: 5c:5e:ab:3d:81:a8
cmooney@sretest1001:~$ ip -br link show eno1
eno1             UP             d0:94:66:5f:67:20 <BROADCAST,MULTICAST,UP,LOWER_UP>

Now start the ping, and monitor traffic on the interface:

cmooney@re0.cr1-eqiad> ping 198.18.0.2 source 198.18.0.1 count 3 
PING 198.18.0.2 (198.18.0.2): 56 data bytes

--- 198.18.0.2 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss

We can see the ARP requests outbound from cr on the local interface:

cmooney@re0.cr1-eqiad> monitor traffic interface et-1/1/3 no-resolve layer2-headers matching "ether host 5c:5e:ab:3d:81:a8 or (ether host d0:94:66:5f:67:20 and not host 10.64.48.138)"   
verbose output suppressed, use <detail> or <extensive> for full protocol decode
Address resolution is OFF.
Listening on et-1/1/3, capture size 96 bytes

19:09:21.562466 Out 5c:5e:ab:3d:81:a8 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: arp who-has 198.18.0.2 tell 198.18.0.1
19:09:22.249116 Out 5c:5e:ab:3d:81:a8 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: arp who-has 198.18.0.2 tell 198.18.0.1

But looking on sretest1001 it never receives these broadcasts:

cmooney@sretest1001:~$ sudo tcpdump -e -i eno1 -l -nn "ether host 5c:5e:ab:3d:81:a8 or ((ether host d0:94:66:5f:67:20 and (arp or icmp)) and not host 10.64.48.138)"
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes

If we instead ping the CR's IP from sretest1001, it works ok. Looking at the captures we can see the ARP request sretest1001 sends makes it to the CR, and its (unicast) response is returned without problem:

cmooney@sretest1001:~$ ping -c 4 -I 198.18.0.2 198.18.0.1
PING 198.18.0.1 (198.18.0.1) from 198.18.0.2 : 56(84) bytes of data.
64 bytes from 198.18.0.1: icmp_seq=1 ttl=64 time=1.58 ms
64 bytes from 198.18.0.1: icmp_seq=2 ttl=64 time=0.661 ms
cmooney@re0.cr1-eqiad> monitor traffic interface et-1/1/3 no-resolve size 1500 layer2-headers matching "arp or icmp"                                                    
verbose output suppressed, use <detail> or <extensive> for full protocol decode
Address resolution is OFF.
Listening on et-1/1/3, capture size 1500 bytes

19:18:28.805438  In d0:94:66:5f:67:20 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: arp who-has 198.18.0.1 tell 198.18.0.2
19:18:28.805467 Out 5c:5e:ab:3d:81:a8 > d0:94:66:5f:67:20, ethertype ARP (0x0806), length 42: arp reply 198.18.0.1 is-at 5c:5e:ab:3d:81:a8
19:18:28.806226  In PFE proto 2 (ipv4): 198.18.0.2 > 198.18.0.1: ICMP echo request, id 32914, seq 1, length 64
19:18:28.806236 Out 5c:5e:ab:3d:81:a8 > d0:94:66:5f:67:20, ethertype IPv4 (0x0800), length 98: 198.18.0.1 > 198.18.0.2: ICMP echo reply, id 32914, seq 1, length 64
cmooney@sretest1001:~$ sudo tcpdump -e -i eno1 -l -nn "ether host 5c:5e:ab:3d:81:a8 or ((ether host d0:94:66:5f:67:20 and (arp or icmp)) and not host 10.64.48.138)"
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
19:13:35.674386 d0:94:66:5f:67:20 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 198.18.0.1 tell 198.18.0.2, length 28
19:13:35.674972 5c:5e:ab:3d:81:a8 > d0:94:66:5f:67:20, ethertype ARP (0x0806), length 60: Reply 198.18.0.1 is-at 5c:5e:ab:3d:81:a8, length 46
19:13:35.674987 d0:94:66:5f:67:20 > 5c:5e:ab:3d:81:a8, ethertype IPv4 (0x0800), length 98: 198.18.0.2 > 198.18.0.1: ICMP echo request, id 58309, seq 1, length 64
19:13:35.675492 5c:5e:ab:3d:81:a8 > d0:94:66:5f:67:20, ethertype IPv4 (0x0800), length 98: 198.18.0.1 > 198.18.0.2: ICMP echo reply, id 58309, seq 1, length 64

The ARP entry for the host's MAC is correctly populated on cr1-eqiad, on receipt of the ARP request sent by the host:

cmooney@re0.cr1-eqiad> show arp interface et-1/1/3.0   
MAC Address       Address         Name                      Interface               Flags
d0:94:66:5f:67:20 198.18.0.2      198.18.0.2                et-1/1/3.0              none

So the problem is that the ARP requests the CR is sending do not make it to other hosts on the connected vlan. What we can't be sure of at this stage is whether these requests are actually transmitted to the switch, or whether something on the CR itself is blocking or mangling them.

Test 2. Re-configure et-2/0/49 on asw2-d2-eqiad as a routed port, and move the IP we had been using on sretest1001 to it.

To try to verify if the problem is the CR failing to send ARP requests out on et-1/1/3, or if the switch is for some reason not forwarding those it receives from the CR, we set the switch up with a routed port with the IP we had been using on sretest1001:

cmooney@asw2-d-eqiad> show configuration interfaces et-2/0/49  
description "Core: cr1-eqiad:et-1/1/3 {#G2204190495000072}";
unit 0 {
    family inet {
        address 198.18.0.2/30;
    }
}
cmooney@asw2-d-eqiad> show arp interface et-2/0/49.0 

{master:7}

We now initiate a ping as before to 198.18.0.2:

cmooney@re0.cr1-eqiad> ping 198.18.0.2 source 198.18.0.1 count 3    
PING 198.18.0.2 (198.18.0.2): 56 data bytes
64 bytes from 198.18.0.2: icmp_seq=0 ttl=64 time=33.981 ms
64 bytes from 198.18.0.2: icmp_seq=1 ttl=64 time=3.945 ms
64 bytes from 198.18.0.2: icmp_seq=2 ttl=64 time=11.514 ms

Looking at our capture on cr1-eqiad, we see something quite different:

cmooney@re0.cr1-eqiad> monitor traffic interface et-1/1/3 no-resolve size 1500 layer2-headers matching "arp or icmp"    
verbose output suppressed, use <detail> or <extensive> for full protocol decode
Address resolution is OFF.
Listening on et-1/1/3, capture size 1500 bytes

19:31:26.374453 Out 5c:5e:ab:3d:81:a8 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: arp who-has 198.18.0.2 tell 198.18.0.1
19:31:26.396311  In 84:c1:c1:82:78:37 > 5c:5e:ab:3d:81:a8, ethertype ARP (0x0806), length 60: arp reply 198.18.0.2 is-at 84:c1:c1:82:78:37
19:31:26.396321 Out 5c:5e:ab:3d:81:a8 > 84:c1:c1:82:78:37, ethertype IPv4 (0x0800), length 98: 198.18.0.1 > 198.18.0.2: ICMP echo request, id 14783, seq 0, length 64
19:31:26.408333  In PFE proto 2 (ipv4): 198.18.0.2 > 198.18.0.1: ICMP echo reply, id 14783, seq 0, length 64

This time the ARP request got an immediate response back in (with the MAC of the switch). We can tell from this that the CR is indeed sending the required ARP requests outbound towards the switch, and they arrive properly formed.

In this case the switch processes the ARP request, as it has the associated IP configured on its interface, and responds itself. By contrast, when the switch port is configured as an access port, and it should flood the CR's ARP frame to all other ports in the vlan, it does not seem to do so. So the problem seems to be on the switch, specifically in forwarding ARP broadcasts that arrive on this port.

In the previous test we observed the cr could populate the arp entry for 198.18.0.2 when the ARP was generated on sretest1001. In other words the switch is forwarding broadcasts to the CR correctly, it's just not forwarding the ones it receives from it. Further we know that unicast frames sent out by the CR are correctly forwarded by the switch, as the ICMP echo replies made it back to sretest1001 without problem.
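
The behaviour seen across the two tests can be summarised with a toy model. This is only a sketch of the fault hypothesis, not a description of how the switch works internally. The `ToySwitch` class, the `arp_resolves` helper and the `host-port` port name are all hypothetical. The model assumes that broadcasts arriving on one port are dropped, while broadcasts from other ports and all unicast frames are forwarded normally.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"

class ToySwitch:
    """Single-vlan switch that drops broadcasts arriving on one faulty port."""

    def __init__(self, ports, broken_ingress=None):
        self.ports = ports                    # port name -> attached MAC
        self.broken_ingress = broken_ingress  # port whose inbound broadcasts are lost

    def deliver(self, ingress_port, dst_mac):
        """Return the ports a frame is sent out of."""
        if dst_mac == BROADCAST:
            if ingress_port == self.broken_ingress:
                return []                     # the suspected fault: no flooding
            return [p for p in self.ports if p != ingress_port]
        # Unicast: forward to the port where the destination MAC lives.
        return [p for p, mac in self.ports.items()
                if mac == dst_mac and p != ingress_port]

def arp_resolves(switch, asker_port, target_port):
    """True if a broadcast ARP request and its unicast reply both get through."""
    asker_mac = switch.ports[asker_port]
    if target_port not in switch.deliver(asker_port, BROADCAST):
        return False
    return asker_port in switch.deliver(target_port, asker_mac)

switch = ToySwitch(
    {"et-2/0/49": "5c:5e:ab:3d:81:a8",   # cr1-eqiad et-1/1/3
     "host-port": "d0:94:66:5f:67:20"},  # sretest1001 eno1 (port name made up)
    broken_ingress="et-2/0/49",
)

cr_initiated = arp_resolves(switch, "et-2/0/49", "host-port")
host_initiated = arp_resolves(switch, "host-port", "et-2/0/49")
print(cr_initiated, host_initiated)  # False True
```

Under those assumptions the model reproduces both observations: resolution started by the CR fails, while resolution started by the host succeeds and leaves the CR able to send unicast frames back.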

IPv6 tests

The same issue was observed with IPv6 neighbour discovery. When we try to ping the test IP on sretest1001 the CR generates neighbor solicitation ICMPv6 multicasts and sends them on the interface, but they are not forwarded to hosts on the attached vlan, and no responses are received:

cmooney@re0.cr1-eqiad> monitor traffic interface et-1/1/3 no-resolve size 1500 layer2-headers matching icmp6                   
verbose output suppressed, use <detail> or <extensive> for full protocol decode
Address resolution is OFF.
Listening on et-1/1/3, capture size 1500 bytes

19:39:40.409088 Out 5c:5e:ab:3d:81:a8 > 33:33:ff:00:00:02, ethertype IPv6 (0x86dd), length 86: 2001:db8::1 > ff02::1:ff00:2: ICMP6, neighbor solicitation, who has 2001:db8::2, length 32
19:39:41.409162 Out 5c:5e:ab:3d:81:a8 > 33:33:ff:00:00:02, ethertype IPv6 (0x86dd), length 86: 2001:db8::1 > ff02::1:ff00:2: ICMP6, neighbor solicitation, who has 2001:db8::2, length 32
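
The destination addresses in the capture above are the solicited-node multicast group for the target, and the Ethernet multicast MAC it maps to. This explains why these frames depend on the switch flooding multicast within the vlan. As a minimal sketch (the function name is illustrative), they can be derived from the target address using the standard rules: the ff02::1:ff00:0/104 prefix plus the low 24 bits of the address, and the 33:33 prefix plus the low 32 bits of the group.

```python
import ipaddress

def solicited_node(addr: str):
    """Return (solicited-node multicast group, Ethernet multicast MAC) for addr."""
    ip = ipaddress.IPv6Address(addr)
    # Solicited-node group: ff02::1:ff00:0/104 with the low 24 bits of the address.
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    group = ipaddress.IPv6Address(base | (int(ip) & 0xFFFFFF))
    # IPv6 multicast MAC: 33:33 followed by the low 32 bits of the group address.
    low32 = (int(group) & 0xFFFFFFFF).to_bytes(4, "big")
    mac = "33:33:" + ":".join(f"{b:02x}" for b in low32)
    return str(group), mac

target = solicited_node("2001:db8::2")
print(target)  # ('ff02::1:ff00:2', '33:33:ff:00:00:02')
```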

But again if we initiate things from the host side neighbour discovery completes, and the ping works:

cmooney@sretest1001:~$ ping -6 -I 2001:db8::2 2001:db8::1
PING 2001:db8::1(2001:db8::1) from 2001:db8::2 : 56 data bytes
64 bytes from 2001:db8::1: icmp_seq=1 ttl=64 time=1.63 ms
64 bytes from 2001:db8::1: icmp_seq=2 ttl=64 time=0.684 ms
cmooney@re0.cr1-eqiad> monitor traffic interface et-1/1/3 no-resolve size 1500 layer2-headers matching icmp6    
verbose output suppressed, use <detail> or <extensive> for full protocol decode
Address resolution is OFF.
Listening on et-1/1/3, capture size 1500 bytes

19:44:31.139234  In PFE proto 6 (ipv6): 2001:db8::2 > ff02::1:ff00:1: ICMP6, neighbor solicitation, who has 2001:db8::1, length 32
19:44:31.139293 Out 5c:5e:ab:3d:81:a8 > d0:94:66:5f:67:20, ethertype IPv6 (0x86dd), length 86: 2001:db8::1 > 2001:db8::2: ICMP6, neighbor advertisement, tgt is 2001:db8::1, length 32
19:44:31.140005  In PFE proto 6 (ipv6): 2001:db8::2 > 2001:db8::1: ICMP6, echo request, seq 1, length 64
19:44:31.140039 Out 5c:5e:ab:3d:81:a8 > d0:94:66:5f:67:20, ethertype IPv6 (0x86dd), length 118: 2001:db8::1 > 2001:db8::2: ICMP6, echo reply, seq 1, length 64

Next steps

The current virtual-chassis configuration on the row D switches complicates further testing, or any steps we might take to resolve this. The issue does not appear to be a configuration problem, and is likely an internal problem with the device's hardware forwarding table/state. A reboot of the device would likely clear this; however, with the virtual-chassis config this is highly disruptive and would take down all traffic for the row.

For the time being we will open a TAC case with Juniper to get their advice on the situation and what we might be able to do.

For now connectivity between cr1-eqiad and asw2-d-eqiad is back on the original 4x10G bundle, so there is no major urgency to resolve this. We also have the 40G port configured in a way that will allow us to test with JTAC and hopefully find a solution that isn't too painful.

As a data point I tried:
asw2-d-eqiad# run request virtual-chassis vc-port set pic-slot 0 member 2 port 49
then
asw2-d-eqiad# run request virtual-chassis vc-port delete pic-slot 0 member 2 port 49

But that didn't solve the issue.
FPC2 reboot or upgrade might be the only fix.

Thanks @ayounsi, was worth a shot :)

I'm thinking we probably proceed as follows:

  1. Perform master switch flip from FPC 7 to FPC 2
request virtual-chassis routing-engine master switch
  2. Reboot FPC2
request system reboot member 2
  3. Reboot or upgrade entire VC.

If we get to step 3, do you think it's better to attempt the upgrade, or just try a reboot? Quite likely either will clear out whatever the problem is. I assume the upgrade would take longer and perhaps carry a bigger risk of some issue during the process?

Just a note, which I should have added previously: Juniper wouldn't provide support because JunOS 14.1 has been out of support since 2018, so they closed the case.

That said, I expect we would probably not have made great progress, and would ultimately have been left with the same options of rebooting etc.

I don't remember the impact of a switchover (e.g. whether it's none or tiny), so it needs to be done carefully. At least the reboot is more explicit (but more impactful too).

If we get to step 3 do you think it's better to attempt the upgrade? Or just try a reboot? Quite likely either will clear out whatever the problem is, the upgrade I assume would take longer and perhaps bigger risk of some issue during the process?

If we have to reboot the whole row, better to upgrade it as well, as the prep work needed will be about the same for a 5 min or a 30 min downtime.

Thanks. Looking at the config and reading some docs, we have it set up so that a switchover should not have any impact:

cmooney@asw2-d-eqiad> show configuration | display set | match "nonstop|grace|synchronize"  
set system commit synchronize
set chassis redundancy graceful-switchover
set routing-options nonstop-routing

As you say, however, we need to proceed cautiously.
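As a pre-check before any switchover, the same test could be scripted. This is only a sketch, assuming the three statements matched above are the ones that matter for a hitless mastership change (that is a reading of the docs, not something verified on this VC). The function name and the `REQUIRED` tuple are hypothetical, and the input is the text of `show configuration | display set`.

```python
# Statements assumed to be needed for a graceful mastership switchover.
REQUIRED = (
    "set system commit synchronize",
    "set chassis redundancy graceful-switchover",
    "set routing-options nonstop-routing",
)


def missing_statements(display_set_output: str) -> list:
    """Return the required statements that are absent from the config text."""
    present = {line.strip() for line in display_set_output.splitlines()}
    return [stmt for stmt in REQUIRED if stmt not in present]


config = """\
set system commit synchronize
set chassis redundancy graceful-switchover
set routing-options nonstop-routing
"""
missing = missing_statements(config)
print("OK to proceed" if not missing else f"missing: {missing}")
```

An empty result only says the statements are configured; it says nothing about how the switchover behaves on this JunOS version, so the caution above still applies.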

Seeing what happened with codfw row B, it's safe to assume that only a reboot of the faulty switch member will be needed and sufficient.

The impact of rebooting D2 (and maybe D7) to solve the parent task should be put in perspective against doing a full row upgrade, which might be needed later on anyway.

akosiaris subscribed.

Removing SRE, the more specific SRE team is already tagged.

ayounsi claimed this task.

With row D upgraded, I couldn't reproduce the issue.