
Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad
Closed, Resolved, Public

Description

We brought some new Elastic hosts into service in T309810.

Unfortunately, there seem to be communication issues.

Hosts in rack F2 (elastic1098 and elastic1099) cannot reach the hosts in rack F3 (elastic1100, elastic1101 and elastic1102).

Inter-cluster communication for these hosts takes place over tcp/9500. Packet captures show both sides trying to contact each other on that port, but the outbound requests to port 9500 never show up on the destination.

ICMP ping and curl requests to port 9400*, which work between other hosts in the cluster, do not work between the F2 and F3 hosts. Until this is resolved, we cannot allow the affected hosts to join the cluster.

Thanks for your time, please let us know if you need more info.

*For testing purposes, elastic1100 is the only F3 host listening on 9400, but ping should be sufficient to verify communication.
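
For reference, the failing checks were roughly of this form (prompts, flags and the URL scheme are illustrative rather than exact invocations):

# From elastic1098 (rack F2) towards elastic1100 (rack F3); no replies seen
bking@elastic1098:~$ ping -c 3 elastic1100.eqiad.wmnet
bking@elastic1098:~$ curl -v --connect-timeout 5 http://elastic1100.eqiad.wmnet:9400/

# Capture on the F3 side shows the tcp/9500 SYNs from F2 never arriving
bking@elastic1100:~$ sudo tcpdump -ni any 'tcp port 9500 and host elastic1098.eqiad.wmnet'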

Event Timeline

bking updated the task description.
RKemper renamed this task from "Possible problem communicating between racks F2 and F3 in EQIAD" to "Possible problem communicating between eqiad elastic hosts in racks F2 and F3". Aug 11 2022, 6:16 PM
RKemper updated the task description.

We checked the ferm rules, which seem to open those ports as expected. I suspect something is going on at a lower networking level; we'll need help from Traffic to check what that is.
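
For reference, since ferm compiles down to iptables rules, the check was roughly of this form (prompt and port pattern are illustrative):

# Confirm the Elasticsearch ports are open in the generated iptables rules
bking@elastic1099:~$ sudo iptables -L -n | grep -E '9[45]00'
bking@elastic1099:~$ sudo iptables-save | grep -E '9[45]00'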

ayounsi triaged this task as High priority. Aug 11 2022, 7:44 PM
ayounsi added a project: netops.
ayounsi added a subscriber: cmooney.

I had a quick look and can't find any smoking gun so far.

The issue seems to be related to VXLAN between lsw1-f2 and lsw1-f3 (all vlans, or the loopbacks, between those 2 devices). It can be narrowed down to this ping failing, from lsw1-f2-eqiad:lo0.1 to lsw1-f3-eqiad:lo0.1:

lsw1-f2-eqiad> ping routing-instance PRODUCTION 10.64.146.9 source 10.64.146.8

By contrast, this works (lsw1-f2-eqiad:lo0.1 to lsw1-e2-eqiad:lo0.1):

lsw1-f2-eqiad> ping routing-instance PRODUCTION 10.64.146.4 source 10.64.146.8

The underlay seems fine between the two devices' loopbacks:

PRODUCTION.inet.0: 50 destinations, 52 routes (50 active, 0 holddown, 0 hidden)

10.64.146.9/32 (1 entry, 1 announced)
TSI:
KRT in-kernel 10.64.146.9/32 -> {composite(1740)}
        *EVPN   Preference: 170/-101
                Next hop type: Indirect, Next hop index: 0
                Address: 0xc6865f4
                Next-hop reference count: 12
                Next hop type: Router, Next hop index: 1738
                Next hop: 10.64.129.22 via et-0/0/55.0, selected
                Session Id: 0x0
                Protocol next hop: 10.64.128.9
                Composite next hop: 0xbd31030 1740 INH Session ID: 0x0
                  VXLAN tunnel rewrite:
                    MTU: 0, Flags: 0x0
                    Encap table ID: 0, Decap table ID: 10
                    Encap VNI: 3005000, Decap VNI: 3005000
                    Source VTEP: 10.64.128.8, Destination VTEP: 10.64.128.9
                    SMAC: a4:e1:1a:81:53:80, DMAC: a4:e1:1a:81:e9:80
                Indirect next hop: 0xc814904 524290 INH Session ID: 0x0
                State: <Active Int Ext VxlanLocalRT>
                Local AS: 64810 
                Age: 15w2d 7:58:53 	Metric2: 16 
                Validation State: unverified 
                Task: PRODUCTION-EVPN-L3-context
                Announcement bits (1): 2-KRT 
                AS path: I  (Originator)
                Cluster list:  10.64.128.0
                Originator ID: 10.64.128.9
                Thread: junos-main 
                Composite next hops: 1
                        Protocol next hop: 10.64.128.9 Metric: 16
                        Composite next hop: 0xbd31030 1740 INH Session ID: 0x0
                          VXLAN tunnel rewrite:
                            MTU: 0, Flags: 0x0
                            Encap table ID: 0, Decap table ID: 10
                            Encap VNI: 3005000, Decap VNI: 3005000
                            Source VTEP: 10.64.128.8, Destination VTEP: 10.64.128.9
                            SMAC: a4:e1:1a:81:53:80, DMAC: a4:e1:1a:81:e9:80
                        Indirect next hop: 0xc814904 524290 INH Session ID: 0x0
                        Indirect path forwarding next hops: 1
                                Next hop type: Router
                                Next hop: 10.64.129.22 via et-0/0/55.0
                                Session Id: 0x0
                                10.64.128.9/32 Originating RIB: inet.0
                                  Metric: 16 Node path count: 1
                                  Forwarding nexthops: 1
                                        Next hop type: Router
                                        Next hop: 10.64.129.22 via et-0/0/55.0
                                        Session Id: 0x0
lsw1-f2-eqiad> ping 10.64.128.9 source 10.64.128.8    
PING 10.64.128.9 (10.64.128.9): 56 data bytes
64 bytes from 10.64.128.9: icmp_seq=0 ttl=63 time=0.935 ms
64 bytes from 10.64.128.9: icmp_seq=1 ttl=63 time=0.610 ms
64 bytes from 10.64.128.9: icmp_seq=2 ttl=63 time=0.764 ms

One surprising point though is that the path through the 2nd spine doesn't show up (only through f1).

Thanks @ayounsi

> One surprising point though is that the path through the 2nd spine doesn't show up (only through f1).

The link from lsw1-e1 to lsw1-f3 is currently down. Light on the 4 lanes is good, but lsw1-f3 reports a local fault on et-0/0/54. Not related, I think, but that is why you only see one next hop used for traffic between them. See T315052.

I've been unable to find a smoking gun here either. The problem is most definitely with VXLAN-encapsulated traffic from lsw1-f3 to lsw1-f2 (or the other direction). Both VTEP devices can send traffic to/from other devices, for instance IRB interfaces on lsw1-e3 or hosts hanging off that switch, so it's not a complete failure of VXLAN on F2/F3. For example:

cmooney@elastic1094:~$ sudo traceroute -I -w 1 elastic1099
traceroute to elastic1099 (10.64.135.6), 30 hops max, 60 byte packets
 1  irb-1033.lsw1-e3-eqiad.eqiad.wmnet (10.64.132.1)  4.862 ms  4.839 ms  4.831 ms
 2  irb-1036.lsw1-f2-eqiad.eqiad.wmnet (10.64.135.1)  3.583 ms  3.577 ms  3.571 ms
 3  elastic1099.eqiad.wmnet (10.64.135.6)  0.129 ms * *
cmooney@elastic1094:~$ sudo traceroute -I -w 1 elastic1100
traceroute to elastic1100 (10.64.136.4), 30 hops max, 60 byte packets
 1  irb-1033.lsw1-e3-eqiad.eqiad.wmnet (10.64.132.1)  6.191 ms  6.174 ms  6.168 ms
 2  irb-1037.lsw1-f3-eqiad.eqiad.wmnet (10.64.136.1)  6.865 ms  6.861 ms  6.857 ms
 3  elastic1100.eqiad.wmnet (10.64.136.4)  0.201 ms * *

I've double-checked all the route and forwarding tables, and the PFE entries, and they look OK. Resolved next-hop groups are the same, VNIs are equal, RMACs shown for encap are in the EVPN database for the equivalent IRBs on the other side, and routes are imported correctly into the PRODUCTION.inet.0 table. Comparison of the two device configs also doesn't show any differences: what should be the same is, and the identifiers that ought to be different are.
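
For the record, the verification was roughly along these lines on both lsw1-f2-eqiad and lsw1-f3-eqiad (indicative commands only, not a verbatim session log; output omitted):

lsw1-f2-eqiad> show route table PRODUCTION.inet.0 10.64.146.9/32 extensive
lsw1-f2-eqiad> show route forwarding-table destination 10.64.146.9
lsw1-f2-eqiad> show evpn database | match a4:e1:1a:81:e9:80
lsw1-f2-eqiad> show ethernet-switching vxlan-tunnel-end-point remote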

I've cleared the EVPN BGP sessions from both leaf devices to try to force reprogramming of the routing tables, but there was no change in behaviour.
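
For the record, that was roughly of this form on each leaf, with the EVPN peer address left as a placeholder rather than the actual value:

lsw1-f2-eqiad> clear bgp neighbor <evpn peer address>
lsw1-f3-eqiad> clear bgp neighbor <evpn peer address>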

I'll open a TAC case with Juniper in the morning and continue troubleshooting.

cmooney renamed this task from "Possible problem communicating between eqiad elastic hosts in racks F2 and F3" to "Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad". Aug 12 2022, 1:35 AM

I ended up issuing this command:

request app-engine service restart packet-forwarding-engine

This was on the back of seeing the problem described in the article below, which I can't say for certain didn't happen to us at some point:

https://supportportal.juniper.net/s/article/QFX-Traffic-drops-after-changing-lo0-address-on-QFX5110-QFX5120-platforms-with-EVPN-VXLAN?language=en_US

And it seems to have fixed it:

cmooney@elastic1100:~$ ping -4 elastic1099
PING  (10.64.135.6) 56(84) bytes of data.
64 bytes from elastic1099.eqiad.wmnet (10.64.135.6): icmp_seq=1 ttl=62 time=0.134 ms
64 bytes from elastic1099.eqiad.wmnet (10.64.135.6): icmp_seq=2 ttl=62 time=0.141 ms
64 bytes from elastic1099.eqiad.wmnet (10.64.135.6): icmp_seq=3 ttl=62 time=0.170 ms
cmooney@elastic1100:~$ sudo traceroute -I -w 1 -4 elastic1098
traceroute to elastic1098 (10.64.135.5), 30 hops max, 60 byte packets
 1  irb-1037.lsw1-f3-eqiad.eqiad.wmnet (10.64.136.1)  8.524 ms  8.507 ms  8.500 ms
 2  irb-1036.lsw1-f2-eqiad.eqiad.wmnet (10.64.135.1)  6.544 ms  6.540 ms  6.536 ms
 3  elastic1098.eqiad.wmnet (10.64.135.5)  0.134 ms * *

The only issue is that it caused a short disruption to traffic in racks F2/F3. The article didn't seem to suggest it was intrusive, but I should have been more cautious. Traffic was affected for approximately 2 minutes starting at 02:08 UTC.

I'll still follow up with Juniper to see what they say.

(The following is just related to bringing these hosts back into service.)

Pooled the hosts:

ryankemper@puppetmaster1001:~$ sudo confctl select name=elastic110.* set/pooled=yes:weight=10
The selector you chose has selected the following objects:
{"/eqiad/elasticsearch/elasticsearch": ["elastic1102.eqiad.wmnet", "elastic1101.eqiad.wmnet", "elastic1100.eqiad.wmnet"], "/eqiad/elasticsearch/elasticsearch-omega-ssl": ["elastic1100.eqiad.wmnet"], "/eqiad/elasticsearch/elasticsearch-psi-ssl": ["elastic1102.eqiad.wmnet", "elastic1101.eqiad.wmnet"], "/eqiad/elasticsearch/elasticsearch-ssl": ["elastic1101.eqiad.wmnet", "elastic1102.eqiad.wmnet", "elastic1100.eqiad.wmnet"]}
Ok to continue? [y/N]
confctl>y
eqiad/elasticsearch/elasticsearch/elastic1102.eqiad.wmnet: pooled changed no => yes
eqiad/elasticsearch/elasticsearch/elastic1102.eqiad.wmnet: weight changed 0 => 10
eqiad/elasticsearch/elasticsearch/elastic1101.eqiad.wmnet: pooled changed no => yes
eqiad/elasticsearch/elasticsearch/elastic1101.eqiad.wmnet: weight changed 0 => 10
eqiad/elasticsearch/elasticsearch/elastic1100.eqiad.wmnet: pooled changed no => yes
eqiad/elasticsearch/elasticsearch/elastic1100.eqiad.wmnet: weight changed 0 => 10
eqiad/elasticsearch/elasticsearch-omega-ssl/elastic1100.eqiad.wmnet: pooled changed no => yes
eqiad/elasticsearch/elasticsearch-omega-ssl/elastic1100.eqiad.wmnet: weight changed 10 => 10
eqiad/elasticsearch/elasticsearch-psi-ssl/elastic1102.eqiad.wmnet: pooled changed no => yes
eqiad/elasticsearch/elasticsearch-psi-ssl/elastic1102.eqiad.wmnet: weight changed 0 => 10
eqiad/elasticsearch/elasticsearch-psi-ssl/elastic1101.eqiad.wmnet: pooled changed no => yes
eqiad/elasticsearch/elasticsearch-psi-ssl/elastic1101.eqiad.wmnet: weight changed 0 => 10
eqiad/elasticsearch/elasticsearch-ssl/elastic1101.eqiad.wmnet: pooled changed no => yes
eqiad/elasticsearch/elasticsearch-ssl/elastic1101.eqiad.wmnet: weight changed 0 => 10
eqiad/elasticsearch/elasticsearch-ssl/elastic1102.eqiad.wmnet: pooled changed no => yes
eqiad/elasticsearch/elasticsearch-ssl/elastic1102.eqiad.wmnet: weight changed 0 => 10
eqiad/elasticsearch/elasticsearch-ssl/elastic1100.eqiad.wmnet: pooled changed no => yes
eqiad/elasticsearch/elasticsearch-ssl/elastic1100.eqiad.wmnet: weight changed 0 => 10
WARNING:conftool.announce:conftool action : set/pooled=yes:weight=10; selector: name=elastic110.*

Re-enabled puppet:
ryankemper@cumin1001:~$ sudo -E cumin 'elastic110*' 'run-puppet-agent --force'

We can confirm things are working from the Search Platform point of view. There's no more work for the Search Platform team, so I'm unassigning us. @cmooney I'll let you close this unless you want to keep it open to follow up with our vendor.

cmooney lowered the priority of this task from High to Low. Aug 19 2022, 4:09 PM

Thanks, yep. A case has now been opened with JTAC; I'll keep this open to document any information they may provide.

So after quite a bit of back-and-forth with Juniper, pulling logs etc., they say they can't see anything in the logs provided that sheds light on the issue.

They confirmed all the routing tables etc. look as expected and the config is good, and they expect the issue is some buggy behaviour in the PFE. They say they have internal EVPN diagnostic commands for the PFE they can run if it happens again, but they cannot share them with us, and they advise opening a TAC case again if it recurs.

I advised that, should it happen again, it's not really an option for us to leave the network in a problematic state (which restarting the PFE should correct) for an extended period while we work through a case until it gets escalated to someone who knows what to do.

Closing this now as there isn't much else we can do.