IGMP snooping breaks IPv6 ND on Junos 14.1X53-D46
Closed, Declined (Public)

Description

Since earlier today, several hosts have been unable to receive ARP packets from one another.

For example, two of those hosts are in the same VLAN and the same fabric, but on different members of that fabric.
No firewall rules or other security features are blocking that traffic.
No changes have been made to the switch fabric or the hosts before the issue started.

The two hosts I'm testing it with are elastic1049 (10.64.16.111) and elastic1038 (10.64.16.47), which are in the same vlan (/22)
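
As a quick sanity check that the two addresses really share one L2 segment (so ARP, not routing, is what matters here), a minimal sketch using only the addresses and the /22 prefix length quoted above. The helper name `network_of` is made up for illustration:

```python
import ipaddress

# Addresses and prefix length are taken from the task description.
PREFIX_LEN = 22
hosts = {
    "elastic1049": "10.64.16.111",
    "elastic1038": "10.64.16.47",
}


def network_of(address: str, prefix_len: int) -> ipaddress.IPv4Network:
    """Return the network that contains `address` for the given prefix length."""
    return ipaddress.ip_interface(f"{address}/{prefix_len}").network


networks = {name: network_of(addr, PREFIX_LEN) for name, addr in hosts.items()}
same_subnet = len(set(networks.values())) == 1

for name, net in networks.items():
    print(f"{name}: {net}")
print("same subnet:", same_subnet)
```

Both hosts resolve to the same network, so traffic between them never leaves the VLAN.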

Pings fail in both directions:

elastic1049:~$ ping 10.64.16.47
PING 10.64.16.47 (10.64.16.47) 56(84) bytes of data.
^C
--- 10.64.16.47 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2039ms
elastic1038:~$ ping 10.64.16.111
PING 10.64.16.111 (10.64.16.111) 56(84) bytes of data.
^C
--- 10.64.16.111 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1002ms

ARPing works in one direction only:

ayounsi@elastic1049:~$ sudo arping 10.64.16.47
ARPING 10.64.16.47
60 bytes from 14:02:ec:06:9e:dc (10.64.16.47): index=0 time=3.124 msec
60 bytes from 14:02:ec:06:9e:dc (10.64.16.47): index=1 time=15.394 msec
60 bytes from 14:02:ec:06:9e:dc (10.64.16.47): index=2 time=15.131 msec
60 bytes from 14:02:ec:06:9e:dc (10.64.16.47): index=3 time=6.643 msec
^C
--- 10.64.16.47 statistics ---
4 packets transmitted, 4 packets received,   0% unanswered (0 extra)
rtt min/avg/max/std-dev = 3.124/10.073/15.394/5.337 ms
elastic1038:~$ sudo arping 10.64.16.111
ARPING 10.64.16.111
Timeout
Timeout
Timeout
Timeout
^C
--- 10.64.16.111 statistics ---
5 packets transmitted, 0 packets received, 100% unanswered (0 extra)

elastic1049 (10.64.16.111) sees the ARP requests from elastic1038 (10.64.16.47) and replies to them, but elastic1038 (10.64.16.47) never sees the replies.

elastic1049:~$ sudo tcpdump arp host 10.64.16.47 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:18:05.542508 ARP, Request who-has 10.64.16.111 tell 10.64.16.47, length 46
15:18:05.542524 ARP, Reply 10.64.16.111 is-at 94:18:82:6f:18:18, length 28
15:18:06.542697 ARP, Request who-has 10.64.16.111 tell 10.64.16.47, length 46
15:18:06.542715 ARP, Reply 10.64.16.111 is-at 94:18:82:6f:18:18, length 28
15:18:07.542722 ARP, Request who-has 10.64.16.111 tell 10.64.16.47, length 46
15:18:07.542737 ARP, Reply 10.64.16.111 is-at 94:18:82:6f:18:18, length 28
elastic1038:~$ sudo tcpdump arp host 10.64.16.111
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:23:09.342561 ARP, Request who-has elastic1049.eqiad.wmnet tell elastic1038.eqiad.wmnet, length 28
15:23:09.588770 ARP, Request who-has elastic1049.eqiad.wmnet tell elastic1038.eqiad.wmnet, length 28
15:23:10.342820 ARP, Request who-has elastic1049.eqiad.wmnet tell elastic1038.eqiad.wmnet, length 28
15:23:10.588922 ARP, Request who-has elastic1049.eqiad.wmnet tell elastic1038.eqiad.wmnet, length 28
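
To make the asymmetry in the two captures explicit, here is a rough sketch that counts ARP requests and replies per capture. The capture text is copied from the tcpdump output above; the function name `arp_counts` is hypothetical:

```python
import re

# Lines copied from the two tcpdump sessions above.
CAPTURE_ELASTIC1049 = """\
15:18:05.542508 ARP, Request who-has 10.64.16.111 tell 10.64.16.47, length 46
15:18:05.542524 ARP, Reply 10.64.16.111 is-at 94:18:82:6f:18:18, length 28
15:18:06.542697 ARP, Request who-has 10.64.16.111 tell 10.64.16.47, length 46
15:18:06.542715 ARP, Reply 10.64.16.111 is-at 94:18:82:6f:18:18, length 28
15:18:07.542722 ARP, Request who-has 10.64.16.111 tell 10.64.16.47, length 46
15:18:07.542737 ARP, Reply 10.64.16.111 is-at 94:18:82:6f:18:18, length 28
"""

CAPTURE_ELASTIC1038 = """\
15:23:09.342561 ARP, Request who-has elastic1049.eqiad.wmnet tell elastic1038.eqiad.wmnet, length 28
15:23:09.588770 ARP, Request who-has elastic1049.eqiad.wmnet tell elastic1038.eqiad.wmnet, length 28
15:23:10.342820 ARP, Request who-has elastic1049.eqiad.wmnet tell elastic1038.eqiad.wmnet, length 28
15:23:10.588922 ARP, Request who-has elastic1049.eqiad.wmnet tell elastic1038.eqiad.wmnet, length 28
"""


def arp_counts(capture: str) -> dict:
    """Count ARP requests and replies seen in a tcpdump text capture."""
    counts = {"Request": 0, "Reply": 0}
    for line in capture.splitlines():
        match = re.search(r"ARP, (Request|Reply) ", line)
        if match:
            counts[match.group(1)] += 1
    return counts


on_1049 = arp_counts(CAPTURE_ELASTIC1049)
on_1038 = arp_counts(CAPTURE_ELASTIC1038)
print("elastic1049:", on_1049)
print("elastic1038:", on_1038)
```

Every request is answered on the elastic1049 side, while the elastic1038 side only ever sees its own requests, which is what points at the path between the two switch ports rather than at either host.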

This looks like a VCF issue to me.

High priority case 2018-0802-0511 opened with Juniper.

Event Timeline

ayounsi triaged this task as High priority. Aug 2 2018, 5:09 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper. Aug 2 2018, 5:09 PM
ayounsi added subscribers: mark, dcausse, Gehel, BBlack.
Restricted Application added a project: Operations. Aug 2 2018, 5:10 PM
Paladox added a subscriber: Paladox. Aug 2 2018, 5:11 PM

Bouncing the network ports of elastic1049 and elastic1038 solved the issue.

Mentioned in SAL (#wikimedia-operations) [2018-08-02T18:15:14Z] <gehel> un-banning and repooling elastic1030 - T201039

So... what's the status of this? What else has been observed, what has been done to troubleshoot and what's the latest from Juniper? I tried to access the Juniper case for more insight, but unfortunately I don't seem to have the right permissions to access this case (unrelated to this task and low-prio, but perhaps @ayounsi or @RobH can work with Juniper to figure out why?)

CCed you on the JTAC case; not sure how to make sure you have default access to all the cases.

So far the replies from JTAC have been poor; I'll escalate if we don't get a proper response.

Did the same issue happen again?

JTAC came back with troubleshooting and data-gathering commands/configuration to apply if the issue happens again.

Here they are:

1/ Add log statements to check whether the packets are getting in/out of the fabric

Filter on the ingress interface:

set firewall family ethernet-switching filter icmp-filter1 term 1 from source-address 10.71.84.109/32 >>>>you need to change the IP
set firewall family ethernet-switching filter icmp-filter1 term 1 from destination-address 10.71.85.254/32>>>>> you need to change the IP
set firewall family ethernet-switching filter icmp-filter1 term 1 from protocol icmp
set firewall family ethernet-switching filter icmp-filter1 term 1 then accept
set firewall family ethernet-switching filter icmp-filter1 term 1 then count icmp-counter1
set firewall family ethernet-switching filter icmp-filter1 term default then accept

Filter on the egress interface:
set firewall family ethernet-switching filter icmp-filter2 term 1 from source-address 10.71.85.254/32>>>>> you need to change the IP
set firewall family ethernet-switching filter icmp-filter2 term 1 from destination-address 10.71.85.254/32>>>>> you need to change the IP
set firewall family ethernet-switching filter icmp-filter2 term 1 from protocol icmp
set firewall family ethernet-switching filter icmp-filter2 term 1 then accept
set firewall family ethernet-switching filter icmp-filter2 term 1 then count icmp-counter2
set firewall family ethernet-switching filter icmp-filter2 term default then accept

2/ Perform a commit full

3/ Configure an IRB

First, we can configure an IRB interface, link it to that VLAN with an IP address, and then try to ping the host. This lets us isolate the issue to the IRB, and also shows whether the ARP packets are reaching the RE of the device. (This is a non-intrusive test and the best way to test, given that the device is in production.)

4/ Move ports, but if a down/up fixes the issue, moving the host will as well

Also, if possible, we can move those hosts to another port on the same member, in the same VLAN, to see if the issue is still present. This way we can isolate the member that is causing the conflict.

5/ Change VLANs

We can also create a test VLAN and move a host and a port to it, in case the VLAN is the one being misprogrammed, and then try to ping.

6/ Reboot device
Finally, we can schedule a maintenance window (MW) for a reboot of the device, since this usually fixes all misprogramming issues.

jcrespo added a subscriber: jcrespo. Edited Aug 7 2018, 7:59 AM

See T201139#4483590, probably more relevant here (disconnection between a B1 and a B4 host).

I think the es1014 issue has gone away (according to Grafana)?

Still no good for me (at least between prometheus1004 and es1014).

Provided all the requested info to Juniper, and their answer so far is "bounce the port", which solved the issue previously but is definitely not an acceptable long-term solution.

ayounsi added a comment. Edited Aug 8 2018, 5:51 PM

Disabled the VC link between fpc4 and fpc5 to reduce the density of links (cf. T201145#4486602).

EDIT: rolled back

JTAC dug into the VCF and confirmed that it was a misprogramming issue.
It's also non-trivial to list all hosts potentially having the same issue.

The fix, other than bouncing (down/up) the interface, is to clear the MAC table on the interface, which can cause a second of downtime.

The root cause could be the non-standard topology, or a software version mismatch: they also noticed that one of the packages on the QFX members was running an older version:

Hostname: asw2-b-eqiad
Model: qfx5100-48s-6q
Junos: 14.1X53-D46.7
JUNOS Base OS boot [14.1X53-D46.7]
JUNOS Base OS Software Suite [14.1X53-D46.7]
JUNOS Crypto Software Suite [14.1X53-D46.7]
JUNOS Online Documentation [14.1X53-D46.7]
JUNOS Kernel Software Suite [14.1X53-D46.7]
JUNOS Packet Forwarding Engine Support (qfx-ex-x86-32) [14.1X53-D46.7]
JUNOS Routing Software Suite [14.1X53-D46.7]
JUNOS SDN Software Suite [14.1X53-D46.7]
JUNOS Enterprise Software Suite [14.1X53-D46.7]
JUNOS Web Management Platform Package [14.1X53-D46.7]
JUNOS py-base-i386 [14.1X53-D46.7]
JUNOS Host Software [14.1X53-D35.3]    <--------
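
Spotting this kind of mismatch by eye is easy to miss across many members. A rough sketch that flags any package whose bracketed version differs from the `Junos:` line, assuming the output format shown above; the sample text is an excerpt of that output and `mismatched_packages` is a made-up helper name:

```python
import re

# Excerpt of the version output quoted above.
SHOW_VERSION = """\
Hostname: asw2-b-eqiad
Model: qfx5100-48s-6q
Junos: 14.1X53-D46.7
JUNOS Base OS boot [14.1X53-D46.7]
JUNOS Packet Forwarding Engine Support (qfx-ex-x86-32) [14.1X53-D46.7]
JUNOS py-base-i386 [14.1X53-D46.7]
JUNOS Host Software [14.1X53-D35.3]
"""


def mismatched_packages(output: str) -> list:
    """Return (package, version) pairs whose version differs from the Junos line."""
    junos = re.search(r"^Junos:\s*(\S+)", output, re.MULTILINE)
    if junos is None:
        raise ValueError("no 'Junos:' line found")
    expected = junos.group(1)
    packages = re.findall(r"^(JUNOS .+?) \[([^\]]+)\]", output, re.MULTILINE)
    return [(name, version) for name, version in packages if version != expected]


mismatches = mismatched_packages(SHOW_VERSION)
for name, version in mismatches:
    print(f"{name}: {version}")
```

On the excerpt above this reports only the Host Software package.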

To fix that, the software upgrade command needs to have the keyword force-host.
From the doc:

force-host—(Optional) Force the addition of host software package or bundle (ignore warnings) on the QFX5100 device.

and

Note: On QFX5100 and EX4600 switches, the Host OS is not upgraded automatically, so you must use the force-host option if you want the Junos OS and Host OS versions to be the same.
However, pay attention to these notes regarding Junos OS and Host OS versions:

The Junos OS and Host OS versions do not need to be the same.
During an ISSU, the Host OS cannot be upgraded.
Upgrading the Host OS is not required for every software upgrade, as noted above.

Are we still good for Thursday at 16:00 UTC for row B?

ayounsi added a comment. Edited Oct 3 2018, 3:07 PM

Still good. Here is the list of hosts currently on the new asw2-b-eqiad that will be impacted by the 2h maintenance window on Thursday 4th at 16:00 UTC (with a worst case of 30min of downtime for those hosts, and a best case of no impact). Will add the step-by-step changes needed shortly.

analytics1046 
analytics1047 
analytics1048 
analytics1049 
analytics1050 
analytics1051 
analytics1061 
analytics1062 
analytics1063 
analytics1072
analytics1073
an-coord1001
an-master1002
aqs1008
authdns1001
cloudelastic1002
cloudnet1003 eth0
cloudnet1003 eth1
cloudservices1003
cloudvirt1023 eth0
cloudvirt1023 eth1
cloudvirt1024 eth0
cloudvirt1024 eth1
conf1005
cp1079
cp1080
cp1081
cp1082
db1072
db1073
db1076
db1077
db1083
db1084
db1085
db1086
db1098
db1099
db1104
db1112
db1113
db1118
db1119
db1124
dbproxy1004
dbproxy1005
dbproxy1006
dbproxy1014
dbproxy1015
druid1005
elastic1028
elastic1036
elastic1037
elastic1038
elastic1039
elastic1046
elastic1047
elastic1049
elastic1050
es1013
es1014
graphite1004
iron
kafka1002
kafka-jumbo1003
kubernetes1002
kubestage1002
labcontrol1004
labnet1001:eth0
labnet1001:eth1
labnet1002:eth0
labnet1002:eth1
labnet1004 eth0
labnet1004 eth1
labvirt1015-eth0
labvirt1015-eth1
labvirt1019:eth0
labvirt1019:eth1
labvirt1020 eth0
labvirt1020 eth1
labvirt1021:eth0
labvirt1021:eth1
labvirt1022:eth0
labvirt1022:eth1
labweb1001
logstash1005
lvs1001:eth1
lvs1002:eth1
lvs1003:eth1 
lvs1004
lvs1005
lvs1006
lvs1015:enp4s0f1
lvs1016:enp5s0f0 {#3931}
maps1002
mc1024
mc1025
mc1026
mc1027
ms-be1016
ms-be1017
ms-be1018
ms-be1020
ms-be-1022
ms-be-1023
ms-be1031
ms-be1032
ms-be1034
ms-be1041
mw1284
mw1285
mw1286
mw1287
mw1288
mw1289
mw1290       
mw1293
mw1294
mw1295
mw1296
mw1297
mw1298
mw1299
mw1300
mw1301
mw1302
mw1303
mw1304
mw1305
mw1306
mw1313
mw1314
mw1315
mw1316
mw1317
mw1318
mwmaint1002  
notebook1003
ores1003
ores1004
phab1001
prometheus1004
promethium
rdb1004
rdb1009
restbase-dev1005
rhodium
ripe atlas
ruthenium    
scb1002
snapshot1008
thumbnor1001
thumbnor1002
wdqs1007
wdqs1009
wtp1031
wtp1032
wtp1033
wtp1034
wtp1035

ayounsi added a comment. Edited Oct 3 2018, 5:47 PM

Steps to migrate asw2-b-eqiad to a supported topology.

These steps are what I think would prevent or minimize downtime, but due to the nature of the VC fabric, it's not possible to guarantee that it will remain stable during the re-cabling.

If instability arises, the best course is to finish the re-cabling, as the final state has been confirmed to be stable on asw2-a.

Step 1)

  • Enable all VC ports on FPC2 and FPC7
request virtual-chassis vc-port set pic-slot 0 port 51 member 2
request virtual-chassis vc-port set pic-slot 0 port 52 member 2
request virtual-chassis vc-port set pic-slot 0 port 53 member 2
request virtual-chassis vc-port set pic-slot 0 port 51 member 7
request virtual-chassis vc-port set pic-slot 0 port 52 member 7
request virtual-chassis vc-port set pic-slot 0 port 53 member 7
  • Disconnect the following:
    • fpc1-fpc3
    • fpc1-fpc8
    • fpc3-fpc4
    • fpc3-fpc5
    • fpc4-fpc5
    • fpc4-fpc6
    • fpc6-fpc8
  • Connect fpc5:1/1-fpc7:0/51 (5m DAC)
  • Enable fpc5-fpc7
request virtual-chassis vc-port set pic-slot 1 port 3 member 5
  • Disable fpc5-fpc6

request virtual-chassis vc-port delete pic-slot 1 port 1 member 6
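
The six `vc-port set` commands at the top of Step 1 follow a single pattern, so they can be generated rather than typed by hand. A small sketch, using only the command syntax, members and ports listed above; `vc_port_set_commands` is a hypothetical helper:

```python
from itertools import product


def vc_port_set_commands(members, ports, pic_slot=0):
    """Build one 'vc-port set' command per (member, port) pair."""
    return [
        f"request virtual-chassis vc-port set pic-slot {pic_slot} port {port} member {member}"
        for member, port in product(members, ports)
    ]


# FPC2 and FPC7, ports 51-53 on pic-slot 0, as in Step 1.
commands = vc_port_set_commands(members=[2, 7], ports=[51, 52, 53])
print("\n".join(commands))
```

The output is meant for review before pasting, not for pushing blindly to the switch.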

Step 2)

  • Connect/enable fpc2:0/51-fpc5:1/0 (5m DAC)
request virtual-chassis vc-port set pic-slot 0 port 51 member 2
request virtual-chassis vc-port set pic-slot 1 port 0 member 5

Step 3)

  • Add missing links
  • fpc1:1/1-fpc7 (40G optics + fiber)
  • fpc3:1/0-fpc7 (7m DAC)
  • fpc6:1/0-fpc2 (7m DAC)
  • Confirm all links are working (except fpc8-fpc2)
  • Replace fpc4-fpc7 with 5m DAC
  • Add last link: fpc8-fpc2 (40G optics + fiber)
  • Delete VC ports not used anymore (leafs only) (exact ports depend on previous cabling)

The following hosts (aside from the ones above) will need to be downtimed too:
db1117, db2042 and db2078 (they replicate from db1072 and db1073)
db2037 (replicates from db)
labsdb1009, labsdb1010 and labsdb1011 (they replicate from db1124)

Mentioned in SAL (#wikimedia-operations) [2018-10-04T15:41:46Z] <elukey> depool kafka1002 from eventbus as precautionary step for T201039

Mentioned in SAL (#wikimedia-operations) [2018-10-04T15:56:23Z] <arturo> icinga downtime every server with the cloudXXXX scheme for 2h T201039

Mentioned in SAL (#wikimedia-operations) [2018-10-04T15:58:33Z] <arturo> icinga downtime every server in the main cloudvps deployment for 2h T201039

Mentioned in SAL (#wikimedia-operations) [2018-10-04T16:00:12Z] <marostegui> Stop MySQL on db1073 for mariadb and kernel upgrade - T201039 T148507

Mentioned in SAL (#wikimedia-operations) [2018-10-04T16:13:57Z] <XioNoX> starting asw2-b-eqiad re-cabling - T201039

Mentioned in SAL (#wikimedia-operations) [2018-10-04T16:14:21Z] <XioNoX> Enable all VC ports on FPC2 and FPC7 - T201039

Mentioned in SAL (#wikimedia-operations) [2018-10-04T16:39:42Z] <XioNoX> Enable fpc5-fpc7 - T201039

Mentioned in SAL (#wikimedia-operations) [2018-10-04T16:41:22Z] <XioNoX> Connect/enable fpc2:0/51-fpc5:1/0 (5m DAC) - T201039

Mentioned in SAL (#wikimedia-operations) [2018-10-04T16:52:00Z] <XioNoX> Step 3) Add missing links - T201039

Mentioned in SAL (#wikimedia-operations) [2018-10-04T17:38:44Z] <XioNoX> asw2-b-eqiad recabling done - T201039

ayounsi added a subscriber: Cmjohnson. Edited Oct 4 2018, 6:48 PM

Some post-maintenance notes:

  • New optics are needed to connect fpc8 to fpc2 (all spares have been used) - opened T206340
  • Some members went briefly (<30s) offline during the re-cabling, taking down servers connected to them:

17:02:04 fpc8
~17:05: fpc7 and what was temporarily single homed to it (fpc6/8)
17:09:08 fpc6
17:10:21 fpc6

  • IPv6 ND is not working between lvs1002 and phab1001 (which looks similar to the issues that led to the creation of this task)

lvs1002 has been depooled. asw2-b's switch port to lvs1002 has been bounced with no success; the next step is to bounce phab1001's switch port.

Mentioned in SAL (#wikimedia-operations) [2018-10-04T21:27:46Z] <XioNoX> bounce phab1001 switch port - T201039

ayounsi mentioned this in Unknown Object (Task). Oct 5 2018, 4:07 PM

Opened Juniper case 2018-1005-0549 about the ND issue.

ema added a subscriber: ema. Oct 8 2018, 12:56 PM

cp1081 and cp1079, both on asw2-b-eqiad, are also having IPv6 connectivity issues with lvs1001:

12:53:09 ema@lvs1001.wikimedia.org:~
$ curl  http://localhost:9090/pools/textlb6_443
cp1081.eqiad.wmnet:	enabled/down/not pooled
cp1083.eqiad.wmnet:	enabled/up/pooled
cp1085.eqiad.wmnet:	enabled/up/pooled
cp1087.eqiad.wmnet:	enabled/up/pooled
cp1075.eqiad.wmnet:	enabled/up/pooled
cp1079.eqiad.wmnet:	enabled/down/not pooled
cp1089.eqiad.wmnet:	enabled/up/pooled
cp1077.eqiad.wmnet:	enabled/up/pooled

Everything looks fine IPv4-wise:

12:53:16 ema@lvs1001.wikimedia.org:~
$ curl  http://localhost:9090/pools/textlb_443
cp1081.eqiad.wmnet:	enabled/up/pooled
cp1083.eqiad.wmnet:	enabled/up/pooled
cp1085.eqiad.wmnet:	enabled/up/pooled
cp1087.eqiad.wmnet:	enabled/up/pooled
cp1075.eqiad.wmnet:	enabled/up/pooled
cp1079.eqiad.wmnet:	enabled/up/pooled
cp1089.eqiad.wmnet:	enabled/up/pooled
cp1077.eqiad.wmnet:	enabled/up/pooled
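
To pick out the backends that are down over IPv6 only, the two pool listings can be diffed. A minimal sketch, assuming the `host: enabled/state/pooled` line format shown in the curl output above; the pool text is copied from it and `down_hosts` is a made-up helper name:

```python
# Pool states copied from the two curl outputs above (tab after the colon).
TEXTLB6_443 = """\
cp1081.eqiad.wmnet:\tenabled/down/not pooled
cp1083.eqiad.wmnet:\tenabled/up/pooled
cp1085.eqiad.wmnet:\tenabled/up/pooled
cp1087.eqiad.wmnet:\tenabled/up/pooled
cp1075.eqiad.wmnet:\tenabled/up/pooled
cp1079.eqiad.wmnet:\tenabled/down/not pooled
cp1089.eqiad.wmnet:\tenabled/up/pooled
cp1077.eqiad.wmnet:\tenabled/up/pooled
"""

TEXTLB_443 = """\
cp1081.eqiad.wmnet:\tenabled/up/pooled
cp1083.eqiad.wmnet:\tenabled/up/pooled
cp1085.eqiad.wmnet:\tenabled/up/pooled
cp1087.eqiad.wmnet:\tenabled/up/pooled
cp1075.eqiad.wmnet:\tenabled/up/pooled
cp1079.eqiad.wmnet:\tenabled/up/pooled
cp1089.eqiad.wmnet:\tenabled/up/pooled
cp1077.eqiad.wmnet:\tenabled/up/pooled
"""


def down_hosts(pool_output: str) -> set:
    """Return the hosts whose middle status field is 'down'."""
    down = set()
    for line in pool_output.splitlines():
        host, _, state = line.partition(":")
        fields = state.strip().split("/")
        if len(fields) == 3 and fields[1] == "down":
            down.add(host.strip())
    return down


v6_only_down = sorted(down_hosts(TEXTLB6_443) - down_hosts(TEXTLB_443))
print(v6_only_down)
```

Only cp1079 and cp1081 come out, matching the two hosts named above.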

Change 465161 had a related patch set uploaded (by Ema; owner: Ema):
[operations/dns@master] Revert "Revert "traffic: Depool eqiad from user traffic for switchover""

https://gerrit.wikimedia.org/r/465161

Mentioned in SAL (#wikimedia-operations) [2018-10-08T13:09:19Z] <ema> depool eqiad front-edge traffic T201039

Change 465161 merged by Ema:
[operations/dns@master] Revert "Revert "traffic: Depool eqiad from user traffic for switchover""

https://gerrit.wikimedia.org/r/465161

ayounsi added a comment. Edited Oct 8 2018, 4:27 PM

Working with JTAC on this.

Here is a tcpdump capture of a neighbor solicitation packet being sent from lvs1002:

lvs1002:~$ sudo tcpdump -p -i eth1.1018 icmp6 -vvv -nn
tcpdump: listening on eth1.1018, link-type EN10MB (Ethernet), capture size 262144 bytes
15:23:44.333359 IP6 (flowlabel 0xf2e7c, hlim 255, next-header ICMPv6 (58) payload length: 32) fe80::1a03:73ff:fef0:8ede > ff02::1:ff16:100: [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, who has 2620:0:861:102:10:64:16:100
	  source link-address option (1), length 8 (1): 18:03:73:f0:8e:de
	    0x0000:  1803 73f0 8ede
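
As a side check that the solicitation really originates from the NIC named in the source link-address option, the link-local source address can be re-derived from that MAC. A sketch under the assumption that the host uses a modified EUI-64 interface identifier (which the capture is consistent with); `eui64_link_local` is a hypothetical helper:

```python
import ipaddress


def eui64_link_local(mac: str) -> ipaddress.IPv6Address:
    """Derive the fe80::/64 address built from a MAC via modified EUI-64."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02  # flip the universal/local bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    # fe80 followed by six zero bytes, then the 64-bit interface identifier.
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id)


# MAC taken from the source link-address option in the capture above.
link_local = eui64_link_local("18:03:73:f0:8e:de")
print(link_local)
```

The result matches the source address seen in the capture.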

Confirming that phab1001 is subscribed to the NS multicast group:

phab1001:~$ netstat -g
IPv6/IPv4 Group Memberships
Interface       RefCnt Group
--------------- ------ ---------------------
[...]
eth0            1      ff02::1:ff16:100

Going to push the following to the two interfaces to check if indeed the packets are making it into the fabric but not going out:

[edit interfaces ge-4/0/14]
+    unit 0 {
+        family ethernet-switching {
+            filter {
+                output v6-ns-phab1001;
+            }
+        }
+    }
[edit interfaces ge-6/0/46]
+    unit 0 {
+        family ethernet-switching {
+            filter {
+                input v6-ns-lvs1002;
+            }
+        }
+    }
[edit]
+  firewall {
+      family ethernet-switching {
+          filter v6-ns-lvs1002 {
+              interface-specific;
+              term 1 {
+                  from {
+                      protocol icmp6;
+                      ip-version {
+                          ipv6 {
+                              ip6-source-address {
+                                  fe80::1a03:73ff:fef0:8ede/128;
+                              }
+                              ip6-destination-address {
+                                  ff02::1:ff16:100/128;
+                              }
+                          }
+                      }
+                  }
+                  then {
+                      accept;          
+                      count v6-ns-lvs1002;
+                  }
+              }
+              term default {
+                  then accept;
+              }
+          }
+          filter v6-ns-phab1001 {
+              interface-specific;
+              term 1 {
+                  from {
+                      protocol icmp6;
+                  }
+                  then {
+                      accept;
+                      count v6-ns-phab1001;
+                  }
+              }
+              term default {
+                  then accept;
+              }
+          }
+      }
+  }

EDIT: made the phab1001 (egress) filter broader, because matching on IPv6 addresses there fails the commit check:

[edit interfaces ge-4/0/14 unit 0 family ethernet-switching]
  'filter'
    Referenced filter 'v6-ns-phab1001' can not be used as ip-version not supported on egress
error: configuration check-out failed

Mentioned in SAL (#wikimedia-operations) [2018-10-08T16:29:31Z] <XioNoX> push firewall filter counters on asw2-b-eqiad - T201039

Followed up with JTAC, we can see the NS packets making it into the fabric:

# run show firewall    
Filter: v6-ns-lvs1002-ge-6/0/46.0-i                            
Counters:
Name                                                Bytes              Packets
v6-ns-lvs1002-ge-6/0/46.0-i                           282                    3

The phab1001 side though can't filter on egress v6 IPs, and filtering on ICMPv6 is too broad.

Filter: v6-ns-phab1001-ge-4/0/14.0-o                           
Counters:
Name                                                Bytes              Packets
v6-ns-phab1001-ge-4/0/14.0-o                    122999323               100225

tcpdump on the destination still doesn't show the Neighbor Solicitation packets.

Mentioned in SAL (#wikimedia-operations) [2018-10-08T19:41:50Z] <XioNoX> troubleshooting asw2-b-eqid with JTAC - T201039

ayounsi added a comment. Edited Oct 8 2018, 7:43 PM

Temporarily disabling IGMP snooping on the two interfaces (by marking them as multicast-router interfaces) to narrow down the issue.

[edit protocols igmp-snooping vlan all]
+     interface ge-6/0/46.0 {
+         multicast-router-interface;
+     }
+     interface ge-4/0/14.0 {
+         multicast-router-interface;
+     }

Edit: this fixed the issue for those specific hosts. Even after reverting the change, those two hosts can still do neighbor discovery with each other.

JTAC is looking for similar issues in their database.
Edit: possible match PR1263535, fixed in Junos 14.1X53-D47.

cp1081 and cp1079, both on asw2-b-eqiad, are also having IPv6 connectivity issues with lvs1001:

I can ping them both now from lvs1001, so not sure if the change above also fixed that or something else did.

faidon added a comment. Oct 9 2018, 1:40 AM

This sounds a lot like T133387, which we reported a while back and had ATAC and engineering involved...

Mentioned in SAL (#wikimedia-operations) [2018-10-09T09:00:47Z] <ema> re-enable puppet/pybal on lvs1002, IPv6 connectivity with phab1001 working again T201039

Mentioned in SAL (#wikimedia-operations) [2018-10-09T19:25:02Z] <XioNoX> disable igmp-snooping on asw2-b-eqiad - T201039

Mentioned in SAL (#wikimedia-operations) [2018-10-09T19:37:12Z] <XioNoX> disable igmp-snooping on asw2-c-eqiad - T201039

ayounsi mentioned this in Unknown Object (Task). Oct 16 2018, 8:28 AM
ayounsi renamed this task from "connectivity issues between several hosts on asw2-b-eqiad" to "IGMP snooping breaks IPv6 ND on Junos 14.1X53-D46". Jan 16 2019, 11:37 PM
ayounsi lowered the priority of this task from High to Medium.

Reducing priority as the situation is stable.
At this point I don't think the cost of upgrading the switch stacks of row B and C (full row down for ~15min) is worth the advantage of enabling IGMP snooping.

The current impact of disabling IGMP snooping is that every host receives an extra ~1.5k packets/s, most likely dropped by the NIC. That is not negligible, but it is not impacting either.

@faidon @BBlack thoughts?

ayounsi lowered the priority of this task from Medium to Low. Jan 17 2019, 4:56 PM

Discussed it with Brandon, it's still something we want to fix but is now low priority.
We will probably have to wait for the next DC failover or another more urgent reason to upgrade.

Just to follow up on es1014: the issue seems to be gone:

jynus@prometheus1004:~$ ping es1014.eqiad.wmnet
PING es1014.eqiad.wmnet (10.64.16.187) 56(84) bytes of data.
64 bytes from es1014.eqiad.wmnet (10.64.16.187): icmp_seq=1 ttl=64 time=0.129 ms
64 bytes from es1014.eqiad.wmnet (10.64.16.187): icmp_seq=2 ttl=64 time=0.157 ms
64 bytes from es1014.eqiad.wmnet (10.64.16.187): icmp_seq=3 ttl=64 time=0.160 ms

RobH removed a subscriber: RobH. Mar 3 2020, 6:14 PM
ayounsi closed this task as Declined. Thu, Jul 16, 8:32 AM

IGMP snooping removed from all switches with T257573