Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw)
Closed, Resolved · Public

Description

In codfw, HTCP purge traffic (which is multicast) is being delivered on most of our networks. For example, the Parsoid codfw boxes receive HTCP purge traffic[1]. This happens because the switches flood the HTCP traffic to all of the boxes, as if it were broadcast, instead of forwarding it only to group members. It suffices for one box on a switch to join the multicast purge group for all boxes on that switch to start receiving the traffic.

The solution is old and well known: IGMP snooping. Before Junos 13.2[2], igmp-snooping was enabled by default on all VLANs. We are on 14.1, where that no longer holds, so it seems we need to configure IGMP snooping manually. Doing that is easy, but it seems to break IPv6 multicast, which breaks router advertisements (RAs) and hence IPv6 connectivity. This used to be a bug in the EX series, but it was reportedly fixed years ago; it is unclear whether it has resurfaced in the QFX5100.

MLD snooping, the IPv6 equivalent of IGMP snooping, would solve the problem as well and is in fact the preferred solution, but while the EX series seems to support it, the QFX series does not seem to yet.
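For illustration, the flooding behaviour described above can be modelled with a toy switch. This is a minimal sketch and not how Junos implements snooping: the `Switch` class, the port names and the group address are all hypothetical, and real IGMP snooping also handles multicast-router ports, queriers and membership timeouts, which are left out here.

```python
from collections import defaultdict

# Hypothetical group address, standing in for the HTCP purge group.
PURGE_GROUP = "239.0.0.1"


class Switch:
    """Toy L2 switch that only tracks multicast group membership per port."""

    def __init__(self, ports, igmp_snooping):
        self.ports = set(ports)
        self.igmp_snooping = igmp_snooping
        self.members = defaultdict(set)  # group -> ports that sent an IGMP join

    def join(self, port, group):
        """Record that a host behind `port` joined `group`."""
        self.members[group].add(port)

    def deliver(self, ingress_port, group):
        """Return the ports that receive a frame addressed to `group`."""
        if self.igmp_snooping:
            targets = self.members[group]  # only ports with a known member
        else:
            targets = self.ports  # no snooping: multicast is flooded like broadcast
        return sorted(set(targets) - {ingress_port})


# Hypothetical port names: one cache host joins the group, two Parsoid hosts do not.
PORTS = ["uplink", "cache1", "parsoid1", "parsoid2"]

flooding = Switch(PORTS, igmp_snooping=False)
flooding.join("cache1", PURGE_GROUP)
print(flooding.deliver("uplink", PURGE_GROUP))  # ['cache1', 'parsoid1', 'parsoid2']

snooping = Switch(PORTS, igmp_snooping=True)
snooping.join("cache1", PURGE_GROUP)
print(snooping.deliver("uplink", PURGE_GROUP))  # ['cache1']
```

Without snooping the Parsoid ports receive the purge traffic even though they never joined the group; with snooping only the member port does.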

We need to investigate this further and find a way to enable IGMP snooping in codfw without breaking IPv6.

[1] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Parsoid%20codfw&m=cpu_report&r=custom&s=by%20name&hc=4&mc=2&cs=04%2F22%2F2016%2000%3A00&ce=04%2F22%2F2016%2001%3A00&st=1461331938&g=network_report&z=large
[2] Note: IGMP snooping is enabled by default on the default VLAN only. With versions of Junos OS for the QFX Series previous to 13.2, IGMP snooping is enabled by default on all VLANs. https://www.juniper.net/documentation/en_US/junos14.1/topics/concept/igmp-snooping-qfx-series-overview.html

Event Timeline

Restricted Application added a subscriber: Aklapper.
fgiunchedi triaged this task as Medium priority.Apr 27 2016, 3:07 PM
akosiaris moved this task from This quarter to Backlog on the netops board.

FTR, this still holds true today. There isn't really any reason it should have been fixed, just noting it.

Some more information about this. After quite a bit of debugging, I've gathered the following facts:

  • The issue is present across all CODFW rows as well as asw2-d-eqiad
  • The issue only manifests for the QFX5100 members of each switch stack, that is, 2 members out of every stack. The EX4300s do NOT exhibit this behavior.
  • The issue only manifests for the VLAN(s) IGMP snooping was configured for.
  • IPv6 RAs (destination MAC address 33:33:00:00:00:01 and destination IPv6 address ff02::1) seem to be the only packets blocked. These are link-local multicast packets. The rest of the Neighbor Discovery Protocol seems to be working fine (at least in the few simple cases I checked).
  • Nothing in the Juniper changelogs/KB points to this being a known issue (yet).
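The MAC/IPv6 pair quoted in the list above is not arbitrary: an IPv6 multicast address maps to the Ethernet address 33:33 followed by its low-order 32 bits (RFC 2464), and an IPv4 multicast address maps to 01:00:5e followed by its low-order 23 bits (RFC 1112). A minimal sketch of that mapping follows; the function name `multicast_mac` is made up for this example.

```python
import ipaddress


def multicast_mac(address: str) -> str:
    """Return the Ethernet multicast MAC that an IP multicast address maps to."""
    ip = ipaddress.ip_address(address)
    if not ip.is_multicast:
        raise ValueError(f"{address} is not a multicast address")
    if ip.version == 6:
        # RFC 2464: 33:33 followed by the low-order 32 bits of the address.
        low = int(ip) & 0xFFFFFFFF
        octets = [0x33, 0x33] + list(low.to_bytes(4, "big"))
    else:
        # RFC 1112: 01:00:5e followed by the low-order 23 bits of the address.
        low = int(ip) & 0x7FFFFF
        octets = [0x01, 0x00, 0x5E] + list(low.to_bytes(3, "big"))
    return ":".join(f"{octet:02x}" for octet in octets)


print(multicast_mac("ff02::1"))    # 33:33:00:00:00:01 (all-nodes, used by RAs)
print(multicast_mac("224.0.0.1"))  # 01:00:5e:00:00:01
```

Both families end up as plain L2 multicast frames, which is presumably why a change in how the switch handles IPv4 multicast can also affect IPv6 RAs.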

This is almost certainly a Juniper bug. asw-a/b/c/d-codfw currently run JunOS 14.1X53-D27.3 and asw2-d-eqiad runs 14.1X53-D35.3, the currently JTAC-recommended release.

Since this issue has persisted across releases, is easily reproducible (thanks to @akosiaris' investigative work), and affects multiple independent devices of ours, even across datacenters (so different IPv6-RA speakers), my suggested course of action is to file a Juniper case for this. The alternative would be to start upgrading to random JunOS releases and potentially hit other bugs in the process, so I'd rather avoid that.

I tried doing so this morning, but unfortunately I'm unable to, because of our expired warranty. For asw-d-codfw, I'm getting:

Parent SN :  
Service :  SVC-CP-QFX5100S4
Contract ID :  0060415755
Contract End Date  :  29-October-2016
Our records indicate the serial number entered is out of warranty and the service contract has expired. Therefore the device is NOT entitled for technical support.

…and for asw2-d-eqiad I'm getting:

Our records indicate the serial number you entered is covered only under Hardware Warranty for repair or replacement. If you wish to proceed with a Tech case for repair or replacement, select “Yes for Waranty RMA Creation”.

If you feel you are receiving this message in error or you have purchased service recently, please open an Admin Case or Call Customer Care for assistance.

This is thus currently blocked on T147518.

faidon renamed this task from HTCP purges flood across CODFW to Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw).Nov 10 2016, 11:16 AM
faidon raised the priority of this task from Medium to High.
faidon updated the task description.
faidon mentioned this in Unknown Object (Task).Nov 10 2016, 11:22 AM

This has been opened with JTAC as case 2016-1125-0413.

JTAC thinks this may be PR957108:

Title: IPv6 neighbor discovery packets might get dropped after enabling IGMP snooping
Release Note: On EX4300 Series switches, IPv6 neighbor discovery multicast packets (e.g. neighbor solicitation packets) might get dropped in a VLAN which has IGMP snooping enabled.
Severity: Major
Status: Closed
Last Modified: 2015-04-24 02:23:14 PDT
Resolved In: 13.2X51-D15 15.1R1
Product: EX Series

The description fits perfectly, however it's marked as affecting the EX series, but not QFX series, which is the exact opposite of what we're seeing in our setup.

JTAC recommended moving to 15.1 and seeing if we can reproduce. I am holding off for now in order to find out if a fix is coming for the (JTAC-recommended) 14.1 train. Depending on how that goes, we might be in for an upgrade.

Mentioned in SAL (#wikimedia-operations) [2016-12-01T13:21:35Z] <paravoid> Upgrading asw2-d-eqiad to JunOS 15.1R5 (T133387)

So first of all, JTAC said there is no ETA for this fix getting into 14.1 and we should really go with 15.1.

So, I tried upgrading to 15.1R5.5 today, which worked on EX4300 but failed on QFX5100s with:

ERROR: Cannot install jinstall-qfx-5-15.1R5.5-domestic-signed(bare-metal image) on qfx5100-48s-6q VM
ERROR: Use JUNOS VM package

I rebooted anyway and the JunOS mismatch broke the virtual chassis, for obvious reasons.

After spending quite a bit of time trying to debug this mysterious error message, I realized that 15.1R5.5 was released… yesterday; I examined this .tgz and another QFX one's .tgz and they looked quite different inside.

I then fell back to 15.1R4.6, which worked as expected on both EX/QFX, so that's what we're sticking with for now.

Unfortunately, 15.1R4.6 has not solved the problem. I just managed to reproduce it with the exact same procedure and results: enabling IGMP snooping on vlan private1-d-eqiad and running tcpdump on a 10G port of each of the 2 QFX members. Neither would receive the IPv6 RAs for that VLAN, but both would continue to receive RAs for the other VLANs.

I responded to Juniper with the results of the above test, it's back with them now…

After a few back and forths with JTAC, the case was escalated to the Advanced TAC (aka ATAC). The issue was thankfully replicated in their lab.

This has now been raised as PR 1238906 (internal, not visible externally until it's fixed) with Juniper's Engineering department and they are currently investigating.

Just in:

Engineering has fixed PR 1238906 has been fixed through master PR 1205416, and the fix would be available 14.1X53-D42 onwards, scheduled to release tentatively on Feb 14,2017.

faidon changed the task status from Open to Stalled.Jan 25 2017, 11:12 AM

Mentioned in SAL (#wikimedia-operations) [2017-03-22T10:09:44Z] <akosiaris> cr1-eqiad: set ae4 and members to disable. T133387

Mentioned in SAL (#wikimedia-operations) [2017-03-22T10:15:55Z] <akosiaris> Upgrading asw2-d-eqiad to JunOS 14.1X53 (T133387)

Mentioned in SAL (#wikimedia-operations) [2017-03-22T10:53:02Z] <akosiaris> cr1-eqiad: set ae4 and members to enable again. T133387

Mentioned in SAL (#wikimedia-operations) [2017-03-22T11:15:20Z] <akosiaris> Enable IGMP snooping for private1-d-eqiad. T133387

Mentioned in SAL (#wikimedia-operations) [2017-03-22T11:15:27Z] <akosiaris> Enable IGMP snooping for private1-d-eqiad on asw2-d. T133387

akosiaris changed the task status from Stalled to Open.Mar 22 2017, 11:38 AM

The upgrade to 14.1X53-D42.3 seems to have resolved the issue, at least partially. In tcpdumps taken before enabling IGMP snooping on private1-d-eqiad, I was able to see IPv6 RAs on both access and trunk ports:

access:

11:02:52.784907 84:18:88:0d:df:c4 > 33:33:00:00:00:01, ethertype IPv6 (0x86dd), length 110: fe80::1 > ff02::1: ICMP6, router advertisement, length 56

trunk:

11:02:44.811243 84:18:88:0d:df:c4 > 33:33:00:00:00:01, ethertype 802.1Q (0x8100), length 114: vlan 1004, p 0, ethertype IPv6, fe80::1 > ff02::1: ICMP6, router advertisement, length 56
11:02:52.785621 84:18:88:0d:df:c4 > 33:33:00:00:00:01, ethertype 802.1Q (0x8100), length 114: vlan 1020, p 0, ethertype IPv6, fe80::1 > ff02::1: ICMP6, router advertisement, length 56

Note that in the trunk case, RAs for both VLANs 1004 and 1020 are captured by tcpdump.

After enabling IGMP snooping on private1-d-eqiad, the situation remains unchanged for access ports:

access:

11:36:07.879985 84:18:88:0d:df:c4 > 33:33:00:00:00:01, ethertype IPv6 (0x86dd), length 110: fe80::1 > ff02::1: ICMP6, router advertisement, length 56

However, the IPv6 RAs on the trunk port are no longer there for the private1-d-eqiad VLAN (VLAN ID 1020):

trunk:

11:34:58.917630 84:18:88:0d:df:c4 > 33:33:00:00:00:01, ethertype 802.1Q (0x8100), length 114: vlan 1004, p 0, ethertype IPv6, fe80::1 > ff02::1: ICMP6, router advertisement, length 56

In our topology, that's fine, I think, given we don't have any reliance on trunk ports passing through that kind of traffic. But overall I would say part of the bug still exists. Correct me if I am wrong though.
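To make the before/after comparison of the trunk captures less error-prone than reading them by eye, the VLAN IDs carrying RAs can be extracted from the tcpdump output. The sketch below is only an illustration: it reuses the trunk lines quoted in this comment as input, and the regular expression and the helper name `vlans_with_ras` are assumptions that fit this particular tcpdump output format.

```python
import re

# Matches the 802.1Q VLAN ID on lines that tcpdump labels as router advertisements.
RA_ON_VLAN = re.compile(r"vlan (\d+),.*router advertisement")


def vlans_with_ras(tcpdump_lines):
    """Return the set of VLAN IDs on which at least one RA was captured."""
    vlans = set()
    for line in tcpdump_lines:
        match = RA_ON_VLAN.search(line)
        if match:
            vlans.add(int(match.group(1)))
    return vlans


# Trunk port capture before enabling IGMP snooping on private1-d-eqiad.
BEFORE = [
    "11:02:44.811243 84:18:88:0d:df:c4 > 33:33:00:00:00:01, ethertype 802.1Q (0x8100), length 114: vlan 1004, p 0, ethertype IPv6, fe80::1 > ff02::1: ICMP6, router advertisement, length 56",
    "11:02:52.785621 84:18:88:0d:df:c4 > 33:33:00:00:00:01, ethertype 802.1Q (0x8100), length 114: vlan 1020, p 0, ethertype IPv6, fe80::1 > ff02::1: ICMP6, router advertisement, length 56",
]

# Trunk port capture after enabling IGMP snooping on private1-d-eqiad.
AFTER = [
    "11:34:58.917630 84:18:88:0d:df:c4 > 33:33:00:00:00:01, ethertype 802.1Q (0x8100), length 114: vlan 1004, p 0, ethertype IPv6, fe80::1 > ff02::1: ICMP6, router advertisement, length 56",
]

missing = vlans_with_ras(BEFORE) - vlans_with_ras(AFTER)
print(sorted(missing))  # [1020]
```

A capture window that is too short could also make a VLAN look "missing", so the capture should span several RA intervals before drawing conclusions.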

I am setting this to Open from Stalled as we can probably proceed without this being a blocker anymore

akosiaris changed the task status from Open to Stalled.Mar 22 2017, 11:43 AM

Resetting to stalled status for 24 hours after some discussions on IRC

So the situation is very unclear and quite messy, which is why I wanted to try this (thanks @akosiaris for tackling this):

  • We previously heard that 14.1X53-D42.3 would include the fix.
  • After that, we heard from ATAC that "At this point, the fix has been moved out of 14.1X53-D42 as the code dependency issue still persists and Engineering is looking at 14.1X53-D50 to have the patch ported to. I will update once Engineering has added the fix to a main Junos release." and no ETA for 14.1X53-D50.
  • PR1205416 doesn't list a 14.1 release under "Resolved In", only "15.1X53-D51 15.1X53-D55 15.1R6 16.1R4"
  • 14.1X53-D42.3's release notes do list PR1205416 under "issues fixed".
  • ATAC on the phone said that they see PR1205416 as resolved in both 14.1X53-D42 and 14.1X53-D50.
  • None of 15.1X53-D51 15.1X53-D55 15.1R6 16.1R4 have been released yet.

I just got off the phone with JTAC, as my previous emails have been ignored for over a month now :/ The agent (Praveen) promised to get back to me within 24 hours with specifically which releases fix the issue and when those are scheduled to be released. There is a chance this may help clear up the situation, so let's wait for that. (Separately, I asked for and got an escalation contact to complain about the very poor support we've been getting.)

I haven't heard back, but I noticed PR1205416 now says:

Resolved In: 14.1X53-D42 15.1X53-D51 15.1X53-D55 15.1R6 16.1R4

So I think this may be resolved in a version that is available now \o/

Let's enable IGMP snooping across all VLANs and make a note to upgrade JunOS on all other switches at some point in the future.

Mentioned in SAL (#wikimedia-operations) [2017-03-28T08:29:05Z] <akosiaris> enable IGMP snooping on all VLANs on asw2-d-eqiad. T133387

Done. I've deleted the vlan default entry as well as the private1-d-eqiad entry (manually added by me) and added vlan all.

show protocols igmp-snooping 
vlan all;

ICMPv6 RAs still flow as expected through the interfaces.

@faidon: Should I schedule an upgrade of all the codfw switches? After the switchover, in order to avoid adding a potential unknown? Or this week, allowing any bugs to surface in the 3 weeks before the failover?

akosiaris changed the task status from Stalled to Open.Mar 28 2017, 8:55 AM

We've lived with this bug in codfw for so long that I'd say let it be as-is until we're done with the switchover, and postpone the upgrade until May or later.

The latest from Juniper:

Faidon,

I just got more information on this Case.

The current PR tracking this issue is 1238906. Which is in testing process for the next release, D43.
The D42 solution was retracted by Engineering due to failures in other dependencies.

I’m trying to (1) get confirmation on when will D43 be released and (2) when will we have confirmation that this fix is fully tested for D43.

I understand that might is not be the answer you want to hear, but I’m trying to be as open as possible for you to make the most informed decision.

@akosiaris said that he tested the D42 release and it did fix the issue, so I'm a bit puzzled here. @ayounsi, this may or may not affect the plans around row D/asw2-d-eqiad. Let's discuss tomorrow.

Partially fixed indeed, as noted in T133387#3121483, which might be a symptom of the issues that led to the retraction.

This just got in:

The fix for PR 1238906 has implemented in the main Junos release and would be available 14.1X53-D43 onwards. This release is tentatively scheduled to release on 12 May. With the fix, the QFX5100 forwards the RA from the server even when igmp-snooping is configured on the host vlan. I have also verified the fix in the lab.

On a side note, not relying on IPv6 RA, and using static routes/IPs (see T102099) on at least the nodes that use IGMP snooping would work around this issue.

I'm not entirely sure if this would affect only the RAs or also NDP in general. We probably shouldn't play with our luck and just take the hit of multicast floods until we get rid of multicast or until we upgrade to a newer JunOS…

14.1X53-D43 seems to have been released on May 11th. This particular PR isn't mentioned on the release notes, so the fix may or may not be included. I've dropped ATAC a note to confirm.

After another round with ATAC, this is the latest:

PR 1238906 is the original PR for this issue and it was raised by me. This is fixed starting 14.1X53-D43 onwards and I have verified it in my lab as well.

During the investigation, PR 1238906 was considered to be a duplicate of PR 1205416 by Engineering and hence, we had to wait till PR 1205416 was resolved. Since the resolution of PR 1205416 didn’t fix our issue, PR 1238906 was no longer marked a duplicate of this PR and was being handled separately again. So fix of PR 1205416 in 14.1X53-42 would not resolve our issue.

Now, about the release notes on PR 1238906, I can only publish the content in the release notes section. However, the “Resolved In” template in the external link is automated by a software and does not always list all the fixed versions. I have verified again today that the fix has been committed to 14.1X530-D43 for PR 1238906.

I think the next step would be to try 14.1X53-D43. I think we can start with asw-esams? We'll need to confirm that the issue is present there first, as the setup is a little different from that in our other sites.

Mentioned in SAL (#wikimedia-operations) [2017-05-24T08:49:35Z] <akosiaris> depool cp3036 for T133387 testing

Change 355395 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Drain esams for network maintenance

https://gerrit.wikimedia.org/r/355395

Change 355395 merged by Alexandros Kosiaris:
[operations/dns@master] Drain esams for network maintenance

https://gerrit.wikimedia.org/r/355395

Mentioned in SAL (#wikimedia-operations) [2017-05-24T09:10:46Z] <akosiaris> drain esams for network tests for T133387

Mentioned in SAL (#wikimedia-operations) [2017-05-24T09:53:26Z] <XioNoX> rebooting asw-esams for upgrade (T133387)

Mentioned in SAL (#wikimedia-operations) [2017-05-24T10:05:38Z] <volans> forcing puppet run on failed hosts only in esams T133387

A tcpdump -vvvv -ttt -i eth0 icmp6 and 'ip6[40] = 134' on cp3036 shows RAs still being received by the box with igmp-snooping enabled on all VLANs after the upgrade. Before the upgrade that would not work. Same thing on lvs3003, which has a trunk interface on the switch (that previously did not work, see T133387#3121483). It seems the bug is resolved.
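For anyone wondering about the filter: the fixed IPv6 header is 40 bytes long, so ip6[40] is the first byte of the payload, which for ICMPv6 is the message type, and type 134 is Router Advertisement. The sketch below reproduces that check on a hand-built packet. It assumes no IPv6 extension headers, leaves the ICMPv6 checksum at zero, and the helper names `build_icmp6_packet` and `is_router_advertisement` are made up for this example.

```python
import ipaddress
import struct

IPV6_HEADER_LEN = 40              # fixed IPv6 header size, hence the offset in ip6[40]
ICMPV6_NEXT_HEADER = 58           # IPv6 next-header value for ICMPv6
ICMPV6_ROUTER_ADVERTISEMENT = 134


def build_icmp6_packet(icmp_type: int) -> bytes:
    """Build a minimal IPv6 + ICMPv6 packet from fe80::1 to ff02::1.

    The checksum is left at zero, which is enough to illustrate byte offsets
    but would not be accepted by a real IPv6 stack.
    """
    payload = struct.pack("!BBH", icmp_type, 0, 0) + bytes(12)
    header = struct.pack("!IHBB", 0x60000000, len(payload), ICMPV6_NEXT_HEADER, 255)
    src = ipaddress.IPv6Address("fe80::1").packed
    dst = ipaddress.IPv6Address("ff02::1").packed
    return header + src + dst + payload


def is_router_advertisement(packet: bytes) -> bool:
    """Equivalent of the tcpdump filter: icmp6 and ip6[40] = 134."""
    if len(packet) <= IPV6_HEADER_LEN:
        return False
    is_ipv6 = packet[0] >> 4 == 6
    is_icmp6 = packet[6] == ICMPV6_NEXT_HEADER
    return is_ipv6 and is_icmp6 and packet[IPV6_HEADER_LEN] == ICMPV6_ROUTER_ADVERTISEMENT


print(is_router_advertisement(build_icmp6_packet(134)))  # True
print(is_router_advertisement(build_icmp6_packet(135)))  # False (neighbor solicitation)
```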

Change 355413 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Revert "Drain esams for network maintenance"

https://gerrit.wikimedia.org/r/355413

Change 355413 merged by Alexandros Kosiaris:
[operations/dns@master] Revert "Drain esams for network maintenance"

https://gerrit.wikimedia.org/r/355413

@ayounsi, what's the current status of this task? Last update is from over a year ago, but I think some of our latest woes with asw2-b-eqiad are very much interrelated to this?

No real update since a year ago. All switch stacks have been upgraded to a version that doesn't have this specific bug (14.1X53-D43.7) except asw2-d-eqiad (still on 14.1X53-D42.3, see T172459).
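To keep track of which stacks are still below the release ATAC named for PR 1238906 (14.1X53-D43 onwards), a small version check along these lines could be used. It is only a sketch: the parser handles only X-train strings of the form seen in this task (such as 14.1X53-D42.3), the names `parse_release` and `has_fix` are made up, and it says nothing about other bugs with similar symptoms.

```python
import re

# Matches strings such as "14.1X53-D42.3" or "14.1X53-D43".
JUNOS_X_RELEASE = re.compile(r"^(\d+\.\d+X\d+)-D(\d+)(?:\.(\d+))?$")


def parse_release(version):
    """Split an X-train version into (train, D-release number, spin number)."""
    match = JUNOS_X_RELEASE.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    train, d_release, spin = match.groups()
    return train, int(d_release), int(spin or 0)


def has_fix(version, first_fixed="14.1X53-D43"):
    """Return True if `version` is at or after `first_fixed` on the same train."""
    train, d_release, spin = parse_release(version)
    fixed_train, fixed_d, fixed_spin = parse_release(first_fixed)
    if train != fixed_train:
        raise ValueError(f"cannot compare {train} against {fixed_train}")
    return (d_release, spin) >= (fixed_d, fixed_spin)


# Versions as mentioned in this task.
STACKS = {
    "asw2-d-eqiad": "14.1X53-D42.3",
    "asw2-b-eqiad": "14.1X53-D46.7",
}

for name, version in sorted(STACKS.items()):
    status = "has the fix" if has_fix(version) else "still affected"
    print(f"{name}: {version} {status}")
```

Refusing to compare across trains is deliberate, since a release such as 15.1R4.6 was shown earlier in this task to still exhibit the bug despite its higher version number.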

I don't know for sure if it's the same issue as T201039, but it's definitely very similar.

asw2-b-eqiad is running 14.1X53-D46.7, and a possible fix is in 14.1X53-D47.
So it's either a different bug, or it got re-introduced a few versions later...

I just looked briefly at T172459 and it looks like the last update there was to attempt this during the switchover period which is obviously over :)

All those tasks are interrelated and overlapping in a way -- there's no point in keeping those (and separately!) open to do the upgrades in some indeterminate point in the future.

What are your thoughts on potential next steps, and estimates on when they can/should happen, if at all?

As this was the troubleshooting task and follow-up tasks have been opened for the software upgrades themselves, it's fine to close this one.

DBAs don't have enough cycles to help with any eqiad network maintenance this quarter (as moving DB masters is very time consuming), so those will be moved to at least Q1 and might need to be set as a DBA goal as well (maybe with some hardware refresh?)

The lack of igmp-snooping is not a blocker, but something that would be great to fix if the downside (row downtime) is not too heavy.

Also, we're not sure yet that the most recent Junos version (running on asw2-a) fixes the issue.
As the symptoms are sporadic, this cannot be tested with only a few hosts, so taking a downtime to upgrade a switch stack without fixing the issue would mean having to take another downtime.

The current eqiad tl;dr is:
row A: waiting to have hosts moved from old to new switch stack, on proper VCF topology, igmp-snooping enabled
row B: hosts on new switch stack (except cloud to be moved shortly), on proper VCF topology, igmp-snooping disabled
row C: hosts on new switch stack, need VCF recabling, igmp-snooping disabled
row D: hosts on new switch stack, need VCF recabling and QFX addition, igmp-snooping disabled

So a potential order would be:
1/ Move hosts to the new row A stack to see if its Junos version indeed solves the igmp-snooping issue (few sec downtime per host)
2/ Decide if we want to upgrade the other rows
3/ If so, do the upgrade at the same time as the re-cabling for C/D where needed