
Investigate Juniper storm control
Closed, Resolved (Public)

Description

Juniper's Storm Control feature could be useful to mitigate outages like the one we had on Thursday 13th where a loop on a leaf switch impacted the spine (msw1-eqiad) and thus the whole management network.

Storm control should alert on and/or shut down a port if a large amount of inbound BUM (broadcast, unknown unicast, multicast) traffic is detected.

The goal here would be to investigate:

  • If that's indeed how storm control works
  • If the default values are good enough for us or should be tuned
  • What the best notification strategy should be
  • Deploying the configuration
  • Eventually testing it

https://www.juniper.net/documentation/en_US/junos/topics/concept/rate-limiting-storm-control-understanding.html

Event Timeline

ayounsi triaged this task as Medium priority. Feb 13 2020, 7:06 PM
ayounsi created this task.
ayounsi updated the task description.

Readings from LibreNMS on msw1-eqiad:

  • Normal operation: total traffic on ge-0/0/14, which is connected to msw-b2-eqiad

IN = 96.11MB
OUT = 40.92MB

  • During the outage: 2020-02-13 00:01 to 2020-02-13 23:59

IN = 381.25MB
OUT = 643.46MB

When the switch is operating normally, the total traffic on ge-0/0/14 (msw-b2-eqiad) is about 97MB IN and 50MB OUT.
If storm control had been set with the default value of 15000Kbps, it should have prevented the outage, but we need to test to confirm that.

Note: storm control is not set up on msw1-eqiad.

Next step is to create a single profile named "wmf-mgmt-storm", configure a storm control bandwidth of 15,000Kbps and add all the interfaces except the interface connected to mr1 (ge-0/0/32) for both sites.

@ayounsi please see the configuration below. I have only added the first interface for now. If it all looks good, I will add the other interfaces.

[edit interfaces ge-0/0/0 unit 0 family ethernet-switching]
+       storm-control wmf-mgmt-storm;
[edit]
+  forwarding-options {
+      storm-control-profiles wmf-mgmt-storm {
+          all {
+              bandwidth-level 15000;
+          }
+      }
+  }
papaul@msw1-codfw# show storm-control-profiles wmf-mgmt-storm
all {
    bandwidth-level 15000;
}

Looks good! Instead of manually applying the profile to each interface I think we should refactor them and use interface-range like we do on access switches.
That range could be named access-switches for example.

Only the interface to mr1 would remain on its own without the storm control config.

I created the interface range mgmt-switches, added interfaces ge-0/0/0 to ge-0/0/31 to it, and bound the storm control profile wmf-mgmt-storm to it.

papaul@msw1-codfw# show | compare
[edit interfaces]
    interface-range disabled { ... }
+   interface-range mgmt-switches {
+       member-range ge-0/0/0 to ge-0/0/31;
+       description storm_control_interface;
+       unit 0 {
+           family ethernet-switching {
+               storm-control wmf-mgmt-storm;
+           }
+       }
+   }

Looking good!

The description storm_control_interface; is not needed here, as the individual interface descriptions take priority.
Now that we have the interface-range, we can clean up the redundant configuration, which means removing the interfaces xxx unit 0 family ethernet-switching stanzas of all interfaces covered by the two existing interface-ranges (yours and the existing disabled one). Make sure to use commit confirmed.
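For reference, the cleanup could be sketched roughly as below, in configuration mode. This is an illustration only: ge-0/0/0 stands in for each member interface, the 5-minute rollback window is an assumed value, and deleting at the unit 0 level (rather than only the family stanza) is an assumption about how the existing per-interface config is laid out.

```
# Hypothetical example for one member interface; repeat for every
# interface covered by the mgmt-switches and disabled ranges.
delete interfaces ge-0/0/0 unit 0

# Review the pending change before committing.
show | compare

# Commit with automatic rollback if not confirmed within 5 minutes
# (5 is an assumed value; the Junos default is 10).
commit confirmed 5

# Once the switch is still reachable and ports are forwarding, confirm.
commit
```

Using commit confirmed matters here because a mistake in the range membership could cut off management access to the switch itself.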

Once the configuration looks good, we can:

  • Check the logs to confirm that nothing triggers storm-control
  • Add the action-shutdown
  • (ideally) test it.
  • Removed the interfaces xxx unit 0 family ethernet-switching stanzas of all interfaces covered by the two existing interface-ranges (mgmt-switches and the existing disabled one)
  • Committed using commit confirmed
  • The only interfaces left with their own configuration:
ge-0/0/32 {
    description "mr1-codfw:ge-0/0/0 {#10710} [1Gbps Cu]";
    unit 0 {
        family ethernet-switching;
    }
}
ge-0/0/33 {
    description msw1-codfw:vme;
    unit 0 {
        family ethernet-switching;
    }
}

LGTM!

I forgot one step: write doc :)

  • Documentation in place
  • Add action-shutdown
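A minimal sketch of what adding the shutdown action to the existing profile might look like, assuming the same ELS-style syntax as the profile shown earlier in this task. The profile name comes from this task; the 5-minute confirm window is an assumed value.

```
# Make the profile shut the port down when the storm-control level is
# exceeded, instead of only dropping the excess traffic.
set forwarding-options storm-control-profiles wmf-mgmt-storm action-shutdown

# Review, then commit with automatic rollback as a safety net.
show | compare
commit confirmed 5
commit
```

Because the profile is bound through the mgmt-switches interface-range, this single statement should apply to every member interface without touching the per-interface configuration.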

@ayounsi for the restore process when an interface is shut down, do you want us to set up a recovery timeout or restore the interface manually?
If we are going to use the recovery timeout, what should the timeout be?

Thanks. Manual action is better here to prevent flapping.
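With manual recovery, bringing a port back after storm control has shut it down could look roughly like this. This is a sketch under assumptions: ge-0/0/5 is a hypothetical interface name, and the clear command is the one I would expect on an ELS-style Junos switch, so verify it against the running Junos version before relying on it.

```
# Operational mode. Look for storm-control events in the system log.
show log messages | match storm

# After the loop or offending device has been dealt with, restore the
# port (hypothetical interface name).
clear ethernet-switching recovery-timeout interface ge-0/0/5

# Confirm the port is back up.
show interfaces ge-0/0/5 terse
```

Leaving recovery-timeout unset means a port stays down until someone has looked at the cause, which is what avoids the flapping mentioned above.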

If all good, change the alert target so it notifies the whole of SRE

This is done too. And I added the alert to https://wikitech.wikimedia.org/wiki/Network_monitoring#Storm_control_in_effect

Do you want to configure msw1-eqiad as well (different syntax, as it's an older switch) or wait for its replacement with T225121 ?

I think it is better to do it when the new msw1 is in place. No need to do it now on the old msw1-eqiad

ayounsi changed the task status from Open to Stalled. Apr 30 2020, 3:26 PM

Stalling the task until either:

  • we can start doing more intrusive testing to see if it works as expected, or
  • msw1-eqiad is replaced with T225121

msw1-eqiad is replaced with T225121

This is done, and is running the same storm-control config as codfw.

Papaul subscribed.
ayounsi claimed this task.

All done here. I don't think it's worth doing more intrusive testing.