
Spike of multicast traffic
Open, HighPublic



Not sure this is the cause, but it fits very well.

Icinga then alerted on HTTP availability for Varnish in ulsfo, esams, etc.

Some notes:

  • This only looks at the incident from a network perspective; a look at the application layer would also be useful.
  • Why do other switch-facing ports on cr1-codfw see a spike of *inbound* multicast? If the source was asw-a-codfw, they should at least see some inbound
  • The routers tried to mitigate (rate-limit) the multicast traffic: "DDOS_PROTOCOL_VIOLATION_SET: Protocol resolve:mcast-v4 is violated at fpc 0 for 717 times, started at 2018-12-18 16:11:41 UTC"
  • Why didn't this issue happen during the previous recabling?
  • There are no logs mentioning a storm on asw-a-codfw
  • This shows that if the wrong conditions are met, this kind of storm can impact the whole infrastructure
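For context, the DDOS_PROTOCOL_VIOLATION_SET log above comes from Junos' built-in DDoS protection. The per-protocol state and violation counters can be inspected from the CLI (a sketch; exact packet-type names and output vary by platform and Junos version):

```
> show ddos-protection protocols resolve mcast-v4
> show ddos-protection protocols resolve mcast-v4 statistics
> show ddos-protection protocols violations
```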

Event Timeline

ayounsi triaged this task as High priority.Dec 19 2018, 2:15 AM
ayounsi created this task.
Restricted Application added a project: Operations. Dec 19 2018, 2:15 AM
Restricted Application added a subscriber: Aklapper.
elukey added a subscriber: elukey.Dec 19 2018, 10:25 AM

My guess so far is that the recabling triggered a bug in Junos VCF, which caused a multicast storm that was propagated to all listeners, filling up links and exhausting router resources.
After some research, there are some knobs we can tune:
1/ Prevent router resource exhaustion by applying stricter DDoS-protection thresholds
2/ Prevent link saturation by applying multicast bandwidth maximums or a firewall policer on key interfaces (access-core and core-core links).
Eyeballing the graphs, 300Mbps seems like a conservative value to start with; fine-tuning it later would be ideal.
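A rough sketch of what both knobs could look like in Junos set-style configuration. All values and interface names below are placeholders for illustration, not tested values; the exact hierarchy differs between MX and QFX:

```
# 1/ Stricter DDoS-protection thresholds (pps) for IPv4 multicast resolve traffic
set system ddos-protection protocols resolve mcast-v4 bandwidth 2000
set system ddos-protection protocols resolve mcast-v4 burst 2000

# 2/ Policer capping multicast at ~300Mbps, applied via a filter on a core-facing interface
set firewall policer MCAST-LIMIT if-exceeding bandwidth-limit 300m burst-size-limit 1m
set firewall policer MCAST-LIMIT then discard
set firewall family inet filter MCAST-FILTER term mcast from destination-address 224.0.0.0/4
set firewall family inet filter MCAST-FILTER term mcast then policer MCAST-LIMIT
set firewall family inet filter MCAST-FILTER term default then accept
set interfaces xe-0/0/0 unit 0 family inet filter input MCAST-FILTER
```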

For reference, here is the spike seen on the primary eqiad-codfw link:

As the packets are small (we don't do video), it would be ideal to rate-limit pps instead of bps, but I can't find any option for that.
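(For what it's worth, some newer Junos releases on MX are supposed to support pps-based policers via `if-exceeding-pps`; whether that exists on our platforms and versions would need checking. A sketch, assuming that support, with a placeholder limit:)

```
set firewall policer MCAST-PPS-LIMIT if-exceeding-pps pps-limit 50000 packet-burst 10000
set firewall policer MCAST-PPS-LIMIT then discard
```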

Ideally this would be tested in a lab first, though.