
Spike of multicast traffic
Open, High, Public

Description

Timeline:

Not sure this is the cause, but it fits very well.

Then Icinga alerts for HTTP availability for Varnish in ulsfo, esams, etc.

Some notes:

  • This only looks at the incident from a network perspective; a separate look at the app layer would be useful.
  • Why do other switch-facing ports on cr1-codfw see a spike of *inbound* multicast? If the source was asw-a-codfw, they should at least see some inbound.
  • Routers tried to mitigate (rate limit) the multicast traffic: DDOS_PROTOCOL_VIOLATION_SET: Protocol resolve:mcast-v4 is violated at fpc 0 for 717 times, started at 2018-12-18 16:11:41 UTC
  • Why didn't this issue happen during the previous recabling?
  • There are no logs mentioning a storm on asw-a-codfw
  • This shows that if the wrong conditions are met, this could impact the whole infrastructure
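
To compare violation windows across routers, the DDOS_PROTOCOL_VIOLATION_SET message quoted above could be parsed with something like the sketch below. The regex is derived only from that single log line, so the field layout is an assumption that may not hold for other Junos versions, and `parse_ddos_violation` is a hypothetical helper name.

```python
import re
from datetime import datetime, timezone

# Pattern built from the one log line quoted in this task (assumption:
# other routers emit the same layout).
VIOLATION_RE = re.compile(
    r"DDOS_PROTOCOL_VIOLATION_SET: Protocol (?P<protocol>\S+) is violated "
    r"at fpc (?P<fpc>\d+) for (?P<count>\d+) times, "
    r"started at (?P<started>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC"
)


def parse_ddos_violation(line):
    """Return the violation fields as a dict, or None if the line does not match."""
    match = VIOLATION_RE.search(line)
    if match is None:
        return None
    started = datetime.strptime(match["started"], "%Y-%m-%d %H:%M:%S")
    return {
        "protocol": match["protocol"],
        "fpc": int(match["fpc"]),
        "count": int(match["count"]),
        # The message states UTC explicitly, so attach that timezone.
        "started": started.replace(tzinfo=timezone.utc),
    }


sample = (
    "DDOS_PROTOCOL_VIOLATION_SET: Protocol resolve:mcast-v4 is violated "
    "at fpc 0 for 717 times, started at 2018-12-18 16:11:41 UTC"
)
print(parse_ddos_violation(sample))
```

Sorting the parsed `started` values from each router would show which device hit its limit first, which might help narrow down the source.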

Event Timeline

ayounsi triaged this task as High priority. Dec 19 2018, 2:15 AM
ayounsi created this task.
Restricted Application added a project: Operations. Dec 19 2018, 2:15 AM
Restricted Application added a subscriber: Aklapper.
elukey added a subscriber: elukey. Dec 19 2018, 10:25 AM

My guess so far is that the recabling triggered a bug in Junos VCF, which caused a multicast storm that propagated to all listeners, filling up links and exhausting resources.
After some research, there are some knobs we can tune:
1/ Prevent router resource exhaustion by applying stricter DDoS thresholds.
2/ Prevent link saturation by applying multicast bandwidth maximums or a firewall policer on key interfaces (access/core and core-core links).
Eyeballing the graphs, 300 Mbps seems like a conservative value to use; fine-tuning it would be ideal.

For reference here is the spike seen on the primary eqiad-codfw link:
pps: https://librenms.wikimedia.org/graphs/id=8197/type=port_nupkts/to=1545157800/from=1545145200/
bps: https://librenms.wikimedia.org/graphs/id=8197/type=port_bits/to=1545157800/from=1545145200/

As the packets are small (we don't do video), it would be ideal to rate limit pps instead of bps, but I can't find any option for that.
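
To put rough numbers on the pps versus bps point, the sketch below converts a bandwidth cap into the packet rate it still allows at a few packet sizes. The 300 Mbps figure is the value mentioned above; the packet sizes are illustrative assumptions, not measurements from the incident, and `pps_at_rate` is a hypothetical helper.

```python
def pps_at_rate(limit_bps, packet_bytes):
    """Packets per second that fit under a bits-per-second cap at a given packet size."""
    return limit_bps / (packet_bytes * 8)


LIMIT_BPS = 300_000_000  # the 300 Mbps value discussed above

# Packet sizes are assumptions chosen for illustration only.
for size in (64, 200, 1500):
    print(f"{size:>5} B packets: {pps_at_rate(LIMIT_BPS, size):>10,.0f} pps")
```

With small packets the same bps cap lets through many times more packets per second than with full-size ones, which is why a bps limit alone may not protect the routers' packet-processing resources.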

This would ideally need a lab to be tested though.