- 16:02 UTC - fpc2 qsfp-2/0/51 plugged in (from step Add: fpc2-fpc4, fpc5-fpc7 of T210447)
- Network logs around that time: https://logstash.wikimedia.org/app/kibana#/discover/40fc82e0-3c2f-11e8-a135-33a646d5ec16?_g=h@360b798&_a=h@898258f
Not certain this is the cause, but the timing fits very well.
- 16:05 UTC - Spikes of multicast traffic showed up in many parts of the infrastructure, e.g.:
- cr1-codfw:ae1 inbound from asw-a-codfw: https://librenms.wikimedia.org/graphs/to=1545150600/id=8310/type=port_nupkts/from=1545148800/
  - asw2-b-eqiad:ae1 inbound from cr1-eqiad: https://librenms.wikimedia.org/graphs/to=1545150600/id=14626/type=port_bits/from=1545148800/
- cr1-codfw:ae2 inbound from asw-b-codfw: https://librenms.wikimedia.org/graphs/to=1545150600/id=8311/type=port_nupkts/from=1545148800/
- This caused links to saturate and routing protocols to fail over
- 16:12 UTC - first signs of issues reported via IRC ("Phab seems down") and 503s: "Request from xxx via cp3042 cp3042, Varnish XID 170596440 Error: 503, Backend fetch failed at Tue, 18 Dec 2018 16:11:45 GMT"
Then Icinga alerts fired for HTTP availability for Varnish in ulsfo, esams, etc.
- 16:12 UTC - start of purge traffic spike: https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?panelId=6&fullscreen&orgId=1&from=1545148800000&to=1545156000000&var-site=All&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5
- 16:15 UTC - end of the multicast spikes
- 16:25 UTC - first signs of recovery
- 16:50 UTC - end of the purge traffic spike, most likely the time needed for the purge buffers to drain
- 16:50 UTC - Full recovery
- This is only looking at it from a network perspective; a closer look at the application layer would be useful.
- Why did other switch-facing ports on cr1-codfw see a spike of *inbound* multicast? If the source was asw-a-codfw, those ports should mostly have seen it outbound
- The routers tried to mitigate (rate-limit) the multicast traffic: `DDOS_PROTOCOL_VIOLATION_SET: Protocol resolve:mcast-v4 is violated at fpc 0 for 717 times, started at 2018-12-18 16:11:41 UTC`
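For correlating these violations across routers and FPCs during the investigation, the syslog line above can be parsed mechanically. A small sketch (the field names and the `parse_violation` helper are ours; only the message format comes from the log quoted above):

```python
import re

# The violation line as observed in syslog (quoted in the timeline above)
LOG = ("DDOS_PROTOCOL_VIOLATION_SET: Protocol resolve:mcast-v4 is violated "
       "at fpc 0 for 717 times, started at 2018-12-18 16:11:41 UTC")

# Field names (group, packet, fpc, count, start) are our own naming choice
PATTERN = re.compile(
    r"DDOS_PROTOCOL_VIOLATION_SET: Protocol (?P<group>\S+):(?P<packet>\S+) "
    r"is violated at fpc (?P<fpc>\d+) for (?P<count>\d+) times, "
    r"started at (?P<start>.+)"
)

def parse_violation(line):
    """Return a dict of the violation fields, or None if the line doesn't match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    d = m.groupdict()
    d["fpc"] = int(d["fpc"])
    d["count"] = int(d["count"])
    return d

print(parse_violation(LOG))
```

Feeding all router syslogs through something like this would show whether other FPCs or protocol groups were also in violation during the window.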
- Why didn't this issue happen during the previous recabling?
- There are no logs mentioning a storm on asw-a-codfw
- This shows that, when the wrong conditions are met, an event like this can impact the whole infrastructure
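One possible future mitigation: a storm-control profile on the access switches would cap multicast/broadcast floods at the first hop, before they can saturate uplinks and reach the routers. A minimal, hypothetical sketch in Junos ELS syntax (the profile name, 10 Mbps cap, and interface are placeholders, not our actual configuration):

```
# Hypothetical storm-control sketch -- profile name, bandwidth cap and
# interface are assumptions, not the production config
set forwarding-options storm-control-profiles limit-floods all bandwidth-level 10000
set interfaces ge-0/0/0 unit 0 family ethernet-switching storm-control limit-floods
```

With such a profile in place, a storm like this one would be dropped (or the offending interface disabled, if so configured) on the access switch instead of propagating across the fabric.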