
Investigate network issues in codfw that caused 503 errors
Closed, Resolved (Public)

Description

Network issues in codfw caused 503 errors between 07:51 and 07:57. The errors affected codfw and the DCs routed via codfw (ulsfo and eqsin), but not esams and eqiad.

Other than the HTTP availability alerts based on the dashboard linked above, the only error reported by Icinga on IRC was:

07:52 <+icinga-wm> PROBLEM - Host alcyone is DOWN: PING CRITICAL - Packet loss = 100%
07:52 <+icinga-wm> RECOVERY - Host alcyone is UP: PING OK - Packet loss = 0%, RTA = 36.33 ms

We suspect network troubles, as @Joe noticed multiple connectivity-related criticals for codfw on the Icinga web interface.

According to @elukey, cr2-codfw logged the following:

Nov  9 07:49:34  re0.cr2-codfw rpd[2428]: RPD_OSPF_NBRDOWN: OSPF neighbor 208.80.153.205 (realm ospf-v2 xe-5/0/1.0 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)

and cr2-eqiad the following:

elukey@re0.cr2-eqiad> show log messages | match xe-3/2/3
Nov  9 07:49:34  re0.cr2-eqiad rpd[2126]: RPD_OSPF_NBRDOWN: OSPF neighbor 208.80.153.204 (realm ospf-v2 xe-3/2/3.0 area 0.0.0.0) state changed from Full to Init due to 1WayRcvd (event reason: neighbor is in one-way mode)
Nov  9 07:49:34  re0.cr2-eqiad bfdd[1686]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 1432, new state: down, interface: xe-3/2/3.0, peer addr: 208.80.153.204
Nov  9 07:49:34  re0.cr2-eqiad rpd[2126]: RPD_PIM_NBRDOWN: Instance PIM.master: PIM neighbor 208.80.153.204 (xe-3/2/3.0) removed due to: bfd state down
Nov  9 07:49:34  re0.cr2-eqiad rpd[2126]: RPD_OSPF_NBRUP: OSPF neighbor 208.80.153.204 (realm ospf-v2 xe-3/2/3.0 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Nov  9 07:49:34  re0.cr2-eqiad rpd[2126]: RPD_OSPF_NBRUP: OSPF neighbor 208.80.153.204 (realm ospf-v2 xe-3/2/3.0 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)
Nov  9 07:49:36  re0.cr2-eqiad rpd[2126]: RPD_PIM_NBRUP: Instance PIM.master: PIM new neighbor 208.80.153.204 interface xe-3/2/3.0
Nov  9 07:50:37  re0.cr2-eqiad bfdd[1686]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 1432, new state: up, interface: xe-3/2/3.0, peer addr: 208.80.153.204
Nov  9 07:51:10  re0.cr2-eqiad rpd[2126]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 198.35.26.193 (External AS 65004) changed state from OpenConfirm to Established (event RecvKeepAlive) (instance master)

Telia performed maintenance this morning on IC-307235, possibly overlapping with Equinix work on Internet Exchange 17915019.


Event Timeline

ema triaged this task as Medium priority. Nov 9 2018, 12:35 PM
elukey updated the task description.
Aklapper renamed this task from "Investiagate network issues in codfw that caused 503 errors" to "Investigate network issues in codfw that caused 503 errors". Nov 9 2018, 12:59 PM

Not answering the question but adding data points.

We periodically see similar link flaps or cuts, and this one doesn't seem different from the others.

Looking at logs in Kibana, the link flap lasted less than a second at 07:49:34 (still long enough for BFD/OSPF to notice it), which most likely triggered path recalculations.
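For context on why a sub-second flap is enough: BFD hellos typically run every few hundred milliseconds with a multiplier of 3, so three missed packets declare the neighbor dead in well under a second and OSPF is torn down immediately. A sketch of the relevant Junos configuration, with illustrative timer values that are not necessarily what cr2-eqiad/cr2-codfw actually run:

protocols {
    ospf {
        area 0.0.0.0 {
            interface xe-3/2/3.0 {
                bfd-liveness-detection {
                    /* 300 ms x 3 missed packets ≈ 900 ms detection time */
                    minimum-interval 300;
                    multiplier 3;
                }
            }
        }
    }
}

The timers actually in use can be checked with "show bfd session detail" on either side.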

The BGP session between ulsfo and eqiad went down for almost a minute (07:50:36-07:51:10), maybe due to some packet loss during OSPF convergence. This shouldn't have impacted codfw/eqsin.
The BGP session between pfw3-eqiad and pfw3-codfw went down for ~30s.
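For reference, those windows can be cross-checked on the routers with the same log-matching approach used in the cr2-eqiad excerpt above (198.35.26.193 is assumed to be the ulsfo peer from that log line; the pfw3 peer address isn't listed in this task):

show log messages | match RPD_BGP_NEIGHBOR_STATE_CHANGED
show bgp summary | match 198.35.26.193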

The Equinix sessions going up/down are to networks too small to cause such spikes.

Smokeping didn't notice any packet loss or latency increase.

The spike of 503s actually starts after BGP and OSPF are stable again.

The only logs matching the time window are on cr1-eqiad:

%-DDOS_PROTOCOL_VIOLATION_SET: Protocol Rejectv6:aggregate is violated at fpc 3 for 330 times, started at 2018-11-09 07:52:07 UTC
%-DDOS_PROTOCOL_VIOLATION_CLEAR: Protocol Rejectv6:aggregate has returned to normal. Violated at fpc 3 for 330 times, from 2018-11-09 07:52:07 UTC to 2018-11-09 07:52:07 UTC

These would need to be investigated further, especially as they should have impacted esams as well if they were the cause of the errors; one way to dig deeper is sketched below.
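One way to dig deeper, assuming these are the usual MX routers with DDoS protection (jddosd) enabled as the log tags suggest, is to ask cr1-eqiad which protocol groups are currently in violation and to pull the violation history from the logs:

show ddos-protection protocols violations
show log messages | match DDOS_PROTOCOL_VIOLATION

Output formats vary by Junos release, so treat this as a starting point rather than a recipe.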

My guess, based on those clues, is that this link flap caused at least some traffic from eqiad to codfw to be blackholed, most likely for the time it took the protocols to re-converge properly.
The DDOS_PROTOCOL_VIOLATION Rejectv6:aggregate log means that traffic spiked toward a v6 aggregate prefix that no longer had a more-specific route. Most likely 2620:0:860::/48 (codfw) disappeared, leaving only the 2620:0:860::/46 aggregate, and legitimate traffic then triggered that log (see the sketch below).
If that's correct, there are some suggestions to make such events less impactful, such as T167306.
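To illustrate the aggregate/reject interaction described above, here is a hedged sketch (prefixes taken from this task; the actual cr1-eqiad configuration may well differ). Junos aggregate routes have a reject next-hop by default, so once the more-specific /48 is withdrawn, traffic falling back to the /46 is rejected, which is presumably what the Rejectv6 policer in the log above is counting:

routing-options {
    rib inet6.0 {
        aggregate {
            /* default next hop for an aggregate route is reject */
            route 2620:0:860::/46;
        }
    }
}

show route 2620:0:860::/48 exact
show route 2620:0:860::/46 exact detail

The first command checks whether the codfw more-specific is still present; the second should show the aggregate with a Reject next hop.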

To get more visibility on that, we would need a route collector recording each router's routing table for future analysis.
I don't think anything more can be done in the short term, especially given the rarity of the event (a link flap causing actual prod issues).